run Tika GUI via Nutch
Hello, we can run the Tika GUI when running the Nutch source in Eclipse, because we can launch the org.apache.tika.gui.TikaGUI class from a Run Configuration. But is there a way to run the Tika GUI from Nutch in application mode? I changed some Tika code and want to test whether Nutch still works correctly with these changes.
Re: run Tika GUI via Nutch
I solved the problem. On Tue, Nov 1, 2011 at 3:24 PM, Ahmad Ajiloo ahmad.aji...@gmail.com wrote: Hello, we can run the Tika GUI when running the Nutch source in Eclipse, because we can launch the org.apache.tika.gui.TikaGUI class from a Run Configuration. But is there a way to run the Tika GUI from Nutch in application mode? I changed some Tika code and want to test whether Nutch still works correctly with these changes.
[jira] [Created] (NUTCH-1187) Port NUTCH-1028 to nutchgora - log parser keys
Port NUTCH-1028 to nutchgora - log parser keys -- Key: NUTCH-1187 URL: https://issues.apache.org/jira/browse/NUTCH-1187 Project: Nutch Issue Type: Sub-task Components: parser Reporter: Ferdy Priority: Trivial This task is to port NUTCH-1028 to nutchgora - log parser keys. Very trivial, will attach patch and commit right away. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1104) Port issues from 1.x to trunk
[ https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141101#comment-13141101 ]
Ferdy commented on NUTCH-1104:
Ok. Btw could you rename this issue to reflect the recent trunk to nutchgora branch move?
Port issues from 1.x to trunk
Key: NUTCH-1104 URL: https://issues.apache.org/jira/browse/NUTCH-1104 Project: Nutch Issue Type: Task Affects Versions: nutchgora Reporter: Markus Jelsma Fix For: nutchgora
A new issue to track issues that have not yet been ported from 1.x to trunk: NUTCH-987 NUTCH-1028 NUTCH-1036 NUTCH-1057 NUTCH-1067 NUTCH-1101 NUTCH-1102 NUTCH-1105 NUTCH-940 NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch
[ https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1104:
Description: Umbrella issue for tracking issues that should be ported from 1.x trunk to the NutchGora branch. Please mark ported issues by modifying this description.
NOT YET PORTED:
* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
PORTED:
* No issues yet
NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported
was: A new issue to track issues that have not yet been ported from 1.x to trunk: NUTCH-987 NUTCH-1028 NUTCH-1036 NUTCH-1057 NUTCH-1067 NUTCH-1101 NUTCH-1102 NUTCH-1105 NUTCH-940 NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
Summary: Port issues from trunk NutchGora branch (was: Port issues from 1.x to trunk)
Key: NUTCH-1104 URL: https://issues.apache.org/jira/browse/NUTCH-1104 Project: Nutch Issue Type: Task Affects Versions: nutchgora Reporter: Markus Jelsma Fix For: nutchgora
[jira] [Updated] (NUTCH-1187) Port NUTCH-1028 to nutchgora - log parser keys
[ https://issues.apache.org/jira/browse/NUTCH-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferdy updated NUTCH-1187:
Attachment: NUTCH-1187.patch
This patch logs the key for the parser. It uses INFO level and changes the surrounding DEBUG logs to INFO. This makes sure that a parse is logged exactly once in each of the four scenarios:
- Skipped because of different id.
- Skipped because already parsed.
- Forced parse of already parsed.
- Regular parsing.
Port NUTCH-1028 to nutchgora - log parser keys
Key: NUTCH-1187 URL: https://issues.apache.org/jira/browse/NUTCH-1187 Project: Nutch Issue Type: Sub-task Components: parser Reporter: Ferdy Priority: Trivial Fix For: nutchgora Attachments: NUTCH-1187.patch
This task is to port NUTCH-1028 to nutchgora - log parser keys. Very trivial, will attach patch and commit right away.
[jira] [Updated] (NUTCH-1187) Port NUTCH-1028 to nutchgora - log parser keys
[ https://issues.apache.org/jira/browse/NUTCH-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferdy updated NUTCH-1187:
Patch Info: Patch Available
Port NUTCH-1028 to nutchgora - log parser keys
Key: NUTCH-1187 URL: https://issues.apache.org/jira/browse/NUTCH-1187 Project: Nutch Issue Type: Sub-task Components: parser Reporter: Ferdy Priority: Trivial Fix For: nutchgora Attachments: NUTCH-1187.patch
This task is to port NUTCH-1028 to nutchgora - log parser keys. Very trivial, will attach patch and commit right away.
Setting properties in gora.properties
Hi, I'm currently trying to complete NUTCH-902 and GORA-39 and kill two birds with one stone; however, I've uprooted some more nasties which I'm now trying to address. When configuring Nutchgora with Cassandra I'm getting the following:
lewis@lewis-01:~/ASF/nutchgora/runtime/local$ bin/nutch inject urls crawldb
InjectorJob: starting
InjectorJob: urlDir: urls
InjectorJob: org.apache.gora.util.GoraException: java.io.IOException: java.io.IOException: Property with base name servers could not be found, make sure to include this property in gora.properties file
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
Caused by: java.io.IOException: java.io.IOException: Property with base name servers could not be found, make sure to include this property in gora.properties file
    at org.apache.gora.cassandra.store.CassandraStore.readMapping(CassandraStore.java:462)
    at org.apache.gora.cassandra.store.CassandraStore.initialize(CassandraStore.java:91)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:81)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:104)
    ... 7 more
Caused by: java.io.IOException: Property with base name servers could not be found, make sure to include this property in gora.properties file
    at org.apache.gora.store.DataStoreFactory.findPropertyOrDie(DataStoreFactory.java:254)
    at org.apache.gora.cassandra.store.CassandraStore.createClient(CassandraStore.java:394)
    at org.apache.gora.cassandra.store.CassandraStore.readMapping(CassandraStore.java:425)
    ... 10 more
Can someone please explain a bit about what kind of properties we can/should add to gora.properties for a Cassandra setup? I've tried editing gora.properties as follows with no luck:
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
servers=localhost/127.0.0.1:9160
If there are any resources people are aware of on the net then I'll begin getting my head around them. Thanks in advance, Lewis
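The "base name" wording in the exception hints that the lookup may not use the bare key `servers` but a longer, prefixed key. The following is only a sketch of that kind of lookup, written against plain `java.util.Properties`. It assumes, without confirmation from this thread, that a store-specific key of the form `gora.<storename>.<basename>` is tried first and a generic `gora.datastore.<basename>` second. The class `BaseNameLookup`, its methods and the sample values are hypothetical and are not Gora code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class BaseNameLookup {

    // Hypothetical helper: resolve a property by base name, trying a
    // store-specific key first and a generic key second (assumed scheme).
    static String findByBaseName(Properties props, String storeName, String baseName) {
        String specific = props.getProperty("gora." + storeName + "." + baseName);
        if (specific != null) {
            return specific;
        }
        return props.getProperty("gora.datastore." + baseName);
    }

    // Parse properties-file text held in a string.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties bare = parse("servers=localhost:9160\n");
        Properties prefixed = parse("gora.cassandrastore.servers=localhost:9160\n");
        // Under the assumed scheme the bare key is never consulted.
        System.out.println(findByBaseName(bare, "cassandrastore", "servers"));     // prints "null"
        System.out.println(findByBaseName(prefixed, "cassandrastore", "servers")); // prints "localhost:9160"
    }
}
```

If the real lookup works along these lines, a bare `servers=...` entry would be ignored, which would match the error reported above; the exact key names should be checked against the Gora sources.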
[jira] [Created] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
ERROR util.LogUtil - Cannot log with method [null]
Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special environment Reporter: Zhang JinYan
LogUtil has static fields, which are initialized like this:
FATAL = Logger.class.getMethod("error", new Class[] { Object.class });
but Logger has no such method; the correct method is:
void org.slf4j.Logger.error(String msg)
So LogUtil's static fields are not initialized correctly (they are null).
Run a crawl and you will find this message in hadoop.log:
2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
java.lang.NullPointerException
    at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
    at java.io.PrintStream.write(PrintStream.java:432)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
    at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
    at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
    at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
    at java.io.PrintStream.newLine(PrintStream.java:496)
    at java.io.PrintStream.println(PrintStream.java:757)
    at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
    at java.lang.Throwable.printStackTrace(Throwable.java:468)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
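The reflection behaviour behind this report can be reproduced without any logging library. `Class.getMethod` matches parameter types exactly, so asking for `error(Object)` on a type that only declares `error(String)` throws `NoSuchMethodException`. In the sketch below, `SketchLogger` is a hypothetical stand-in for the logger interface and `find` is an illustrative helper, not LogUtil's actual code; returning null on failure merely mirrors the null fields described in the report.

```java
import java.lang.reflect.Method;

class LogMethodLookup {

    // Hypothetical stand-in for a logger whose error method takes a String,
    // so the sketch runs without any logging library on the classpath.
    public interface SketchLogger {
        void error(String msg);
    }

    // Returns the public method with exactly this name and parameter type,
    // or null when no such method exists.
    static Method find(String name, Class<?> paramType) {
        try {
            return SketchLogger.class.getMethod(name, paramType);
        } catch (NoSuchMethodException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Object.class does not match error(String): the lookup fails.
        System.out.println(find("error", Object.class));         // prints "null"
        // String.class matches the declared signature.
        System.out.println(find("error", String.class) != null); // prints "true"
    }
}
```

Invoking a `Method` reference that is null then fails with a `NullPointerException`, which matches the exception type in the log excerpt above.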
[jira] [Updated] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhang JinYan updated NUTCH-1188:
Description: LogUtil has static fields, which are initialized like this:
FATAL = Logger.class.getMethod("error", new Class[] { Object.class });
but Logger has no such method; the correct method is:
void org.slf4j.Logger.error(String msg)
So LogUtil's static fields are not initialized correctly (they are null).
Run a crawl and you will find this message in hadoop.log:
2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
java.lang.NullPointerException
    at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
    at java.io.PrintStream.write(PrintStream.java:432)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
    at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
    at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
    at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
    at java.io.PrintStream.newLine(PrintStream.java:496)
    at java.io.PrintStream.println(PrintStream.java:757)
    at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
    at java.lang.Throwable.printStackTrace(Throwable.java:468)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
Patch:
FATAL = Logger.class.getMethod("error", new Class[] { String.class });
ERROR util.LogUtil - Cannot log with method [null]
Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special environment Reporter: Zhang JinYan
[jira] [Updated] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhang JinYan updated NUTCH-1188:
Patch Info: Patch Available
ERROR util.LogUtil - Cannot log with method [null]
Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special environment Reporter: Zhang JinYan
[jira] [Updated] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhang JinYan updated NUTCH-1188:
Attachment: LogUtil.patch
patch for the bug
ERROR util.LogUtil - Cannot log with method [null]
Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special environment Reporter: Zhang JinYan Attachments: LogUtil.patch
[jira] [Reopened] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
[ https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney reopened NUTCH-902:
Reopened as Cassandra configurations in ivy/ivy.xml are not complete.
Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
Key: NUTCH-902 URL: https://issues.apache.org/jira/browse/NUTCH-902 Project: Nutch Issue Type: New Feature Components: documentation, storage Affects Versions: nutchbase Reporter: Enis Soztutar Assignee: Lewis John McGibbney Fix For: nutchgora Attachments: NUTCH-902-v2.patch, NUTCH-902.patch
As per the discussion in the mailing list and http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the necessary files and configuration. I propose that we maintain configuration for at least SQL, HBase and Cassandra. The following changes are needed:
conf/gora-sql-mapping.xml
conf/gora-hbase-mapping.xml
conf/gora-cassandra-mapping.xml
comments on nutch-default and ivy.xml
Shall we also include jars from gora-hbase, gora-cassandra and their dependencies?
[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files
[ https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141238#comment-13141238 ]
Lewis John McGibbney commented on NUTCH-1189:
Ferdy, would it be possible for you to attach a patch for HBase (if required)? I will work on the Cassandra stuff, then hopefully we can knock our heads together with some others to get the remaining back ends included within the gora.properties file.
add commented out default settings to gora.properties files
Key: NUTCH-1189 URL: https://issues.apache.org/jira/browse/NUTCH-1189 Project: Nutch Issue Type: Sub-task Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora
This issue should have been dealt with as part of its parent issue; however, as it is a fairly large task in itself, I think it needs to be done independently. The gora.properties file should, amongst other settings and besides the extremely basic defaults for sqlstore, include defaults for opening HBase, Cassandra, etc. servers on their default ports. Leaving this down to individual interpretation puts a huge onus on the user, and so creates a barrier to entry for getting the configuration settings up and running.
[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141240#comment-13141240 ]
Lewis John McGibbney commented on NUTCH-1188:
Thank you for this patch. In the short term, when we get one other +1, I would like to commit. Can I ask you to have a look @ NUTCH-1138 and comment on whether the patch is any use for your activities. It is our vision to remove LogUtil and use the Slf4j/Log4j framework for all logging. Thank you very much for this patch.
ERROR util.LogUtil - Cannot log with method [null]
Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special environment Reporter: Zhang JinYan Attachments: LogUtil.patch
[jira] [Created] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.
MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.
Key: NUTCH-1190 URL: https://issues.apache.org/jira/browse/NUTCH-1190 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.4 Environment: jdk6 Reporter: Zhang JinYan
There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
The date formats can be diverse, so why not move them to an extra config file? I moved all the date formats from MoreIndexingFilter.java to a file named date-styles.txt, which is loaded at startup.
{code}
public void setConf(Configuration conf) {
  this.conf = conf;
  MIME = new MimeUtil(conf);
  URL res = conf.getResource("date-styles.txt");
  if (res == null) {
    LOG.error("Can't find resource: date-styles.txt");
  } else {
    try {
      List lines = FileUtils.readLines(new File(res.getFile()));
      for (int i = 0; i < lines.size(); i++) {
        String dateStyle = (String) lines.get(i);
        // Drop blank lines.
        if (StringUtils.isBlank(dateStyle)) {
          lines.remove(i);
          i--;
          continue;
        }
        dateStyle = StringUtils.trim(dateStyle);
        // Drop comment lines.
        if (dateStyle.startsWith("#")) {
          lines.remove(i);
          i--;
          continue;
        }
        lines.set(i, dateStyle);
      }
      dateStyles = new String[lines.size()];
      lines.toArray(dateStyles);
    } catch (IOException e) {
      LOG.error("Failed to load resource: date-styles.txt");
    }
  }
}
{code}
Then parse lastModified like this (sample):
{code}
private long getTime(String date, String url) {
  ..
  Date parsedDate = DateUtils.parseDate(date, dateStyles);
  time = parsedDate.getTime();
  ..
  return time;
}
{code}
This patch also contains the patch of [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140]. Find more details in the patch file.
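The idea in this issue, a list of date patterns kept outside the code and tried one after another, can be sketched with the JDK alone. The sketch below is not the attached patch: it swaps Commons Lang's `DateUtils.parseDate` for `java.text.SimpleDateFormat` so that it runs without extra libraries, and the class `DateStylesSketch`, its method names, the GMT default and the sample patterns are all assumptions made for illustration.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

class DateStylesSketch {

    // Keep only real patterns: trim each line, drop blank lines and '#' comments.
    static List<String> cleanStyles(List<String> lines) {
        List<String> styles = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            styles.add(trimmed);
        }
        return styles;
    }

    // Try each pattern in order; return the epoch milliseconds of the first
    // pattern that consumes the whole input, or -1 if none matches.
    static long parseTime(String date, List<String> styles) {
        for (String style : styles) {
            SimpleDateFormat format = new SimpleDateFormat(style, Locale.US);
            format.setTimeZone(TimeZone.getTimeZone("GMT")); // assumed default zone
            format.setLenient(false);
            ParsePosition position = new ParsePosition(0);
            Date parsed = format.parse(date, position);
            if (parsed != null && position.getIndex() == date.length()) {
                return parsed.getTime();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a date-styles.txt file.
        List<String> fileLines = Arrays.asList(
            "# one pattern per line",
            "",
            "  EEE, dd MMM yyyy HH:mm:ss zzz  ",
            "yyyy-MM-dd HH:mm:ss");
        List<String> styles = cleanStyles(fileLines);
        System.out.println(styles.size());                             // prints "2"
        System.out.println(parseTime("1970-01-02 00:00:00", styles));  // prints "86400000"
        System.out.println(parseTime("not a date", styles));           // prints "-1"
    }
}
```

Building a fresh list, rather than removing elements while iterating by index, avoids the `i--` bookkeeping in the loop quoted above; both approaches produce the same set of patterns.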
[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhang JinYan updated NUTCH-1190:
Attachment: date-styles.txt MoreIndexingFilter.patch
MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.
Key: NUTCH-1190 URL: https://issues.apache.org/jira/browse/NUTCH-1190 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.4 Environment: jdk6 Reporter: Zhang JinYan Attachments: MoreIndexingFilter.patch, date-styles.txt
[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang JinYan updated NUTCH-1190: Description: There many issues about missing date format: [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871] [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912] [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015] The data formats can be diverse, so why not move those data formats to a extra config file? I move all the data formats from MoreIndexingFilter.java to a file named date-styles.txt(place in conf), which will be load on startup. {code} public void setConf(Configuration conf) { this.conf = conf; MIME = new MimeUtil(conf); URL res = conf.getResource(date-styles.txt); if(res==null){ LOG.error(Can't find resource: date-styles.txt); }else{ try { List lines = FileUtils.readLines(new File(res.getFile())); for (int i = 0; i lines.size(); i++) { String dateStyle = (String) lines.get(i); if(StringUtils.isBlank(dateStyle)){ lines.remove(i); i--; continue; } dateStyle=StringUtils.trim(dateStyle); if(dateStyle.startsWith(#)){ lines.remove(i); i--; continue; } lines.set(i, dateStyle); } dateStyles = new String[lines.size()]; lines.toArray(dateStyles); } catch (IOException e) { LOG.error(Failed to load resource: date-styles.txt); } } } {code} Then parse lastModified like this(sample): {code} private long getTime(String date, String url) { .. Date parsedDate = DateUtils.parseDate(date, dateStyles); time = parsedDate.getTime(); .. return time; } {code} This path also contains the path of [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140]. Find more details in the patch file. 
was: The same description, except that it did not say where date-styles.txt is placed (the "(place in conf)" note is new). MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.
- Key: NUTCH-1190 URL: https://issues.apache.org/jira/browse/NUTCH-1190 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.4 Environment: jdk6 Reporter: Zhang JinYan Attachments: MoreIndexingFilter.patch, date-styles.txt There are many issues about missing date formats: [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871] [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912] [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015] The date formats can be diverse, so why not move them to an extra config file? I moved all the date formats from MoreIndexingFilter.java to a file named date-styles.txt (placed in conf), which is loaded on startup. {code} public void setConf(Configuration conf) { this.conf = conf; MIME = new MimeUtil(conf); URL res = conf.getResource("date-styles.txt");
[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141267#comment-13141267 ] Julien Nioche commented on NUTCH-1188: -- +1 to commit. See the corresponding class in branch nutchgora. Thanks ERROR util.LogUtil - Cannot log with method [null] -- Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special environment Reporter: Zhang JinYan Attachments: LogUtil.patch LogUtil has static fields, which are initialized like this: FATAL = Logger.class.getMethod("error", new Class[] { Object.class }); but the Logger has no such method; the correct method is: void org.slf4j.Logger.error(String msg) So LogUtil's static fields are not initialized correctly (they are null). --- Run a crawl and you will find this message in hadoop.log:
2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
java.lang.NullPointerException
 at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
 at java.io.PrintStream.write(PrintStream.java:432)
 at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
 at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
 at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
 at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
 at java.io.PrintStream.newLine(PrintStream.java:496)
 at java.io.PrintStream.println(PrintStream.java:757)
 at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
 at java.lang.Throwable.printStackTrace(Throwable.java:468)
 at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
Patch: FATAL = Logger.class.getMethod("error", new Class[] { String.class });
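The failure mode reported in this issue can be reproduced in isolation. The sketch below is an assumption-laden illustration, not LogUtil itself: MiniLogger is a hypothetical stand-in for org.slf4j.Logger that declares only error(String), and lookup is a hypothetical helper that swallows the lookup failure and returns null, which is presumably how the static fields ended up null.

```java
import java.lang.reflect.Method;

// Hypothetical demonstration class, not taken from Nutch.
final class LogMethodLookup {

    /** Stand-in for org.slf4j.Logger: it declares error(String), not error(Object). */
    interface MiniLogger {
        void error(String msg);
    }

    /**
     * Looks up a public one-argument method by exact parameter type.
     * Class.getMethod does not widen String to Object, so a mismatch
     * throws NoSuchMethodException, which is turned into null here.
     */
    static Method lookup(String name, Class<?> parameterType) {
        try {
            return MiniLogger.class.getMethod(name, parameterType);
        } catch (NoSuchMethodException e) {
            return null;
        }
    }
}
```

With this stand-in, lookup("error", Object.class) yields null while lookup("error", String.class) yields a usable Method, which matches the one-word change in the patch.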
[jira] [Created] (NUTCH-1191) Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse
Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse -- Key: NUTCH-1191 URL: https://issues.apache.org/jira/browse/NUTCH-1191 Project: Nutch Issue Type: Sub-task Reporter: Ferdy Galema Fix For: nutchgora
[jira] [Updated] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
[ https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-902: --- Attachment: NUTCH-902-v3.patch Patch to include previous config changes to NUTCHGORA/ivy/ivy.xml Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box -- Key: NUTCH-902 URL: https://issues.apache.org/jira/browse/NUTCH-902 Project: Nutch Issue Type: New Feature Components: documentation, storage Affects Versions: nutchbase Reporter: Enis Soztutar Assignee: Lewis John McGibbney Fix For: nutchgora Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch As per the discussion in the mailing list and http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the necessary files and configuration. I propose that we maintain configuration for at least SQL, HBase and Cassandra. The following changes are needed: conf/gora-sql-mapping.xml, conf/gora-hbase-mapping.xml, conf/gora-cassandra-mapping.xml, and comments on nutch-default and ivy.xml. Shall we also include jars from gora-hbase, gora-cassandra and their dependencies?
[jira] [Updated] (NUTCH-1191) Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse
[ https://issues.apache.org/jira/browse/NUTCH-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1191: Attachment: NUTCH-1191.patch The patch replaces all references to the 'parse' argument with the 'fetcher.parse' property and sets it to FALSE by default throughout the code (there was still a reference that used TRUE). Tested with both TRUE and FALSE and it works like a charm. Will commit when there are no objections. Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse -- Key: NUTCH-1191 URL: https://issues.apache.org/jira/browse/NUTCH-1191 Project: Nutch Issue Type: Sub-task Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1191.patch
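One way to keep a default like this consistent is to read the property through a single accessor, so that no call site can supply its own fallback. The sketch below only illustrates that idea: FetcherParseFlag and isParsing are hypothetical names, and java.util.Properties stands in for the real configuration object so the example stays self-contained. It is not the code in NUTCH-1191.patch.

```java
import java.util.Properties;

// Hypothetical accessor, not taken from the patch.
final class FetcherParseFlag {

    /** Property name as given in the issue title. */
    static final String KEY = "fetcher.parse";

    /** The single place where the default lives; FALSE as stated in the comment above. */
    static final boolean DEFAULT = false;

    /** Returns the configured value, or DEFAULT when the property is absent. */
    static boolean isParsing(Properties conf) {
        String raw = conf.getProperty(KEY);
        return raw == null ? DEFAULT : Boolean.parseBoolean(raw.trim());
    }
}
```

A call site would then use FetcherParseFlag.isParsing(conf) rather than passing TRUE or FALSE itself, which is the kind of stray default the comment says was still present.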
[jira] [Updated] (NUTCH-1191) Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse
[ https://issues.apache.org/jira/browse/NUTCH-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1191: Component/s: fetcher Port NUTCH-1102 to nutchgora - consistent use of fetcher.parse -- Key: NUTCH-1191 URL: https://issues.apache.org/jira/browse/NUTCH-1191 Project: Nutch Issue Type: Sub-task Components: fetcher Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1191.patch
[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141300#comment-13141300 ] Lewis John McGibbney commented on NUTCH-1188: - Is it just me, or has this already been committed along with NUTCH-1078 in trunk [1] when Julien fixed it in the Nutchgora branch [2]? [1] http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/util/LogUtil.java?r1=1175075&r2=1177290&diff_format=h [2] http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/util/LogUtil.java?r1=983885&r2=988544&diff_format=h ERROR util.LogUtil - Cannot log with method [null] -- Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special environment Reporter: Zhang JinYan Attachments: LogUtil.patch LogUtil has static fields, which are initialized like this: FATAL = Logger.class.getMethod("error", new Class[] { Object.class }); but the Logger has no such method; the correct method is: void org.slf4j.Logger.error(String msg) So LogUtil's static fields are not initialized correctly (they are null). --- Run a crawl and you will find this message in hadoop.log:
2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
java.lang.NullPointerException
 at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
 at java.io.PrintStream.write(PrintStream.java:432)
 at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
 at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
 at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
 at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
 at java.io.PrintStream.newLine(PrintStream.java:496)
 at java.io.PrintStream.println(PrintStream.java:757)
 at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
 at java.lang.Throwable.printStackTrace(Throwable.java:468)
 at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
Patch: FATAL = Logger.class.getMethod("error", new Class[] { String.class });
[jira] [Issue Comment Edited] (NUTCH-1138) remove LogUtil from trunk and nutch gora
[ https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141334#comment-13141334 ] Zhang JinYan edited comment on NUTCH-1138 at 11/1/11 5:13 PM: -- Applied the patch to branch-1.4 and rebuilt with cmd: ant clean build. Configured these websites to crawl: {quote} http://172.16.123.123/bbs/viewthread.php?tid=12345 http://172.16.123.123/bbs/attachment.php?aid=12345 http://www.jettycn.com/ {quote} The first two sites are not available. Ran the crawl with cmd (platform: Windows): {quote} sh.exe ./bin/nutch crawl seedurl -dir crawldev -solr http://localhost:8983/solr/ {quote} The crawl completed successfully. A query in Solr admin returns: {code:xml} <result name="response" numFound="320" start="0"></result> {code} Searching hadoop.log for the word ERROR finds 3 results, caused by: {code} java.net.ConnectException: Connection timed out: connect {code} Searching hadoop.log for the word Exception finds results like this: {quote} 2011-11-02 00:39:01,821 INFO httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server www.jettycn.com failed to respond 2011-11-02 00:39:01,821 INFO httpclient.HttpMethodDirector - Retrying request {quote} So there is no exception related to your patch in hadoop.log. The patch works fine with branch-1.4 for me. was (Author: yearn20m): Applied the patch to branch-1.4 and rebuilt with cmd: ant clean build. Configured these websites to crawl: {quote} http://172.16.123.123/bbs/viewthread.php?tid=12345 http://172.16.123.123/bbs/attachment.php?aid=12345 http://www.jettycn.com/ {quote} The first two sites are not available. Ran the crawl with cmd (platform: Windows): {quote} sh.exe ./bin/nutch crawl seedurl -dir crawldev -solr http://localhost:8983/solr/ {quote} The crawl completed successfully. A query in Solr admin returns: {code:xml} <result name="response" numFound="320" start="0"></result> {code} Checking hadoop.log for the word ERROR finds 3 results, caused by: {code} java.net.ConnectException: Connection timed out: connect {code} Searching for the word Exception finds results like this: {quote} 2011-11-02 00:39:01,821 INFO httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server www.jettycn.com failed to respond 2011-11-02 00:39:01,821 INFO httpclient.HttpMethodDirector - Retrying request {quote} So there is no exception related to your patch in hadoop.log. The patch works fine with branch-1.4 for me. remove LogUtil from trunk and nutch gora Key: NUTCH-1138 URL: https://issues.apache.org/jira/browse/NUTCH-1138 Project: Nutch Issue Type: Improvement Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch This should move towards the removal of the LogUtil class from both codebases as per comments in NUTCH-1078.
recrawl sites in nutch 1.3
Hi, I want to re-crawl my sites every hour, and I wrote a script for this. I edited some properties in nutch-site.xml, but my re-crawler fetches URLs only 3 times and then stops fetching; that is, Nutch no longer updates after 3 hours. These are my changes in nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>30</value>
  <description>The default number of seconds between re-fetches of a page (30 days).</description>
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply adds the original fetchInterval to the last fetch time, regardless of page changes.</description>
</property>
<property>
  <name>solr.commit.size</name>
  <value>10</value>
  <description>Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory.</description>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>36000</value>
  <description>The maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no matter what is its status.</description>
</property>
-- View this message in context: http://lucene.472066.n3.nabble.com/recrawl-sites-in-nutch-1-3-tp3470457p3470457.html Sent from the Nutch - Dev mailing list archive at Nabble.com.
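The property descriptions quoted above say the values are in seconds while still mentioning 30 and 90 days, whereas the configured values are 30 and 36000. A small sketch of the arithmetic may help when choosing values; FetchIntervalMath and its methods are hypothetical names used only for illustration, and nothing here claims to explain why the fetching stops.

```java
// Hypothetical helper for converting intervals to the seconds used in nutch-site.xml.
final class FetchIntervalMath {

    static final long SECONDS_PER_HOUR = 60L * 60L;
    static final long SECONDS_PER_DAY = 24L * SECONDS_PER_HOUR;

    /** Number of seconds in the given number of hours. */
    static long hours(long n) {
        return n * SECONDS_PER_HOUR;
    }

    /** Number of seconds in the given number of days. */
    static long days(long n) {
        return n * SECONDS_PER_DAY;
    }
}
```

Under these definitions an hourly re-crawl corresponds to 3600 seconds, 30 days to 2592000 seconds, and 90 days to 7776000 seconds; the configured maximum of 36000 seconds is 10 hours.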
[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora
[ https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141376#comment-13141376 ] Lewis John McGibbney commented on NUTCH-1138: - Hi. Current 1.4 development is located in the trunk area of the SVN repository. Is this where the confusion is possibly stemming from? When we make code commits, we are committing to the trunk 1.4 development, rather than the branch-1.4 development. The reasoning behind this can be seen in the latest announcement on the Nutch home page. remove LogUtil from trunk and nutch gora Key: NUTCH-1138 URL: https://issues.apache.org/jira/browse/NUTCH-1138 Project: Nutch Issue Type: Improvement Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch This should move towards the removal of the LogUtil class from both codebases as per comments in NUTCH-1078.