[ http://issues.apache.org/jira/browse/NUTCH-205?page=comments#action_12365892 ]
M.Oliver Scheele commented on NUTCH-205:
----------------------------------------

Here's an easy-to-reproduce example for the issue above:

1.) Install Nutch 0.7.1

2.) To simulate a slow network/timeouts, set the following parameters in your "crawl-tool.xml":

    <property>
      <name>http.max.delays</name>
      <value>1</value>
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>

3.) Prepare your crawl as described in the tutorial and let it run:

    > nutch crawl urls.txt -dir crawl_data -depth 4

4.) While the crawl is running you will notice some errors like this:

    fetch of http://www.webpage.de/mypage.html failed with: java.lang.Exception:
    org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.

[Note: that is fine; as the message says, the page should be recrawled next time.]

5.) After crawling has finished, check the web DB with the following command:

    > nutch readdb crawl_data/db/ -dumppageurl

You will see something like this:

    Page 5: Version: 4
    URL: http://www.webpage.de/mypage.html
    ID: 21c1e01ff46ec28c0af6da63512e43bf
    Next fetch: Sun Aug 17 07:12:55 CET 292278994
    Retries since fetch: 0
    Retry interval: 1 days

--> As you can see, the page would not be refetched before the year 292278994 (that date is simply Long.MAX_VALUE taken as epoch milliseconds; see the check after this message). The "retry later" never happens, and the page will never appear in a Nutch search result. :(

> Wrong 'fetch date' for non-available pages
> ------------------------------------------
>
>          Key: NUTCH-205
>          URL: http://issues.apache.org/jira/browse/NUTCH-205
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7, 0.7.1
>  Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch API
>     Reporter: M.Oliver Scheele
>
> Web pages that could not be fetched because of a timeout are never refetched:
> their next fetch time in the web DB is set to Long.MAX_VALUE.
>
> Example:
> -------------
> While fetching our URLs, we got some errors like this:
>
>   60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html
>   failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater:
>   Exceeded http.max.delays: retry later.
>
> That seems to be ok and indicates some network problems.
> The problem is that the entry in the web DB shows the following:
>
>   Page 4: Version: 4
>   URL: http://www.test-domain.de/crawl_html/page_2.html
>   ID: b360ec931855b0420776909bd96557c0
>   Next fetch: Sun Aug 17 07:12:55 CET 292278994
>   Retries since fetch: 0
>   Retry interval: 0 days
>
> The 'Next fetch' date is set to the year 292278994.
> I probably won't live to see the refetch. ;)
> A page that could not be crawled because of network problems should be
> refetched with the next crawl (i.e. set the next fetch date to the current
> time + 1h).
>
> Possible bug fix:
> ----------------------------
> When updating the web DB, the method updateForSegment() in
> UpdateDatabaseTool sets the fetch date to Long.MAX_VALUE for any (unknown)
> exception during fetching; the RETRY status is not always handled correctly.
> Change the following lines:
>
>   } else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
>              page.getRetriesSinceFetch() < MAX_RETRIES) {
>     pageRetry(fo);  // retry later
>   } else {
>     pageGone(fo);   // give up: page is gone
>   }
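For reference, the odd "Next fetch" date in both dumps is exactly what java.util.Date prints for Long.MAX_VALUE milliseconds since the epoch, i.e. the value the give-up path writes into the web DB. A quick standalone check:

    import java.util.Date;

    public class MaxFetchDate {
        public static void main(String[] args) {
            // Long.MAX_VALUE milliseconds since the epoch is the
            // "year 292278994" date seen in the readdb dumps above.
            System.out.println(new Date(Long.MAX_VALUE));
            // prints: Sun Aug 17 07:12:55 GMT 292278994 (zone depends on locale)
        }
    }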

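As a minimal sketch of what the retry branch should do instead: the names pageRetry(), pageGone(), MAX_RETRIES, FetcherOutput and ProtocolStatus come from the quoted code, but the Page accessors (getPage(), setNextFetchTime(), setRetriesSinceFetch()) and the one-hour rescheduling are assumptions following the suggestion above, not the actual 0.7.1 source:

    // Hedged sketch: reschedule a RetryLater page for the near future
    // instead of pushing its next fetch out to Long.MAX_VALUE.
    private void pageRetry(FetcherOutput fo) {
        Page page = fo.getPage();  // assumed accessor on FetcherOutput
        // Refetch with the next crawl: current time + 1h, as proposed above.
        page.setNextFetchTime(System.currentTimeMillis() + 60L * 60L * 1000L);
        // Count the attempt so MAX_RETRIES eventually routes to pageGone().
        page.setRetriesSinceFetch(page.getRetriesSinceFetch() + 1);
    }

With a change along these lines, a timed-out page would show a near-term "Next fetch" and a non-zero "Retries since fetch" in the readdb dump, and would be picked up again on the next crawl.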