[ http://issues.apache.org/jira/browse/NUTCH-205?page=comments#action_12365434 ]
Andrzej Bialecki commented on NUTCH-205:
-----------------------------------------

This is a design choice, not a bug. The errors you see are due to improper
configuration - some threads cannot access the host for a long time because
of the limit on concurrent requests to a single host. Please see the
"fetcher.threads.per.host" and "http.max.delays" config properties.

> Wrong 'fetch date' for non-available pages
> -------------------------------------------
>
>           Key: NUTCH-205
>           URL: http://issues.apache.org/jira/browse/NUTCH-205
>       Project: Nutch
>          Type: Bug
>    Components: fetcher
>      Versions: 0.7, 0.7.1
>   Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch-API
>      Reporter: M.Oliver Scheele
>
> Web pages that could not be fetched because of a timeout are never
> refetched: their next fetch date in the web-db is set to Long.MAX_VALUE.
>
> Example:
> -------------
> While fetching our URLs, we got some errors like this:
>
>   60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html
>   failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater:
>   Exceeded http.max.delays: retry later.
>
> That seems to be ok and indicates some network problems.
> The problem is that the entry in the web-db shows the following:
>
>   Page 4: Version: 4
>   URL: http://www.test-domain.de/crawl_html/page_2.html
>   ID: b360ec931855b0420776909bd96557c0
>   Next fetch: Sun Aug 17 07:12:55 CET 292278994
>   Retries since fetch: 0
>   Retry interval: 0 days
>
> The 'Next fetch' date is set to the year 292278994.
> I probably will not live to see that refetch. ;)
>
> A page that could not be crawled because of network problems should be
> refetched with the next crawl (i.e. set its next fetch date to the current
> time + 1h).
>
> Possible bug fix:
> ----------------------------
> When updating the web-db, the method updateForSegment() in
> UpdateDatabaseTool.class always sets the fetch date to Long.MAX_VALUE for
> any (unknown) exception during fetching; the RETRY status is not always
> set correctly. Change the following lines:
>
>   } else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
>              page.getRetriesSinceFetch() < MAX_RETRIES) {
>     pageRetry(fo);   // retry later
>   } else {
>     pageGone(fo);    // give up: page is gone
>   }

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
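
For reference, the two properties Andrzej points to live in Nutch's
configuration files (conf/nutch-default.xml, with local overrides in
conf/nutch-site.xml). Below is a minimal sketch of such an override,
assuming the 0.7-style <nutch-conf> layout; the values are purely
illustrative, not recommended defaults. Raising fetcher.threads.per.host
allows more concurrent requests to a single host, and raising
http.max.delays lets a blocked fetcher thread wait longer before it gives
up with the "Exceeded http.max.delays: retry later." error quoted above.

  <?xml version="1.0"?>
  <nutch-conf>
    <property>
      <name>fetcher.threads.per.host</name>
      <!-- illustrative value: max concurrent requests allowed to one host -->
      <value>2</value>
    </property>
    <property>
      <name>http.max.delays</name>
      <!-- illustrative value: how many times a thread waits for a busy host
           before failing the page with RetryLater -->
      <value>10</value>
    </property>
  </nutch-conf>

Note that allowing more threads per host trades politeness toward that host
for fewer RetryLater failures, so it is usually only appropriate when you
control the crawled server.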
