Wrong 'fetch date' for non available pages
------------------------------------------
Key: NUTCH-205
URL: http://issues.apache.org/jira/browse/NUTCH-205
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.7, 0.7.1
Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch-API
Reporter: M.Oliver Scheele
Web pages that could not be fetched because of a timeout are never refetched:
their next fetch date in the web-db is set to Long.MAX_VALUE.
Example:
-------------
While fetching our URLs, we got some errors like this:
60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html failed
with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded
http.max.delays: retry later.
That seems fine and indicates a transient network problem.
The problem is that the corresponding entry in the web-db now looks like this:
Page 4: Version: 4
URL: http://www.test-domain.de/crawl_html/page_2.html
ID: b360ec931855b0420776909bd96557c0
Next fetch: Sun Aug 17 07:12:55 CET 292278994
Retries since fetch: 0
Retry interval: 0 days
The 'Next fetch' date is set to the year 292278994, which is Long.MAX_VALUE
interpreted as a millisecond timestamp.
I probably won't live to see that refetch. ;)
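The bogus date is easy to reproduce: Long.MAX_VALUE milliseconds since the
epoch lands exactly in the year 292278994 reported above. A minimal
demonstration:

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Interpreting Long.MAX_VALUE as a millisecond timestamp yields the
// year 292278994 -- the same "Next fetch" date stored in the web-db.
public class MaxFetchDate {
    public static void main(String[] args) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT"));
        cal.setTimeInMillis(Long.MAX_VALUE);
        System.out.println(cal.get(Calendar.YEAR)); // prints 292278994
    }
}
```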
A page that couldn't be crawled because of a network problem should be
refetched on the next crawl, i.e. its next fetch date should be set to the
current time + 1h.
Possible bug fix:
----------------------------
When updating the web-db, the method updateForSegment() in
UpdateDatabaseTool always sets the fetch date to Long.MAX_VALUE for any
(unknown) exception that occurred during fetching.
The RETRY status is not always set correctly.
Change the following lines:
} else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
           page.getRetriesSinceFetch() < MAX_RETRIES) {
  pageRetry(fo);  // retry later
} else {
  pageGone(fo);   // give up: page is gone
}
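The intended behavior can be sketched as follows. This is a simplified,
hypothetical illustration of the rescheduling logic, not the actual Nutch
code: PageRecord, RETRY_DELAY_MS, and handleRetryStatus() are made-up names,
and the one-hour delay is the value proposed above.

```java
// Sketch of the proposed handling: RETRY-status failures are rescheduled
// for a later fetch; only pages that exhausted their retries are given up.
public class RetrySketch {
    static final int MAX_RETRIES = 3;                    // assumed retry limit
    static final long RETRY_DELAY_MS = 60L * 60 * 1000;  // 1 hour, as proposed

    // Stand-in for the web-db page entry.
    static class PageRecord {
        int retriesSinceFetch;
        long nextFetchMillis;
    }

    static void handleRetryStatus(PageRecord page, long nowMillis) {
        if (page.retriesSinceFetch < MAX_RETRIES) {
            page.retriesSinceFetch++;
            page.nextFetchMillis = nowMillis + RETRY_DELAY_MS; // retry later
        } else {
            page.nextFetchMillis = Long.MAX_VALUE; // give up: page is gone
        }
    }

    public static void main(String[] args) {
        PageRecord page = new PageRecord();
        long now = System.currentTimeMillis();
        handleRetryStatus(page, now);
        // A first failure schedules a refetch one hour out, not in 292278994.
        System.out.println(page.nextFetchMillis == now + RETRY_DELAY_MS);
    }
}
```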
--
This message is automatically generated by JIRA.