Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE
--------------------------------------------------------------------
Key: NUTCH-516
URL: https://issues.apache.org/jira/browse/NUTCH-516
Project: Nutch
Issue Type: Bug
Components: fetcher
Environment: Java 1.6, Linux 2.6
Reporter: Emmanuel Joke
Fix For: 1.0.0
We cannot crawl some pages because of a robots restriction. In this case we update
the db with the metadata _pst_:robots_denied(18), set the status code to 3, and
change the fetch interval to 67.5 days.
Unfortunately the fetch time is never changed, so the page keeps being generated
and fetched on every run.
We should update the scheduled fetch time in the crawldb to reflect the fetch interval.
We should add the following in CrawlDbReducer:
case CrawlDatum.STATUS_FETCH_GONE: // permanent failure
  if (old != null)
    result.setSignature(old.getSignature()); // use old signature
  result.setStatus(CrawlDatum.STATUS_DB_GONE);
  result = schedule.setPageGoneSchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime());
  // set the schedule
  result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(),
      modified);
  break;
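
To illustrate why raising the interval alone does not stop the refetching, here is a minimal, self-contained sketch. It does not use the Nutch API: the Page class and the isDue, markGoneIntervalOnly and markGoneAndReschedule helpers are hypothetical names invented for this example, and the 30-day starting interval is an assumption. It only models the idea that a page is selected when its stored fetch time is in the past.

```java
import java.util.concurrent.TimeUnit;

public class GoneScheduleSketch {

    /** Hypothetical, simplified stand-in for a crawldb entry (not CrawlDatum). */
    static final class Page {
        long fetchTimeMs;      // time at which the page is next due
        long fetchIntervalSec; // interval between fetches

        Page(long fetchTimeMs, long fetchIntervalSec) {
            this.fetchTimeMs = fetchTimeMs;
            this.fetchIntervalSec = fetchIntervalSec;
        }
    }

    /** Assumed selection rule: a page is due once its fetch time has passed. */
    static boolean isDue(Page page, long nowMs) {
        return page.fetchTimeMs <= nowMs;
    }

    /** Models the reported behaviour: the interval grows, the fetch time stays. */
    static void markGoneIntervalOnly(Page page, long newIntervalSec) {
        page.fetchIntervalSec = newIntervalSec;
    }

    /** Models the proposed behaviour: the fetch time is moved forward as well. */
    static void markGoneAndReschedule(Page page, long newIntervalSec, long fetchedAtMs) {
        page.fetchIntervalSec = newIntervalSec;
        page.fetchTimeMs = fetchedAtMs + TimeUnit.SECONDS.toMillis(newIntervalSec);
    }

    public static void main(String[] args) {
        long fetchedAt = 1_000_000L;                       // arbitrary fixed timestamp
        long nextRun = fetchedAt + TimeUnit.DAYS.toMillis(1);
        long goneIntervalSec = 5_832_000L;                 // 67.5 days in seconds
        long defaultIntervalSec = TimeUnit.DAYS.toSeconds(30); // assumed starting interval

        Page stuck = new Page(fetchedAt, defaultIntervalSec);
        markGoneIntervalOnly(stuck, goneIntervalSec);

        Page rescheduled = new Page(fetchedAt, defaultIntervalSec);
        markGoneAndReschedule(rescheduled, goneIntervalSec, fetchedAt);

        System.out.println("interval only, due on next run: " + isDue(stuck, nextRun));
        System.out.println("rescheduled, due on next run:   " + isDue(rescheduled, nextRun));
    }
}
```

In this model the first page is still due on the next run, while the second one is not due again until 67.5 days after the failed fetch.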