Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE
--------------------------------------------------------------------

                 Key: NUTCH-516
                 URL: https://issues.apache.org/jira/browse/NUTCH-516
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
         Environment: Java 1.6, Linux 2.6
            Reporter: Emmanuel Joke
             Fix For: 1.0.0


We cannot crawl some pages because of a robots restriction. In this case we 
update the db with the metadata _pst_:robots_denied(18), set the status code 
to 3, and change the fetch interval to 67.5 days.

Unfortunately, the fetch time is never changed, so the page keeps being 
generated and fetched on every cycle.
We should update the scheduled fetch time in the crawldb to reflect the new 
fetch interval.
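
To illustrate the problem described above, here is a minimal, self-contained 
sketch. It does not use the Nutch classes: GoneScheduleSketch, PageEntry, 
isDue, markGoneIntervalOnly and markGoneAndReschedule are hypothetical names 
invented for this example, and the interval is kept in days only because the 
report quotes it that way. It shows that a page whose interval is lengthened 
but whose fetch time is left alone stays due, while adding the interval to 
the fetch time pushes it out of the next generate cycle.

```java
final class GoneScheduleSketch {
    static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

    // Hypothetical stand-in for a crawldb entry; this is NOT Nutch's CrawlDatum.
    static final class PageEntry {
        long fetchTimeMs;          // time at which the page is next due
        double fetchIntervalDays;  // interval between fetches, in days (assumed unit)

        PageEntry(long fetchTimeMs, double fetchIntervalDays) {
            this.fetchTimeMs = fetchTimeMs;
            this.fetchIntervalDays = fetchIntervalDays;
        }

        // Generator-style check: a page is selected once its fetch time has passed.
        boolean isDue(long nowMs) {
            return fetchTimeMs <= nowMs;
        }
    }

    // Behaviour described in the report: the interval changes, the fetch time does not.
    static void markGoneIntervalOnly(PageEntry entry, double newIntervalDays) {
        entry.fetchIntervalDays = newIntervalDays;
    }

    // Suggested behaviour: also move the fetch time forward by the new interval.
    static void markGoneAndReschedule(PageEntry entry, double newIntervalDays,
                                      long fetchedAtMs) {
        entry.fetchIntervalDays = newIntervalDays;
        entry.fetchTimeMs = fetchedAtMs + Math.round(newIntervalDays * MS_PER_DAY);
    }
}
```

With markGoneIntervalOnly the entry is still due on the very next cycle; 
with markGoneAndReschedule it only becomes due again 67.5 days after the 
failed fetch.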

We should add the following to CrawlDbReducer:
case CrawlDatum.STATUS_FETCH_GONE:            // permanent failure
      if (old != null)
        result.setSignature(old.getSignature());  // use old signature
      result.setStatus(CrawlDatum.STATUS_DB_GONE);
      result = schedule.setPageGoneSchedule((Text)key, result, prevFetchTime,
          prevModifiedTime, fetch.getFetchTime());

      // set the schedule
      result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
          prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(),
          modified);

      break;
