[ 
https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515954
 ] 

Hudson commented on NUTCH-516:
------------------------------

Integrated in Nutch-Nightly #162 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/162/])

> Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE
> --------------------------------------------------------------------
>
>                 Key: NUTCH-516
>                 URL: https://issues.apache.org/jira/browse/NUTCH-516
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>         Environment: Java 1.6, Linux 2.6
>            Reporter: Emmanuel Joke
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-516.patch
>
>
> We cannot crawl some pages because of a robots restriction. In this case we 
> update the db with the metadata _pst_:robots_denied(18), set the status 
> code to 3, and change the fetch interval to 67.5 days.
> Unfortunately, the fetch time is never changed, so the page keeps being 
> generated and fetched every time.
> We should update the scheduled fetch time in the crawldb to reflect the 
> fetch interval.
> We should add the following to CrawlDbReducer:
> case CrawlDatum.STATUS_FETCH_GONE:            // permanent failure
>       if (old != null)
>         result.setSignature(old.getSignature());  // use old signature
>       result.setStatus(CrawlDatum.STATUS_DB_GONE);
>       result = schedule.setPageGoneSchedule((Text)key, result, prevFetchTime,
>           prevModifiedTime, fetch.getFetchTime());
>      // set the schedule
>       result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
>           prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(),
>           modified);
>       break;
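
The scheduling problem described in the issue can be illustrated with a small standalone sketch. It does not use any Nutch classes: the class name GoneScheduleSketch and the helpers nextFetchTime and isDue are hypothetical, and it assumes the interval is expressed in days and times in milliseconds. It only shows why a page stays eligible for fetching when its fetch time is not advanced by the fetch interval.

```java
public class GoneScheduleSketch {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Hypothetical helper: next fetch time = time of this fetch + interval.
    // Assumes the interval is given in days (as in the 67.5 days of the report).
    static long nextFetchTime(long fetchTimeMillis, double intervalDays) {
        return fetchTimeMillis + Math.round(intervalDays * MILLIS_PER_DAY);
    }

    // Hypothetical stand-in for the generator's check: a page is selected
    // when its scheduled fetch time is not in the future.
    static boolean isDue(long scheduledFetchTimeMillis, long nowMillis) {
        return scheduledFetchTimeMillis <= nowMillis;
    }

    public static void main(String[] args) {
        long fetchedAt = 0L;                          // time of the denied fetch
        long oneDayLater = fetchedAt + MILLIS_PER_DAY;

        // Reported behaviour: the interval changes but the fetch time does not,
        // so the page is still due on the next generate run.
        System.out.println("unchanged fetch time due: " + isDue(fetchedAt, oneDayLater));

        // Intended behaviour: the fetch time is pushed out by the interval,
        // so the page is skipped until the interval has elapsed.
        long next = nextFetchTime(fetchedAt, 67.5);
        System.out.println("rescheduled fetch time due: " + isDue(next, oneDayLater));
    }
}
```

With the fetch time left unchanged the page is selected again one day later; once the fetch time is moved forward by the interval it is not.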

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
