[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515954 ]
Hudson commented on NUTCH-516:
------------------------------

Integrated in Nutch-Nightly #162 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/162/])

> Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE
> --------------------------------------------------------------------
>
>                 Key: NUTCH-516
>                 URL: https://issues.apache.org/jira/browse/NUTCH-516
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>         Environment: Java 1.6, Linux 2.6
>            Reporter: Emmanuel Joke
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-516.patch
>
>
> We cannot crawl some pages because of a robots restriction. In this case we
> update the db with the metadata _pst_:robots_denied(18), set the status
> code to 3, and change the fetch interval to 67.5 days.
> Unfortunately the fetch time is never changed, so the page keeps being
> generated and fetched on every cycle.
> We should update the scheduled fetch time in the crawldb to reflect the
> fetch interval.
> We should add in CrawlDbReducer:
>     case CrawlDatum.STATUS_FETCH_GONE:               // permanent failure
>       if (old != null)
>         result.setSignature(old.getSignature());     // use old signature
>       result.setStatus(CrawlDatum.STATUS_DB_GONE);
>       result = schedule.setPageGoneSchedule((Text)key, result, prevFetchTime,
>           prevModifiedTime, fetch.getFetchTime());
>       // set the schedule
>       result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
>           prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(),
>           modified);
>       break;

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
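
For readers following the issue, the scheduling problem described above can be illustrated with a minimal, self-contained sketch. It does not use Nutch classes: `Datum`, `isDue`, and the 50% back-off inside this `setPageGoneSchedule` are hypothetical stand-ins chosen for illustration, and the actual behavior is whatever NUTCH-516.patch implements. The point is only that growing the fetch interval has no effect unless the next fetch time is also moved forward.

```java
public class GoneScheduleSketch {

    /** Hypothetical stand-in for a crawldb entry; not Nutch's CrawlDatum. */
    static final class Datum {
        long fetchTimeMs;      // when the page is next due for fetching
        long fetchIntervalSec; // delay between fetches, in seconds

        Datum(long fetchTimeMs, long fetchIntervalSec) {
            this.fetchTimeMs = fetchTimeMs;
            this.fetchIntervalSec = fetchIntervalSec;
        }
    }

    /**
     * Assumed back-off policy (an illustration, not the patch itself):
     * grow the interval by 50% and schedule the next fetch one full
     * interval after the failed attempt.
     */
    static Datum setPageGoneSchedule(Datum datum, long fetchTimeMs) {
        datum.fetchIntervalSec = Math.round(datum.fetchIntervalSec * 1.5);
        datum.fetchTimeMs = fetchTimeMs + datum.fetchIntervalSec * 1000L;
        return datum;
    }

    /** Generator-style check: a page is selected once its fetch time has passed. */
    static boolean isDue(Datum datum, long nowMs) {
        return datum.fetchTimeMs <= nowMs;
    }

    public static void main(String[] args) {
        long now = 1_000_000_000_000L; // arbitrary timestamp in milliseconds
        long daySec = 86_400L;

        // A page with a 45-day interval that has just failed as "gone".
        Datum page = new Datum(now, 45L * daySec);
        setPageGoneSchedule(page, now);

        System.out.println("interval (days): " + page.fetchIntervalSec / (double) daySec);
        System.out.println("due one day later: " + isDue(page, now + daySec * 1000L));
    }
}
```

Without the fetch-time update in `setPageGoneSchedule`, `isDue` would stay true on every cycle, which matches the repeated generate-and-fetch behavior the reporter describes.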