[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186177#comment-13186177
 ] 

Markus Jelsma commented on NUTCH-1247:
--------------------------------------

Lewis, we're seeing many URLs with a high retry value. Because the counter is a 
byte, values above 127 wrap around and show up as negative. That is not a 
problem in itself, but in my setup the value seems to keep increasing.
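
To illustrate the wrap-around, here is a minimal standalone sketch. It does not use any Nutch classes; the class and method names are made up for the example. It only shows what happens to a Java byte counter that keeps being incremented:

```java
public class RetryByteOverflow {

    // Hypothetical helper: increments a byte counter the given number of
    // times, the way a byte-typed retry field would be incremented.
    static byte incrementTimes(int times) {
        byte retries = 0;
        for (int i = 0; i < times; i++) {
            retries++; // the increment silently narrows the result back to byte
        }
        return retries;
    }

    public static void main(String[] args) {
        System.out.println(incrementTimes(127)); // 127, the largest byte value
        System.out.println(incrementTimes(128)); // -128, wrapped around
        System.out.println(incrementTimes(129)); // -127
    }
}
```

The values -128 and -127 are the same ones that appear in the CrawlDbReader output quoted in the issue description.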

Andrzej, there may indeed be something wrong. Might this be related to 
NUTCH-1245 then? The following CrawlDbReducer code looks suspect:

{code}
case CrawlDatum.STATUS_FETCH_RETRY: // temporary failure
  if (oldSet) {
    result.setSignature(old.getSignature()); // use old signature
  }
  result = schedule.setPageRetrySchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime());
  if (result.getRetriesSinceFetch() < retryMax) {
    result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  } else {
    result.setStatus(CrawlDatum.STATUS_DB_GONE);
  }
  break;
{code}

In setPageRetrySchedule() the retry count is always incremented. This causes 
records with exceptions such as UnknownHostException to be refetched for each 
segment. That would explain why the first segment in our cycle has many more 
exceptions than average.
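
As a rough sketch of how a wrapped counter interacts with a check shaped like the one above: the class below is hypothetical and standalone, the limit of 3 is an assumed value, and the status strings only stand in for the CrawlDatum constants. Whether the real reducer behaves this way depends on code not shown here.

```java
public class RetryCheckSketch {

    // Assumed retry limit, for illustration only.
    static final int RETRY_MAX = 3;

    // Hypothetical stand-in for the reducer branch: the counter is
    // incremented once per failure, then compared against the limit.
    static String statusAfter(int failures) {
        byte retries = 0;
        for (int i = 0; i < failures; i++) {
            retries++; // always incremented, wraps to negative after 127
        }
        return retries < RETRY_MAX ? "DB_UNFETCHED" : "DB_GONE";
    }

    public static void main(String[] args) {
        System.out.println(statusAfter(2));   // DB_UNFETCHED, still below the limit
        System.out.println(statusAfter(3));   // DB_GONE, limit reached
        System.out.println(statusAfter(128)); // DB_UNFETCHED, counter wrapped to -128
    }
}
```

In this sketch a counter that has wrapped to a negative value compares as below the limit again, so the record would go back to unfetched rather than gone.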

Do you follow?
                
> CrawlDatum.retries should be int
> --------------------------------
>
>                 Key: NUTCH-1247
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1247
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.5
>
>
> CrawlDatum.retries is a byte and goes bad with larger values.
> 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
> 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1

