[
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186177#comment-13186177
]
Markus Jelsma commented on NUTCH-1247:
--------------------------------------
Lewis, we're seeing many URLs with a high retry value. Once the value exceeds
127 it wraps around and becomes negative, because the counter is a byte. That
is not a problem in itself, but in my setup the value seems to keep increasing.
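To illustrate the wrap-around, here is a minimal standalone sketch. The class and method names are hypothetical and not part of Nutch; the only assumption taken from the issue is that the retry counter is stored in a signed Java byte.

```java
public class RetryByteOverflow {

    // Hypothetical helper: increments a counter held in a signed byte.
    // The narrowing cast keeps only the low 8 bits, so 127 + 1 becomes -128.
    public static byte increment(byte retries) {
        return (byte) (retries + 1);
    }

    public static void main(String[] args) {
        byte retries = Byte.MAX_VALUE;   // 127, the largest value a byte can hold
        retries = increment(retries);
        System.out.println(retries);     // prints -128
    }
}
```

Storing the counter in an int would postpone the wrap-around to a value that is unlikely to be reached in practice, which is what the issue title proposes.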
Andrzej, there may indeed be something wrong. Might this be related to
NUTCH-1245 then? There seems to be something wrong with the following
CrawlDbReducer code:
{code}
    case CrawlDatum.STATUS_FETCH_RETRY: // temporary failure
      if (oldSet) {
        result.setSignature(old.getSignature()); // use old signature
      }
      result = schedule.setPageRetrySchedule((Text)key, result, prevFetchTime,
          prevModifiedTime, fetch.getFetchTime());
      if (result.getRetriesSinceFetch() < retryMax) {
        result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
      } else {
        result.setStatus(CrawlDatum.STATUS_DB_GONE);
      }
      break;
{code}
setPageRetrySchedule() always increments the retry counter. As a result,
records that fail with exceptions such as UnknownHostException are refetched in
every segment. That matches what we observe: the first segment in our cycle has
many more exceptions than average.
Do you follow?
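A rough sketch of how an ever-increasing byte counter could interact with the retryMax comparison in the snippet above. Everything here is hypothetical and simplified: the class, the enum, and the retryMax value of 3 are illustration only, and it assumes the getter returns the raw byte value. It is not the actual CrawlDbReducer logic.

```java
public class RetryStatusSketch {

    // Simplified stand-ins for the two CrawlDatum statuses used in the snippet.
    public enum Status { DB_UNFETCHED, DB_GONE }

    // Mirrors the comparison in the snippet: below retryMax means "try again".
    public static Status statusFor(byte retriesSinceFetch, int retryMax) {
        return retriesSinceFetch < retryMax ? Status.DB_UNFETCHED : Status.DB_GONE;
    }

    // Simulates a counter that is incremented once per failed fetch,
    // stored in a byte so that it wraps after 127.
    public static byte retriesAfter(int failedFetches) {
        byte retries = 0;
        for (int i = 0; i < failedFetches; i++) {
            retries++; // byte increment wraps from 127 to -128
        }
        return retries;
    }

    public static void main(String[] args) {
        int retryMax = 3; // hypothetical value for illustration
        System.out.println(statusFor(retriesAfter(2), retryMax));   // prints DB_UNFETCHED
        System.out.println(statusFor(retriesAfter(3), retryMax));   // prints DB_GONE
        System.out.println(statusFor(retriesAfter(128), retryMax)); // prints DB_UNFETCHED
    }
}
```

Under these assumptions, a counter that has wrapped to a negative value passes the "less than retryMax" check again, so the record would go back to unfetched instead of staying gone.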
> CrawlDatum.retries should be int
> --------------------------------
>
> Key: NUTCH-1247
> URL: https://issues.apache.org/jira/browse/NUTCH-1247
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Fix For: 1.5
>
>
> CrawlDatum.retries is a byte and overflows to negative values once it exceeds 127:
> 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
> 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1