[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704435#comment-13704435 ]
Riyaz Shaik commented on NUTCH-1457:
------------------------------------

That logic may not work in the following scenario: (fetchTime > currentTime) may be true while GeneratorJob is running, but it will be false by the time DbUpdaterJob runs if fetch and parse take too long. This leads to the same issue again.

Instead, I have made the following fix locally and am testing it. It seems to be working fine. It would be great if someone could validate this fix.

1. Introduced a new crawl status in CrawlStatus.java:

{code}
public static final byte STATUS_UNSCHEDULED = 0x20;

NAMES.put(STATUS_UNSCHEDULED, "status_unscheduled");
{code}

2. In GeneratorMapper.java, if shouldFetch returns false, set the page status to the new status (UNSCHEDULED) and still pass the page on to the reducer.

{code}
public void map(String reversedUrl, WebPage page, Context context)
    throws IOException, InterruptedException {
  ….
  ..
  // check fetch schedule
  boolean shouldFetch = schedule.shouldFetch(url, page, curTime);
  float score = page.getScore();
  if (!shouldFetch) {
    // not due yet: mark the page instead of dropping it
    page.setStatus(CrawlStatus.STATUS_UNSCHEDULED);
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("-shouldFetch rejected '" + url + "', fetchTime=" + page.getFetchTime() + ", curTime=" + curTime);
    }
  } else {
    try {
      score = scoringFilters.generatorSortValue(url, page, score);
    } catch (ScoringFilterException e) {
      // ignore
    }
  }
  entry.set(url, score);
  context.write(entry, page);
}
{code}

3. In GeneratorReducer.java, skip all other processing for pages with status UNSCHEDULED and persist the data to the webpage.
{code}
protected void reduce(SelectorEntry key, Iterable<WebPage> values, Context context)
    throws IOException, InterruptedException {
  for (WebPage page : values) {
    if (count >= limit) {
      return;
    }
    // unscheduled pages are written back without a generate mark
    if (page.getStatus() == CrawlStatus.STATUS_UNSCHEDULED) {
      writeOutput(context, key.url, page);
      continue;
    }
    if (maxCount > 0) {
      String hostordomain;
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(key.url);
      } else {
        hostordomain = URLUtil.getHost(key.url);
      }
      Integer hostCount = hostCountMap.get(hostordomain);
      if (hostCount == null) {
        hostCountMap.put(hostordomain, 0);
        hostCount = 0;
      }
      if (hostCount >= maxCount) {
        return;
      }
      hostCountMap.put(hostordomain, hostCount + 1);
    }
    Mark.GENERATE_MARK.putMark(page, batchId);
    if (!writeOutput(context, key.url, page)) {
      context.getCounter("Generator", "MALFORMED_URL").increment(1);
      continue;
    }
    context.getCounter("Generator", "GENERATE_MARK").increment(1);
    count++;
  }
}
{code}

4. In DbUpdateReducer.java, do not call setFetchSchedule if the status is UNSCHEDULED; call a regular forceRefetch instead.

{code}
protected void reduce(UrlWithScore key, Iterable<NutchWritable> values, Context context)
    throws IOException, InterruptedException {
  ….
  ..
  byte status = (byte)page.getStatus();
  switch (status) {
  case CrawlStatus.STATUS_UNSCHEDULED: // not scheduled for generate, due to fetchTime > currentTime
    if (maxInterval < page.getFetchInterval())
      schedule.forceRefetch(url, page, false);
    break;
  case CrawlStatus.STATUS_FETCHED: // successful fetch
  case CrawlStatus.STATUS_REDIR_TEMP: // successful fetch, redirected
  case CrawlStatus.STATUS_REDIR_PERM:
  case CrawlStatus.STATUS_NOTMODIFIED: // successful fetch, notmodified
    int modified = FetchSchedule.STATUS_UNKNOWN;
    if (status == CrawlStatus.STATUS_NOTMODIFIED) {
      modified = FetchSchedule.STATUS_NOTMODIFIED;
      …
}
{code}

> Nutch2 Refactor the update process so that fetched items are only processed once
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-1457
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1457
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: 2.4
>
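
The timing problem raised in the comment (a page that is not yet due when the generate job runs, but is due by the time the update job runs) can be illustrated with a small standalone sketch. This is not Nutch code: the class ScheduleRaceSketch, the methods isDue and statusAtGenerate, the STATUS_SELECTED placeholder, and the timestamps are all hypothetical, and the due check is assumed to be a plain comparison of fetch time against the current time. Only the value 0x20 for STATUS_UNSCHEDULED is taken from the comment.

```java
// Hypothetical stand-in for the scheduling check; not a Nutch class.
final class ScheduleRaceSketch {

  // Mirrors the status value proposed in the comment.
  static final byte STATUS_UNSCHEDULED = 0x20;
  // Placeholder for "selected for this batch"; an assumption of this sketch.
  static final byte STATUS_SELECTED = 0x00;

  /** Simplified due check: a page is due once its fetch time has passed. */
  static boolean isDue(long fetchTime, long now) {
    return fetchTime <= now;
  }

  /** Records the generate-time decision on the page as a status. */
  static byte statusAtGenerate(long fetchTime, long generateTime) {
    return isDue(fetchTime, generateTime) ? STATUS_SELECTED : STATUS_UNSCHEDULED;
  }

  public static void main(String[] args) {
    long fetchTime = 1_000L;    // page becomes due at t = 1000
    long generateTime = 900L;   // generate runs before the page is due
    long updateTime = 1_500L;   // update runs after a slow fetch and parse

    // Re-evaluating the clock gives two different answers for the same page.
    System.out.println("due at generate: " + isDue(fetchTime, generateTime));
    System.out.println("due at update:   " + isDue(fetchTime, updateTime));

    // The recorded status does not change with the clock.
    byte status = statusAtGenerate(fetchTime, generateTime);
    System.out.println("unscheduled: " + (status == STATUS_UNSCHEDULED));
  }
}
```

The point of the sketch is that a decision stored on the page at generate time stays stable, while a second time comparison in the update job can disagree with the first one.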