[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704435#comment-13704435 ]
Riyaz Shaik edited comment on NUTCH-1457 at 7/10/13 11:13 AM:
--------------------------------------------------------------
That logic may not work in the following scenario: (fetchTime > currentTime)
may be true while the GeneratorJob is running, but the same check can return
false by the time the DbUpdaterJob runs, if fetching and parsing take too
long. This leads back to the same issue. Instead, I have made the following
fix locally (on 2.1) and am testing it. It seems to be working fine. It would
be great if someone could validate this fix.
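To make the timing problem concrete, here is a minimal, self-contained sketch
(hypothetical code, not from Nutch; all names are made up) of how the same
fetchTime comparison can flip between the two jobs:
{code}
// Hypothetical illustration only: each job samples curTime independently,
// so a page skipped by the GeneratorJob can look overdue again by the
// time the DbUpdaterJob evaluates the same comparison.
public class UnscheduledRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    long fetchTime = System.currentTimeMillis() + 1000; // due in ~1 second

    long generateCurTime = System.currentTimeMillis(); // sampled at generate
    boolean skippedAtGenerate = fetchTime > generateCurTime; // true: skipped

    Thread.sleep(2000); // stand-in for a long fetch + parse phase

    long updateCurTime = System.currentTimeMillis(); // sampled at update
    boolean stillInFuture = fetchTime > updateCurTime; // now false

    System.out.println("skipped at generate: " + skippedAtGenerate
        + ", fetchTime still in future at update: " + stillInFuture);
  }
}
{code}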
1. Introduced a new crawl status in CrawlStatus.java:
{code}
// marks a page that the generator examined but did not schedule for fetching
public static final byte STATUS_UNSCHEDULED = 0x20;
NAMES.put(STATUS_UNSCHEDULED, "status_unscheduled");
{code}
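As a sanity check, one could verify that the new byte is registered and does
not shadow an existing status. This hypothetical test is not part of the
patch and assumes CrawlStatus exposes a getName(byte) lookup over NAMES:
{code}
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestCrawlStatusUnscheduled {
  // Hypothetical test, not part of the patch: the new status byte should
  // resolve to its registered name via the NAMES table.
  @Test
  public void testUnscheduledStatusIsRegistered() {
    assertEquals("status_unscheduled",
        CrawlStatus.getName(CrawlStatus.STATUS_UNSCHEDULED));
  }
}
{code}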
2. In GeneratorMapper.java, if shouldFetch returns false, set the page status
to the new UNSCHEDULED status and still pass the page on to the reducer:
{code}
public void map(String reversedUrl, WebPage page,
    Context context) throws IOException, InterruptedException {
  // ...
  // check fetch schedule
  boolean shouldFetch = schedule.shouldFetch(url, page, curTime);
  float score = page.getScore();
  if (!shouldFetch) {
    // mark the page instead of silently dropping it, so the DbUpdaterJob
    // can tell "not scheduled" apart from "generated but not fetched"
    page.setStatus(CrawlStatus.STATUS_UNSCHEDULED);
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("-shouldFetch rejected '" + url + "', fetchTime="
          + page.getFetchTime() + ", curTime=" + curTime);
    }
  } else {
    try {
      score = scoringFilters.generatorSortValue(url, page, score);
    } catch (ScoringFilterException e) {
      // ignore
    }
  }
  entry.set(url, score);
  context.write(entry, page);
}
{code}
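Note that unscheduled pages skip generatorSortValue and therefore sort with
their stored score. For context, here is a hedged sketch of the SelectorEntry
key written above (the real class is an inner class of GeneratorJob and may
differ in detail):
{code}
// Hedged sketch, not the actual Nutch source: sorting by descending score
// lets the reducer take the best candidates until the generate limit hits.
public static class SelectorEntry implements WritableComparable<SelectorEntry> {
  String url;
  float score;

  public void set(String url, float score) {
    this.url = url;
    this.score = score;
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    url = Text.readString(in);
    score = in.readFloat();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    Text.writeString(out, url);
    out.writeFloat(score);
  }

  @Override
  public int compareTo(SelectorEntry se) {
    if (se.score > score) return 1; // higher score sorts first
    if (se.score == score) return url.compareTo(se.url);
    return -1;
  }
}
{code}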
3. In GeneratorReducer.java, skip all other processing for pages with status
UNSCHEDULED and just persist them back to the webpage store:
{code}
protected void reduce(SelectorEntry key, Iterable<WebPage> values,
    Context context) throws IOException, InterruptedException {
  for (WebPage page : values) {
    if (count >= limit) {
      return;
    }
    if (page.getStatus() == CrawlStatus.STATUS_UNSCHEDULED) {
      // persist the unscheduled page as-is: no generate mark, and it does
      // not count against the per-host/domain or overall generate limits
      writeOutput(context, key.url, page);
      continue;
    }
    if (maxCount > 0) {
      String hostordomain;
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(key.url);
      } else {
        hostordomain = URLUtil.getHost(key.url);
      }
      Integer hostCount = hostCountMap.get(hostordomain);
      if (hostCount == null) {
        hostCountMap.put(hostordomain, 0);
        hostCount = 0;
      }
      if (hostCount >= maxCount) {
        return;
      }
      hostCountMap.put(hostordomain, hostCount + 1);
    }
    Mark.GENERATE_MARK.putMark(page, batchId);
    if (!writeOutput(context, key.url, page)) {
      context.getCounter("Generator", "MALFORMED_URL").increment(1);
      continue;
    }
    context.getCounter("Generator", "GENERATE_MARK").increment(1);
    count++;
  }
}
{code}
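The snippet relies on the existing writeOutput helper. For context, here is a
hedged sketch of what such a helper could look like (hypothetical; the real
GeneratorReducer method may differ), using TableUtil.reverseUrl to build the
store key:
{code}
// Hypothetical sketch of the writeOutput helper used above; the actual
// method may differ. It reverses the URL into the store key and signals
// malformed URLs through its return value.
private boolean writeOutput(Context context, String url, WebPage page)
    throws IOException, InterruptedException {
  try {
    String reversedUrl = TableUtil.reverseUrl(url); // store key format
    context.write(reversedUrl, page);
    return true;
  } catch (MalformedURLException e) {
    return false; // caller increments the MALFORMED_URL counter
  }
}
{code}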
4. In DbUpdateReducer.java, do not call setFetchSchedule if the status is
UNSCHEDULED; instead apply the regular forceRefetch when the page's fetch
interval exceeds the maximum:
{code}
protected void reduce(UrlWithScore key, Iterable<NutchWritable> values,
    Context context) throws IOException, InterruptedException {
  // ...
  byte status = (byte) page.getStatus();
  switch (status) {
  case CrawlStatus.STATUS_UNSCHEDULED:  // not scheduled for generate
                                        // because fetchTime > curTime
    if (maxInterval < page.getFetchInterval())
      schedule.forceRefetch(url, page, false);
    break;
  case CrawlStatus.STATUS_FETCHED:      // successful fetch
  case CrawlStatus.STATUS_REDIR_TEMP:   // successful fetch, redirected
  case CrawlStatus.STATUS_REDIR_PERM:
  case CrawlStatus.STATUS_NOTMODIFIED:  // successful fetch, not modified
    int modified = FetchSchedule.STATUS_UNKNOWN;
    if (status == CrawlStatus.STATUS_NOTMODIFIED) {
      modified = FetchSchedule.STATUS_NOTMODIFIED;
      // ...
}
{code}
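For reference, the forceRefetch call is expected to reset the page so a later
GeneratorJob run can schedule it again. A hedged sketch, modeled on
AbstractFetchSchedule.forceRefetch (the actual implementation may differ):
{code}
// Hedged sketch, modeled on AbstractFetchSchedule.forceRefetch; the real
// code may differ. It clamps an oversized fetch interval and resets the
// page status so the next GeneratorJob run can pick the page up again.
public void forceRefetch(String url, WebPage page, boolean asap) {
  if (page.getFetchInterval() > maxInterval) {
    // shrink the interval so it fits back under the configured maximum
    page.setFetchInterval(Math.round(maxInterval * 0.9f));
  }
  page.setStatus((int) CrawlStatus.STATUS_UNFETCHED);
  page.setRetriesSinceFetch(0);
  if (asap) {
    // refetch as soon as possible; the patch above passes asap=false
    page.setFetchTime(System.currentTimeMillis());
  }
}
{code}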
> Nutch2 Refactor the update process so that fetched items are only processed once
> --------------------------------------------------------------------------------
>
>          Key: NUTCH-1457
>          URL: https://issues.apache.org/jira/browse/NUTCH-1457
>      Project: Nutch
>   Issue Type: Improvement
>     Reporter: Ferdy Galema
>      Fix For: 2.4