[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704435#comment-13704435 ]
Riyaz Shaik edited comment on NUTCH-1457 at 7/10/13 11:13 AM:
--------------------------------------------------------------
That logic may not work in the following scenario: (fetchTime > currentTime)
may be true while the GeneratorJob is running, but the same check can return
false by the time the DbUpdaterJob runs, if fetching and parsing take too
long. This leads back to the same issue. Instead, I have made the following
fix locally (on 2.1) and am testing it. It seems to be working fine. It would
be great if someone could validate this fix.
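To make the timing problem concrete, here is a minimal, self-contained sketch
(hypothetical code, not from Nutch; all names are made up) of how the same
fetchTime comparison can flip between the two jobs:
{code}
// Hypothetical illustration only: each job samples curTime independently,
// so a page skipped by the GeneratorJob can look overdue again by the
// time the DbUpdaterJob evaluates the same comparison.
public class UnscheduledRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    long fetchTime = System.currentTimeMillis() + 1000; // due in ~1 second

    long generateCurTime = System.currentTimeMillis(); // sampled at generate
    boolean skippedAtGenerate = fetchTime > generateCurTime; // true: skipped

    Thread.sleep(2000); // stand-in for a long fetch + parse phase

    long updateCurTime = System.currentTimeMillis(); // sampled at update
    boolean stillInFuture = fetchTime > updateCurTime; // now false

    System.out.println("skipped at generate: " + skippedAtGenerate
        + ", fetchTime still in future at update: " + stillInFuture);
  }
}
{code}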
1. Introduced a new crawl status in CrawlStatus.java:
{code}
// marks a page that the generator examined but did not schedule for fetching
public static final byte STATUS_UNSCHEDULED = 0x20;
NAMES.put(STATUS_UNSCHEDULED, "status_unscheduled");
{code}
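As a sanity check, one could verify that the new byte is registered and does
not shadow an existing status. This hypothetical test is not part of the
patch and assumes CrawlStatus exposes a getName(byte) lookup over NAMES:
{code}
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestCrawlStatusUnscheduled {
  // Hypothetical test, not part of the patch: the new status byte should
  // resolve to its registered name via the NAMES table.
  @Test
  public void testUnscheduledStatusIsRegistered() {
    assertEquals("status_unscheduled",
        CrawlStatus.getName(CrawlStatus.STATUS_UNSCHEDULED));
  }
}
{code}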
2. In GeneratorMapper.java, if shouldFetch returns false, set the page status
to the new UNSCHEDULED status and still pass the page on to the reducer:
{code}
public void map(String reversedUrl, WebPage page,
    Context context) throws IOException, InterruptedException {
  // ...
  // check fetch schedule
  boolean shouldFetch = schedule.shouldFetch(url, page, curTime);
  float score = page.getScore();
  if (!shouldFetch) {
    // mark the page instead of silently dropping it, so the DbUpdaterJob
    // can tell "not scheduled" apart from "generated but not fetched"
    page.setStatus(CrawlStatus.STATUS_UNSCHEDULED);
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("-shouldFetch rejected '" + url + "', fetchTime="
          + page.getFetchTime() + ", curTime=" + curTime);
    }
  } else {
    try {
      score = scoringFilters.generatorSortValue(url, page, score);
    } catch (ScoringFilterException e) {
      // ignore
    }
  }
  entry.set(url, score);
  context.write(entry, page);
}
{code}
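Note that unscheduled pages skip generatorSortValue and therefore sort with
their stored score. For context, here is a hedged sketch of the SelectorEntry
key written above (the real class is an inner class of GeneratorJob and may
differ in detail):
{code}
// Hedged sketch, not the actual Nutch source: sorting by descending score
// lets the reducer take the best candidates until the generate limit hits.
public static class SelectorEntry implements WritableComparable<SelectorEntry> {
  String url;
  float score;

  public void set(String url, float score) {
    this.url = url;
    this.score = score;
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    url = Text.readString(in);
    score = in.readFloat();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    Text.writeString(out, url);
    out.writeFloat(score);
  }

  @Override
  public int compareTo(SelectorEntry se) {
    if (se.score > score) return 1; // higher score sorts first
    if (se.score == score) return url.compareTo(se.url);
    return -1;
  }
}
{code}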
3. In GeneratorReducer.java, skip all other processing for pages with status
UNSCHEDULED and just persist them back to the webpage store:
{code}
protected void reduce(SelectorEntry key, Iterable<WebPage> values,
    Context context) throws IOException, InterruptedException {
  for (WebPage page : values) {
    if (count >= limit) {
      return;
    }
    if (page.getStatus() == CrawlStatus.STATUS_UNSCHEDULED) {
      // persist the unscheduled page as-is: no generate mark, and it does
      // not count against the per-host/domain or overall generate limits
      writeOutput(context, key.url, page);
      continue;
    }
    if (maxCount > 0) {
      String hostordomain;
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(key.url);
      } else {
        hostordomain = URLUtil.getHost(key.url);
      }
      Integer hostCount = hostCountMap.get(hostordomain);
      if (hostCount == null) {
        hostCountMap.put(hostordomain, 0);
        hostCount = 0;
      }
      if (hostCount >= maxCount) {
        return;
      }
      hostCountMap.put(hostordomain, hostCount + 1);
    }
    Mark.GENERATE_MARK.putMark(page, batchId);
    if (!writeOutput(context, key.url, page)) {
      context.getCounter("Generator", "MALFORMED_URL").increment(1);
      continue;
    }
    context.getCounter("Generator", "GENERATE_MARK").increment(1);
    count++;
  }
}
{code}
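The snippet relies on the existing writeOutput helper. For context, here is a
hedged sketch of what such a helper could look like (hypothetical; the real
GeneratorReducer method may differ), using TableUtil.reverseUrl to build the
store key:
{code}
// Hypothetical sketch of the writeOutput helper used above; the actual
// method may differ. It reverses the URL into the store key and signals
// malformed URLs through its return value.
private boolean writeOutput(Context context, String url, WebPage page)
    throws IOException, InterruptedException {
  try {
    String reversedUrl = TableUtil.reverseUrl(url); // store key format
    context.write(reversedUrl, page);
    return true;
  } catch (MalformedURLException e) {
    return false; // caller increments the MALFORMED_URL counter
  }
}
{code}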
4. In DbUpdateReducer.java, do not call setFetchSchedule if the status is
UNSCHEDULED; instead apply the regular forceRefetch when the page's fetch
interval exceeds the maximum:
{code}
protected void reduce(UrlWithScore key, Iterable<NutchWritable> values,
    Context context) throws IOException, InterruptedException {
  // ...
  byte status = (byte) page.getStatus();
  switch (status) {
  case CrawlStatus.STATUS_UNSCHEDULED:  // not scheduled for generate
                                        // because fetchTime > curTime
    if (maxInterval < page.getFetchInterval())
      schedule.forceRefetch(url, page, false);
    break;
  case CrawlStatus.STATUS_FETCHED:      // successful fetch
  case CrawlStatus.STATUS_REDIR_TEMP:   // successful fetch, redirected
  case CrawlStatus.STATUS_REDIR_PERM:
  case CrawlStatus.STATUS_NOTMODIFIED:  // successful fetch, not modified
    int modified = FetchSchedule.STATUS_UNKNOWN;
    if (status == CrawlStatus.STATUS_NOTMODIFIED) {
      modified = FetchSchedule.STATUS_NOTMODIFIED;
      // ...
}
{code}
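For reference, the forceRefetch call is expected to reset the page so a later
GeneratorJob run can schedule it again. A hedged sketch, modeled on
AbstractFetchSchedule.forceRefetch (the actual implementation may differ):
{code}
// Hedged sketch, modeled on AbstractFetchSchedule.forceRefetch; the real
// code may differ. It clamps an oversized fetch interval and resets the
// page status so the next GeneratorJob run can pick the page up again.
public void forceRefetch(String url, WebPage page, boolean asap) {
  if (page.getFetchInterval() > maxInterval) {
    // shrink the interval so it fits back under the configured maximum
    page.setFetchInterval(Math.round(maxInterval * 0.9f));
  }
  page.setStatus((int) CrawlStatus.STATUS_UNFETCHED);
  page.setRetriesSinceFetch(0);
  if (asap) {
    // refetch as soon as possible; the patch above passes asap=false
    page.setFetchTime(System.currentTimeMillis());
  }
}
{code}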
> Nutch2 Refactor the update process so that fetched items are only processed once
> --------------------------------------------------------------------------------
>
>          Key: NUTCH-1457
>          URL: https://issues.apache.org/jira/browse/NUTCH-1457
>      Project: Nutch
>   Issue Type: Improvement
>     Reporter: Ferdy Galema
>      Fix For: 2.4