[
https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294825#comment-14294825
]
Michiel commented on NUTCH-1922:
--------------------------------
I can confirm this issue - it appears to be a rather serious behavioural change
between 2.2.1 and 2.3. The batchId parameter makes updatedb unaware of all the
previously fetched pages present in the datastore. On relatively small-scale
crawls the effect is disastrous: pages are re-fetched every few rounds.
Various earlier tickets, such as NUTCH-1556 and NUTCH-1679, suggest using -all
to prevent this behaviour. However, this does NOT seem to change anything. I
ran a few test rounds, and pages are still re-fetched well before their
nextFetchTime. As in Gerhard's example in the description, the fetch status of
a page is overwritten as soon as a link to it is found in another batch. The
-all parameter does not appear to make updatedb aware of all the pages already
present in the datastore. This is rather curious: the changes from NUTCH-1556
were only supposed to add support for the batchId parameter, but they seem to
have made updatedb ignorant of previously fetched pages regardless of whether
batchId or -all is used.
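To make the pattern concrete, here is a minimal, self-contained sketch of the
behaviour described above. The class and constant names (PageRow, STATUS_*,
reduce) are made up for illustration and are not the actual Nutch 2.3 classes:
{code}
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of the batch-scoped updatedb behaviour described above.
 * PageRow and the STATUS_* constants are illustrative only, not the real
 * Nutch 2.3 API.
 */
public class BatchScopedUpdateSketch {

  static final int STATUS_UNFETCHED = 1;
  static final int STATUS_FETCHED = 2;

  static class PageRow {
    final int status;
    final long nextFetchTime;
    PageRow(int status, long nextFetchTime) {
      this.status = status;
      this.nextFetchTime = nextFetchTime;
    }
  }

  /**
   * Stand-in for the reduce step: rowSeenInBatch is null whenever the URL's
   * row was skipped because its batchId did not match the current batch.
   * The key is then treated as a brand-new page and (re)written as unfetched.
   */
  static PageRow reduce(String url, PageRow rowSeenInBatch) {
    if (rowSeenInBatch == null) {
      return new PageRow(STATUS_UNFETCHED, 0L); // assumed to be a new page
    }
    return rowSeenInBatch;
  }

  public static void main(String[] args) {
    Map<String, PageRow> store = new HashMap<>();
    // After batch 1: a.html is fetched and not due again for a week.
    String a = "http://www.l3s.de/~gossen/nutch/a.html";
    store.put(a, new PageRow(STATUS_FETCHED,
        System.currentTimeMillis() + 7L * 24 * 60 * 60 * 1000));

    // Batch 2 fetches b.html and discovers a link to a.html. a.html's row
    // carries batch 1's batchId, so the reducer sees only the inlink and no
    // existing row, and rebuilds the page as unfetched.
    store.put(a, reduce(a, null));

    System.out.println("a.html after batch 2: status=" + store.get(a).status
        + " (1 = unfetched, so it gets re-fetched next round)");
  }
}
{code}
Running this prints the unfetched status for a.html after batch 2, which is
exactly the re-fetch trigger described above.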
> DbUpdater overwrites fetch status for URLs from previous batches, causes
> repeated re-fetches
> --------------------------------------------------------------------------------------------
>
> Key: NUTCH-1922
> URL: https://issues.apache.org/jira/browse/NUTCH-1922
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.3
> Reporter: Gerhard Gossen
>
> When Nutch 2 finds a link to a URL that was crawled in a previous batch, it
> resets the fetch status of that URL to {{unfetched}}. This makes the URL
> available for a re-fetch, even if its crawl interval is not yet over.
> To reproduce, using version 2.3:
> {code}
> # configure Nutch first (e.g. http.agent.name in conf/nutch-site.xml), then:
> ant runtime
> cd runtime/local
> mkdir seeds
> echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
> bin/crawl seeds test 2
> {code}
> This uses two files {{a.html}} and {{b.html}} that link to each other.
> In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}.
> In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}.
> This should update the score and link fields of {{a.html}}, but not the fetch
> status. However, when I run {{bin/nutch readdb -crawlId test -url
> http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns
> {{status: 1 (status_unfetched)}}.
> The expected output would be {{status: 2 (status_fetched)}}.
> The reason seems to be that DbUpdateReducer assumes that [links to a URL not
> processed in the same batch always belong to new
> pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
> Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate
> job, but that change skipped all pages with a different batch ID, so I assume
> that this is what introduced the behavior.
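> As a rough illustration of the direction a fix could take (again with
> made-up class names, not the real Nutch API), the reducer would need to
> consult the backing store before treating such a key as a new page:
> {code}
> import java.util.Map;
>
> /**
>  * Sketch of the missing lookup: when a key was not part of the current
>  * batch, check the datastore before assuming it is a new page. PageRow and
>  * STATUS_UNFETCHED are illustrative only, not the actual Nutch 2.3 API.
>  */
> public class StoreAwareUpdateSketch {
>
>   static final int STATUS_UNFETCHED = 1;
>
>   static class PageRow {
>     int status;
>     long nextFetchTime;
>     PageRow(int status, long nextFetchTime) {
>       this.status = status;
>       this.nextFetchTime = nextFetchTime;
>     }
>   }
>
>   static PageRow reduce(String url, PageRow rowSeenInBatch,
>                         Map<String, PageRow> store) {
>     if (rowSeenInBatch != null) {
>       return rowSeenInBatch;            // row was processed in this batch
>     }
>     PageRow stored = store.get(url);
>     if (stored != null) {
>       // known page: keep status and nextFetchTime; only scores/links change
>       return stored;
>     }
>     return new PageRow(STATUS_UNFETCHED, 0L); // genuinely new page
>   }
>
>   public static void main(String[] args) {
>     Map<String, PageRow> store = new java.util.HashMap<>();
>     store.put("http://www.l3s.de/~gossen/nutch/a.html",
>         new PageRow(2, Long.MAX_VALUE));
>     PageRow r = reduce("http://www.l3s.de/~gossen/nutch/a.html", null, store);
>     System.out.println("status stays " + r.status + ", no spurious re-fetch");
>   }
> }
> {code}
> This is only meant to show where the missing lookup would sit; an actual
> change would have to go through the store API used by DbUpdateReducer.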