[ 
https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293144#comment-14293144
 ] 

Gerhard Gossen commented on NUTCH-1922:
---------------------------------------

[~lewismc]: Unfortunately not, I just ran into this behavior when running some 
longer crawls (100-200 batches) and noticed that the number of pages with a 
given batch ID dropped from 1000 at the beginning to sometimes only around 50 
pages.

There are at least three approaches that may work, but I currently don't have 
the time to test them:
# Change the mapper to run over the entire table, but skip the processing in 
the mapper for already handled pages. This would however increase the amount of 
data that is shuffled around.
# Write out the result of the link and score aggregation to a temporary 
location and join this with the existing data. This requires an additional 
map/reduce job, but would touch fewer rows of the table.
# In the reducer, look up pages in the store that are currently considered new. 
This is easiest to implement, but the performance will probably be pretty bad.

For now I will try the last approach (sketched below), but I was hoping that 
someone more familiar with the code might have a better idea.
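
Roughly, I imagine something like the following (untested sketch; it assumes 
the reducer can be handed a Gora {{DataStore<String, WebPage>}}, here called 
{{store}}, and the class and method names are illustrative only):

{code}
// Untested sketch of option 3: before treating a key that has no row from
// the current batch as brand new, look it up in the storage backend first.
import org.apache.gora.store.DataStore;
import org.apache.nutch.crawl.CrawlStatus;
import org.apache.nutch.storage.WebPage;

public class NewRowResolver {

  private final DataStore<String, WebPage> store;

  public NewRowResolver(DataStore<String, WebPage> store) {
    this.store = store;
  }

  /**
   * Returns the row to update for a URL that was not part of the current
   * batch: the stored row if one exists (so its fetch status and schedule
   * are preserved and only links/score get merged), otherwise a fresh
   * unfetched row. This costs one random read per candidate key, which is
   * where the performance concern comes from.
   */
  public WebPage resolve(String key) {
    WebPage existing = store.get(key);
    if (existing != null) {
      return existing;
    }
    WebPage fresh = WebPage.newBuilder().build();
    fresh.setStatus((int) CrawlStatus.STATUS_UNFETCHED);
    return fresh;
  }
}
{code}

The reducer would then merge the aggregated links and score into whatever row 
comes back, instead of always starting from an unfetched page.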

> DbUpdater overwrites fetch status for URLs from previous batches, causes 
> repeated re-fetches
> --------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1922
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1922
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Gerhard Gossen
>
> When Nutch 2 finds a link to a URL that was crawled in a previous batch, it 
> resets the fetch status of that URL to {{unfetched}}. This makes this URL 
> available for a re-fetch, even if its crawl interval is not yet over.
> To reproduce, using version 2.3:
> {code}
> # Nutch configuration
> ant runtime
> cd runtime/local
> mkdir seeds
> echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
> bin/crawl seeds test 2
> {code}
> This uses two files {{a.html}} and {{b.html}} that link to each other.
> In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. 
> In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. 
> This should update the score and link fields of {{a.html}}, but not the fetch 
> status. However, when I run {{bin/nutch readdb -crawlId test -url 
> http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns 
> {{status: 1 (status_unfetched)}}.
> Expected would be {{status: 2 (status_fetched)}}.
> The reason seems to be that DbUpdateReducer assumes that [links to a URL not 
> processed in the same batch always belong to new 
> pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
>  Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate 
> job, but that change skipped all pages with a different batch ID, so I assume 
> that this change introduced the behavior.
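> Paraphrasing those lines (simplified, not the exact source), a key whose row 
> was not part of the current batch is always handled as if it were new:
> {code}
> // Simplified gist of the linked DbUpdateReducer code (paraphrased):
> // "page == null" only means "no row emitted for the current batch",
> // so a row fetched in an earlier batch is rebuilt from scratch as well.
> if (page == null) {
>   page = WebPage.newBuilder().build();
>   schedule.initializeSchedule(url, page);
>   page.setStatus((int) CrawlStatus.STATUS_UNFETCHED); // clobbers status_fetched
> }
> {code}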



