Gerhard Gossen created NUTCH-1922:
-------------------------------------

             Summary: DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
                 Key: NUTCH-1922
                 URL: https://issues.apache.org/jira/browse/NUTCH-1922
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.3
            Reporter: Gerhard Gossen


When Nutch 2 finds a link to a URL that was already crawled in a previous batch, it 
resets the fetch status of that URL to {{unfetched}}. This makes the URL eligible 
for re-fetching even though its crawl interval has not yet elapsed.

To reproduce, using version 2.3:
{code}
# Nutch configuration
ant runtime
cd runtime/local
# seed list containing a single URL
mkdir seeds
echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
# crawl for two batches with crawl ID "test"
bin/crawl seeds test 2
{code}

This uses two files {{a.html}} and {{b.html}} that link to each other.
In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. In 
batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. This 
should update the score and link fields of {{a.html}}, but not the fetch 
status. However, when I run {{bin/nutch readdb -crawlId test -url 
http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns {{status: 
1 (status_unfetched)}}.

The expected result would be {{status: 2 (status_fetched)}}.

The reason seems to be that DbUpdateReducer assumes that [links to a URL not 
processed in the same batch always belong to new 
pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
 Before NUTCH-1556, all pages in the crawl DB were processed by the DbUpdate job, 
but that change made the job skip all pages with a different batch ID, so I assume 
this is what introduced the behavior.
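
If that reading is right, the pattern looks roughly like the following simplified 
sketch (stand-in class names, fields and constants, *not* the actual Nutch 2.3 code): 
the reducer treats "no existing row for this key" as "new URL" and re-initializes 
schedule and status, but since NUTCH-1556 the mapper also withholds existing rows 
whose batch ID differs from the current one, so their status gets reset on write-back.

{code:java}
import java.util.List;

// Simplified, hypothetical illustration of the pattern described above;
// names are stand-ins, not the real DbUpdateReducer code.
public class DbUpdatePatternSketch {

  // numeric values match the readdb output shown above
  static final byte STATUS_UNFETCHED = 0x01;
  static final byte STATUS_FETCHED = 0x02;

  // minimal stand-in for the stored WebPage row
  static class WebPage {
    byte status;
    float score;
  }

  /**
   * "existing" is the row the mapper emitted for this key, or null if it
   * emitted none. Since NUTCH-1556 the mapper skips rows whose batch ID
   * differs from the current batch, so null no longer implies "new URL".
   */
  static WebPage reduce(String url, WebPage existing, List<String> inlinks) {
    WebPage page = existing;
    if (page == null) {
      // treated as a brand-new page: schedule and status are re-initialized,
      // which overwrites an already-fetched row when it is written back
      page = new WebPage();
      page.status = STATUS_UNFETCHED;
      page.score = 0.0f;
    } else {
      // row from the current batch: status is updated from the fetch result
      // instead of being reset (omitted here)
    }
    // ... score and inlink updates based on "inlinks" (omitted) ...
    return page;
  }

  public static void main(String[] args) {
    // a.html was fetched in batch 1, but in batch 2 the mapper withholds its
    // row, so the reducer only sees the inlink from b.html and existing is null
    WebPage a = reduce("http://www.l3s.de/~gossen/nutch/a.html", null,
        List.of("http://www.l3s.de/~gossen/nutch/b.html"));
    System.out.println(a.status == STATUS_UNFETCHED); // true: status was reset
  }
}
{code}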



