[
https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294837#comment-14294837
]
Lewis John McGibbney commented on NUTCH-1922:
---------------------------------------------
Hi folks, please check out NUTCH-1679. I was aware that this was a serious
issue, but in terms of getting a release out I decided (with consensus
from the community of VOTE'ers) that we should push Nutch 2.3 anyway.
It appears that this is a more serious bug than initially envisaged.
If you [~Michiel] and [~gerhard.gossen] could possibly try out the patch on
NUTCH-1679, then we _may_ be closer to obtaining a solution. I am also very keen
to potentially push a bug-fix 2.3.1 release to resolve this issue, as we've
already pushed an upgrade from JDK 1.6 --> 1.7 support. Please let me know what
you guys think. I will also try to chime in, but right now I am also trying to
put time into pushing Gora 0.6 so that we can get Docker containers for the next
Nutch 2.X release. Thanks for any comments folks, it is very much appreciated.
> DbUpdater overwrites fetch status for URLs from previous batches, causes
> repeated re-fetches
> --------------------------------------------------------------------------------------------
>
> Key: NUTCH-1922
> URL: https://issues.apache.org/jira/browse/NUTCH-1922
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.3
> Reporter: Gerhard Gossen
>
> When Nutch 2 finds a link to a URL that was crawled in a previous batch, it
> resets the fetch status of that URL to {{unfetched}}. This makes this URL
> available for a re-fetch, even if its crawl interval is not yet over.
> To reproduce, using version 2.3:
> {code}
> # Nutch configuration
> ant runtime
> cd runtime/local
> mkdir seeds
> echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
> bin/crawl seeds test 2
> {code}
> This uses two files {{a.html}} and {{b.html}} that link to each other.
> In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}.
> In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}.
> This should update the score and link fields of {{a.html}}, but not the fetch
> status. However, when I run {{bin/nutch readdb -crawlId test -url
> http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns
> {{status: 1 (status_unfetched)}}.
> Expected would be {{status: 2 (status_fetched)}}.
> The reason seems to be that DbUpdateReducer assumes that [links to a URL not
> processed in the same batch always belong to new
> pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
> Before NUTCH-1556, all pages in the crawl DB were processed by the DbUpdate
> job, but that change made the job skip all pages with a different batch ID,
> so I assume that change introduced this behavior.
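The flawed assumption described above can be sketched as follows. This is a minimal, self-contained model of the behavior, not the real Nutch API: the class, method, and field names (`BatchUpdateSketch`, `updateInlink`, the status constants) are all illustrative. The point is that once rows from other batches are filtered out (per NUTCH-1556), an inlink to a previously fetched URL is indistinguishable from a link to a brand-new page, so its stored status gets clobbered with "unfetched".

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model (NOT the actual Nutch code) of the logic around
// DbUpdateReducer.java#L97-L109: an inlinked URL whose row belongs to a
// different batch arrives with no stored state, so it is treated as new.
public class BatchUpdateSketch {
    static final int UNFETCHED = 1; // status_unfetched
    static final int FETCHED = 2;   // status_fetched

    // Simplified crawl DB: url -> fetch status.
    static final Map<String, Integer> db = new HashMap<>();

    // Simplified update step for one inlinked URL. Rows from other
    // batches are skipped, so their stored status is invisible here.
    static void updateInlink(String url, String currentBatch, String rowBatch) {
        Integer status = rowBatch.equals(currentBatch) ? db.get(url) : null;
        if (status == null) {
            // Assumed to be a newly discovered page -- but it may simply
            // belong to an earlier batch; writing UNFETCHED here is what
            // overwrites the real status and triggers the re-fetch.
            db.put(url, UNFETCHED);
        }
    }

    public static void main(String[] args) {
        db.put("a.html", FETCHED);                    // fetched in batch 1
        updateInlink("a.html", "batch-2", "batch-1"); // link seen in batch 2
        System.out.println(db.get("a.html"));         // prints 1, not 2
    }
}
```

A fix along the lines of the NUTCH-1679 patch would presumably need to distinguish "no row in this batch" from "no row at all", e.g. by reading the stored page before resetting its status.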
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)