[
https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903916#comment-13903916
]
Sebastian Nagel commented on NUTCH-1706:
----------------------------------------
Hi [~markus17], point 2 is definitely a problem: in a sample crawl (seed was
{{http://nutch.apache.org/}}) out of 2 fetch_notmodified items one is lost when
indexing (data attached).
{code}
# 1. index only "old" segments
% bin/nutch index -Ddummy.path=index2013.txt crawl/crawldb \
crawl/segments/20131115203640/ \
crawl/segments/20131115203847/ \
-deleteGone
# 2. also include "new" segment containing refetches
% bin/nutch index -Ddummy.path=index2014.txt crawl/crawldb \
crawl/segments/20131115203640/ \
crawl/segments/20131115203847/ \
crawl/segments/20140217140849/ \
-deleteGone
# 3. since the "new" segment contains only "successful" refetches
#    (fetch_success or fetch_notmodified), both indexes should contain
#    exactly the same number of documents. But they do not!
% diff index2013.txt index2014.txt
26d25
< add http://tika.apache.org/
{code}
The second not-modified page ({{http://nutch.apache.org/}}) is indexed. Running
the debugger showed that the order of values in the reduce function differs
between the two pages, even in local mode. We should take this seriously and
check whether we can guarantee that the newest values are always preferred
(similar to SegmentMerger).
Nevertheless, a fetch_notmodified datum should never overwrite any other fetch
datum. The attached patch includes this check again; apart from that, it is
identical to [~markus17]'s patch.
> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
> Key: NUTCH-1706
> URL: https://issues.apache.org/jira/browse/NUTCH-1706
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Blocker
> Fix For: 1.8
>
> Attachments: NUTCH-1706-trunk-v2.patch, NUTCH-1706-trunk.patch,
> nutch-1706-testdata.tgz
>
>
> Code path is wrong in IndexerMapReduce, the delete code should be located
> after all reducer values have been gathered.