[ 
https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903916#comment-13903916
 ] 

Sebastian Nagel commented on NUTCH-1706:
----------------------------------------

Hi [~markus17], point 2 is definitely a problem: in a sample crawl (seed: 
{{http://nutch.apache.org/}}), one of the 2 fetch_notmodified items is lost 
when indexing (data attached).
{code}
# 1. index only "old" segments
% bin/nutch index -Ddummy.path=index2013.txt crawl/crawldb \
    crawl/segments/20131115203640/ \
    crawl/segments/20131115203847/ \
    -deleteGone

# 2. also include "new" segment containing refetches
% bin/nutch index -Ddummy.path=index2014.txt crawl/crawldb \
    crawl/segments/20131115203640/ \
    crawl/segments/20131115203847/ \
    crawl/segments/20140217140849/ \
    -deleteGone

# 3. since the "new" segment contains only "successful" refetches
#    (either fetch_success or fetch_notmodified), both indexes should
#    contain exactly the same number of documents. But they do not!
% diff index2013.txt index2014.txt 
26d25
< add   http://tika.apache.org/
{code}
The second not-modified page ({{http://nutch.apache.org/}}) is indexed. Running 
the debugger showed that the ordering of values in the reduce function differs 
between the two pages, even in local mode. We should take this seriously and 
check whether we can guarantee that the newest values are always preferred 
(similar to SegmentMerger).

Nevertheless, a fetch_notmodified datum should never overwrite any other fetch 
datum. The attached patch includes this check again; apart from that, it is 
identical to [~markus17]'s patch.
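
To illustrate the intended rule, here is a minimal, order-independent sketch of 
how a reducer could pick the fetch datum. It is not the attached patch: 
{{FetchSelection}}, {{Datum}} and {{Status}} are hypothetical, simplified 
stand-ins (the real code works on CrawlDatum), and preferring the larger fetch 
time is only an assumption about how "newest" could be decided.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the selection logic in a reducer.
// None of these types are Nutch classes.
final class FetchSelection {

    enum Status { FETCH_SUCCESS, FETCH_NOTMODIFIED, FETCH_GONE }

    static final class Datum {
        final Status status;
        final long fetchTime;

        Datum(Status status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    // Picks one fetch datum regardless of the order in which values arrive
    // (apart from ties on fetchTime, where the first one seen is kept).
    static Datum select(Iterable<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            if (best == null) {
                best = d;
                continue;
            }
            boolean dNotModified = d.status == Status.FETCH_NOTMODIFIED;
            boolean bestNotModified = best.status == Status.FETCH_NOTMODIFIED;
            if (dNotModified != bestNotModified) {
                // A not-modified datum never displaces another fetch datum,
                // but any other fetch datum displaces a not-modified one.
                if (bestNotModified) {
                    best = d;
                }
            } else if (d.fetchTime > best.fetchTime) {
                // Same kind of datum: prefer the newer one.
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Datum success = new Datum(Status.FETCH_SUCCESS, 2013L);
        Datum notModified = new Datum(Status.FETCH_NOTMODIFIED, 2014L);

        List<Datum> oneOrder = Arrays.asList(success, notModified);
        List<Datum> otherOrder = Arrays.asList(notModified, success);

        // Both orderings print FETCH_SUCCESS.
        System.out.println(select(oneOrder).status);
        System.out.println(select(otherOrder).status);
    }
}
```

With a rule like this, the result no longer depends on whether the 
fetch_notmodified value happens to come before or after the older 
fetch_success value in the reduce call.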

> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
>                 Key: NUTCH-1706
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1706
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Blocker
>             Fix For: 1.8
>
>         Attachments: NUTCH-1706-trunk-v2.patch, NUTCH-1706-trunk.patch, 
> nutch-1706-testdata.tgz
>
>
> The code path in IndexerMapReduce is wrong: the delete code should be located 
> after all reducer values have been gathered.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
