[jira] [Commented] (NUTCH-1625) IndexerMapReduce skips FETCH_NOTMODIFIED

Markus Jelsma (JIRA) Thu, 25 Jun 2015 04:08:22 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601006#comment-14601006 ]


Markus Jelsma commented on NUTCH-1625:
--------------------------------------

Hello Sebastian, we used this patch in Nutch 1.6 when we reindexed many old 
segments that contained duplicates. At the time we had trouble reindexing those 
segments: some entries didn't make it into the index. The patch fixed that issue. 
I looked at the current code again, and I think the problem is still there 
when reindexing many segments that contain duplicates.

> IndexerMapReduce skips FETCH_NOTMODIFIED
> ----------------------------------------
>
>                 Key: NUTCH-1625
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1625
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.11
>
>         Attachments: NUTCH-1625.patch, NUTCH-1625.patch
>
>
> IndexerMapReduce has an option to skip DB_NOTMODIFIED, but legacy code also 
> skips FETCH_NOTMODIFIED, and that skip is not optional. We can keep the check, 
> but it should then also cover FETCH_NOTMODIFIED. Relying on FETCH_NOTMODIFIED 
> isn't very useful anyway, because since 1.5 or so we can safely rely on 
> DB_NOTMODIFIED, as it is properly set in the CrawlDbReducer.
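
To make the description above concrete, here is a minimal Java sketch of the
two behaviours. It is not the Nutch source: the class, enum, method names and
the boolean flag are hypothetical stand-ins, and proposedShouldSkip is only one
reading of the suggestion (both statuses governed by the same option).

```java
// Hypothetical, simplified model of the skip decision; not the actual
// IndexerMapReduce code.
final class SkipNotModifiedSketch {

    // Stand-ins for the CrawlDatum status codes named in the issue.
    enum Status { DB_FETCHED, DB_NOTMODIFIED, FETCH_SUCCESS, FETCH_NOTMODIFIED }

    // Behaviour as described for the legacy code: FETCH_NOTMODIFIED is always
    // skipped, while DB_NOTMODIFIED is skipped only when the option is on.
    static boolean legacyShouldSkip(Status dbStatus, Status fetchStatus,
                                    boolean skipNotModified) {
        if (fetchStatus == Status.FETCH_NOTMODIFIED) {
            return true; // unconditional skip, regardless of the option
        }
        return skipNotModified && dbStatus == Status.DB_NOTMODIFIED;
    }

    // One reading of the proposal: keep the check, but let the same option
    // decide for both DB_NOTMODIFIED and FETCH_NOTMODIFIED.
    static boolean proposedShouldSkip(Status dbStatus, Status fetchStatus,
                                      boolean skipNotModified) {
        if (!skipNotModified) {
            return false; // option off: not-modified records are indexed too
        }
        return dbStatus == Status.DB_NOTMODIFIED
            || fetchStatus == Status.FETCH_NOTMODIFIED;
    }
}
```

With the option off, a record whose fetch status is FETCH_NOTMODIFIED is
dropped by legacyShouldSkip but kept by proposedShouldSkip, which is the
difference the issue is about.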



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
