[jira] [Commented] (NUTCH-1646) IndexerMapReduce to consider DB status

Sebastian Nagel (JIRA) Fri, 11 Oct 2013 01:38:20 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792465#comment-13792465 ]


Sebastian Nagel commented on NUTCH-1646:
----------------------------------------

Hi Markus,
the patch silently assumes that the db datum comes before the fetch datum while 
looping over the values. We know that Hadoop does not guarantee a stable order 
of the values passed to the reduce method, cf. NUTCH-1616. To avoid random 
faults and NPEs, it may be better to place the check after the loop, or to do 
it separately for the db datum and the fetch datum.
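
A minimal sketch of the "check after the loop" idea, in plain Java rather than against the real IndexerMapReduce code. The names Datum, Source, Status and shouldIndex are hypothetical stand-ins, not Nutch classes or constants, and skipping a record when either datum is missing is an assumption made only for this illustration. The loop only collects the two datums; every decision is taken afterwards, so the result cannot depend on the order in which the values arrive.

```java
import java.util.Arrays;
import java.util.List;

class ReduceOrderSketch {
    // Hypothetical stand-ins, not Nutch's CrawlDatum or its status constants.
    enum Source { DB, FETCH }
    enum Status { OK, GONE, REDIRECT }

    static final class Datum {
        final Source source;
        final Status status;

        Datum(Source source, Status status) {
            this.source = source;
            this.status = status;
        }
    }

    static boolean isGoneOrRedirect(Datum d) {
        return d.status == Status.GONE || d.status == Status.REDIRECT;
    }

    /** Decides whether to index a record, independent of the order of values. */
    static boolean shouldIndex(Iterable<Datum> values) {
        Datum dbDatum = null;
        Datum fetchDatum = null;

        // Step 1: only collect. No status check here, because the db datum
        // may arrive before or after the fetch datum.
        for (Datum d : values) {
            if (d.source == Source.DB) {
                dbDatum = d;
            } else {
                fetchDatum = d;
            }
        }

        // Step 2: decide after the loop, with explicit null guards.
        // Assumption for this sketch: a record missing either datum is skipped.
        if (dbDatum == null || fetchDatum == null) {
            return false;
        }
        // Each datum is checked separately; either one can veto indexing.
        return !isGoneOrRedirect(dbDatum) && !isGoneOrRedirect(fetchDatum);
    }

    public static void main(String[] args) {
        Datum db = new Datum(Source.DB, Status.GONE);
        Datum fetch = new Datum(Source.FETCH, Status.OK);
        List<Datum> dbFirst = Arrays.asList(db, fetch);
        List<Datum> fetchFirst = Arrays.asList(fetch, db);
        // Both orders give the same decision.
        System.out.println(shouldIndex(dbFirst) == shouldIndex(fetchFirst));
    }
}
```

With this shape, a stale fetch status from an old segment cannot override a gone or redirect status in the db datum, whichever of the two the reducer happens to see first.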

> IndexerMapReduce to consider DB status
> --------------------------------------
>
>                 Key: NUTCH-1646
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1646
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.8
>
>         Attachments: NUTCH-1646-trunk.patch
>
>
> IndexerMapReduce does not remove gone pages and redirects via DB status, only 
> via fetch status. This means segments merged before we fixed SegmentMerger may 
> contain records that do not have a correct status. For example, some pages are 
> gone on the web, gone in the CrawlDB, gone in the segments. But merging those 
> old segments could cause an older status to prevail, causing the page to be 
> indexed although the CrawlDB says it's gone.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
