[ https://issues.apache.org/jira/browse/NUTCH-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792465#comment-13792465 ]
Sebastian Nagel commented on NUTCH-1646:
----------------------------------------
Hi Markus,
the patch silently assumes that the db datum comes before the fetch datum while
looping over the values. We know that Hadoop does not guarantee a stable order
of the values in the reduce method, cf. NUTCH-1616. To avoid random faults and
NPEs, it may be better to place the check after the loop, or to do it
separately for the db datum and the fetch datum.
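
A minimal sketch of the "check after the loop" idea, in plain Java. The types
here (Source, Status, Datum) and the method shouldIndex are hypothetical
stand-ins for illustration only; they are not the real CrawlDatum class or the
actual IndexerMapReduce.reduce signature, and the patch itself may look quite
different. The point is only that the loop collects both datums first, and the
status checks run afterwards, so the result does not depend on the order in
which the values arrive.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types; NOT the real Nutch classes.
class ReduceOrderSketch {

  enum Source { DB, FETCH }
  enum Status { FETCHED, GONE, REDIRECT }

  static final class Datum {
    final Source source;
    final Status status;

    Datum(Source source, Status status) {
      this.source = source;
      this.status = status;
    }
  }

  static boolean isGoneOrRedirect(Datum d) {
    return d.status == Status.GONE || d.status == Status.REDIRECT;
  }

  // Decides whether a record should be indexed, independent of value order.
  static boolean shouldIndex(Iterable<Datum> values) {
    Datum dbDatum = null;
    Datum fetchDatum = null;

    // First pass: only collect. No decision is made inside the loop,
    // because either datum may show up first (or not at all).
    for (Datum d : values) {
      if (d.source == Source.DB) {
        dbDatum = d;
      } else {
        fetchDatum = d;
      }
    }

    // Checks after the loop, done separately and null-safe for each datum.
    if (fetchDatum == null || isGoneOrRedirect(fetchDatum)) {
      return false;
    }
    if (dbDatum != null && isGoneOrRedirect(dbDatum)) {
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    Datum db = new Datum(Source.DB, Status.GONE);
    Datum fetch = new Datum(Source.FETCH, Status.FETCHED);

    // Same two values, both possible orders.
    List<Datum> dbFirst = Arrays.asList(db, fetch);
    List<Datum> fetchFirst = Arrays.asList(fetch, db);

    System.out.println(shouldIndex(dbFirst));
    System.out.println(shouldIndex(fetchFirst));
  }
}
```

With the db datum marked gone, both orderings give the same answer (not
indexed), and a missing db datum does not trigger an NPE.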
> IndexerMapReduce to consider DB status
> --------------------------------------
>
> Key: NUTCH-1646
> URL: https://issues.apache.org/jira/browse/NUTCH-1646
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.8
>
> Attachments: NUTCH-1646-trunk.patch
>
>
> IndexerMapReduce does not remove gone pages and redirects based on DB status,
> only on fetch status. This means segments merged before we fixed SegmentMerger
> may contain records that do not have a correct status. For example, some pages
> are gone on the web, gone in the CrawlDB, and gone in the segments. But merging
> those old segments could cause an older status to prevail, causing the page to
> be indexed although the CrawlDB says it's gone.