[ https://issues.apache.org/jira/browse/NUTCH-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1646:
---------------------------------
    Description: IndexerMapReduce does not remove gone pages and redirects based on their DB status, only on their fetch status. This means that segments merged before we fixed SegmentMerger may contain records that do not have a correct status. For example, some pages are gone on the web, gone in the CrawlDB, and gone in the segments. But merging those old segments could cause an older status to prevail, so the page is indexed although the CrawlDB says it's gone.

> IndexerMapReduce to consider DB status
> --------------------------------------
>
> Key: NUTCH-1646
> URL: https://issues.apache.org/jira/browse/NUTCH-1646
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.8
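
To illustrate the difference described above, here is a minimal, self-contained sketch of a deletion decision that looks only at the fetch status versus one that also looks at the DB status. It is not the patch for this issue and does not use Nutch's classes: the Status enum, the class name, and both method names are hypothetical stand-ins chosen for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

public class IndexDecisionSketch {

    /** Hypothetical stand-ins for status codes; these are not Nutch's own constants. */
    public enum Status {
        FETCH_SUCCESS, FETCH_GONE, FETCH_REDIR_TEMP, FETCH_REDIR_PERM,
        DB_FETCHED, DB_GONE, DB_REDIR_TEMP, DB_REDIR_PERM
    }

    // Fetch statuses that mean the page should not be in the index.
    private static final Set<Status> FETCH_REMOVABLE = EnumSet.of(
            Status.FETCH_GONE, Status.FETCH_REDIR_TEMP, Status.FETCH_REDIR_PERM);

    // DB statuses that mean the page should not be in the index.
    private static final Set<Status> DB_REMOVABLE = EnumSet.of(
            Status.DB_GONE, Status.DB_REDIR_TEMP, Status.DB_REDIR_PERM);

    /** The reported behavior: only the fetch status is consulted; dbStatus is ignored. */
    public static boolean shouldDeleteFetchOnly(Status fetchStatus, Status dbStatus) {
        return fetchStatus != null && FETCH_REMOVABLE.contains(fetchStatus);
    }

    /** The requested behavior: either status can mark the page for removal. */
    public static boolean shouldDelete(Status fetchStatus, Status dbStatus) {
        boolean goneByFetch = fetchStatus != null && FETCH_REMOVABLE.contains(fetchStatus);
        boolean goneByDb = dbStatus != null && DB_REMOVABLE.contains(dbStatus);
        return goneByFetch || goneByDb;
    }

    public static void main(String[] args) {
        // An old merged segment still carries a successful fetch status,
        // while the CrawlDB already marks the page as gone.
        Status staleFetch = Status.FETCH_SUCCESS;
        Status db = Status.DB_GONE;

        System.out.println("fetch-only check deletes: " + shouldDeleteFetchOnly(staleFetch, db));
        System.out.println("fetch+DB check deletes:   " + shouldDelete(staleFetch, db));
    }
}
```

With a stale successful fetch status and a gone DB status, the fetch-only check leaves the page in the index, while the combined check removes it, which is the case the description is concerned with.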