[
https://issues.apache.org/jira/browse/NUTCH-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718269#comment-13718269
]
Sebastian Nagel commented on NUTCH-1616:
----------------------------------------
Hi Markus,
IndexerMapReduce.reduce() assumes that the most recent fetch datum comes last;
older ones are then overwritten. Because segments are added to the indexer in
chronological order (lexical order by date string), we rely on stable sorting by
map-reduce. While CrawlDatum.compareTo() sorts primarily by score,
SegmentMerger.reduce() explicitly sorts by segment name. Afaik, Hadoop does not
guarantee stable sorting. Maybe older Hadoop versions passed the values in the
order they were read in from the segments (and did not further sort CrawlDatum
values). Other values (ParseData, etc.) used by the indexer may not be affected
because they do not implement WritableComparable.
That's just an assumption. Verification is not simple: either we must rely on
weak hints (e.g., fetch time) or add the segment name to CrawlDatum's metadata.
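The ordering fix hinted at above can be sketched as follows. This is not the actual Nutch code: the `Datum` class and `mostRecent()` method are hypothetical stand-ins for a CrawlDatum tagged with its segment name, used only to illustrate the idea of sorting explicitly by segment name (lexical order equals chronological order for date-based names) instead of relying on Hadoop's unguaranteed stable sort:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SegmentOrderSketch {

    // Hypothetical stand-in for a fetch datum plus the segment it came from.
    static final class Datum {
        final String segment; // segment name, e.g. "20130715000000"
        final String status;  // e.g. "fetch_success", "linked"
        Datum(String segment, String status) {
            this.segment = segment;
            this.status = status;
        }
    }

    // Pick the datum from the most recent segment: sort by segment name and
    // keep the last element, so "most recent wins" holds regardless of the
    // order in which the reducer receives the values.
    static Datum mostRecent(List<Datum> values) {
        List<Datum> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparing(d -> d.segment));
        return sorted.get(sorted.size() - 1);
    }

    public static void main(String[] args) {
        // Values arrive in arbitrary order; the newer segment still wins.
        List<Datum> values = Arrays.asList(
            new Datum("20130715000000", "fetch_success"),
            new Datum("20130601000000", "fetch_success"));
        System.out.println(mostRecent(values).segment);
    }
}
```

Storing the segment name in the datum (as suggested for CrawlDatum's metadata) is what makes such an explicit sort possible in the reducer.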
> SegmentMerger missing proper crawl_fetch datum
> ----------------------------------------------
>
> Key: NUTCH-1616
> URL: https://issues.apache.org/jira/browse/NUTCH-1616
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Critical
> Fix For: 1.8
>
>
> Merged 26036 vs. unmerged 26038 indexed documents! There are two records on
> the merged segment that no longer have a crawl_fetch CrawlDatum with a
> fetch_success status. Instead, the only crawl_fetch CrawlDatum has status
> linked!
> The original segment has two crawl_fetch CrawlDatums, with the linked and the
> fetch_success status.
> Without the fetch_success or not_modified status it is not going to be
> indexed.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira