[ https://issues.apache.org/jira/browse/NUTCH-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560076#action_12560076 ]
Andrzej Bialecki commented on NUTCH-95:
----------------------------------------
See NUTCH-371.
> DeleteDuplicates depends on the order of input segments
> -------------------------------------------------------
>
> Key: NUTCH-95
> URL: https://issues.apache.org/jira/browse/NUTCH-95
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 0.6, 0.7, 0.8
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
>
> DeleteDuplicates depends on the order in which the input segments are
> processed, which in turn depends on the order of segment dirs returned from
> NutchFileSystem.listFiles(File). In most cases this is undesired and may lead
> to deleting the wrong records from indexes. The implicit assumption that
> segments at the end of the listing are more recent is not always true.
> Here's the explanation:
> * Dedup first deletes the URL duplicates by computing an MD5 hash for each
> URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx
> is just an int index into the array of open IndexReaders, and if segment dirs
> are moved/copied/renamed then entries in that array may change their order.
> Then, for each group of records with the same hash, Dedup keeps just the
> first entry. Naturally, if segmentIdx changes due to dir renaming, a
> different record will be kept and different ones will be deleted.
> * Then Dedup deletes content duplicates, again by computing a hash of each
> document's content, and then sorting records by (hash, segmentIdx, docIdx).
> However, by now we already have a different set of undeleted docs depending
> on the order of input segments. On top of that, the same factor acts here,
> i.e. segmentIdx changes when you re-shuffle the input segment dirs, so again,
> when entries with identical hashes are compared, the one with the lowest
> (segmentIdx, docIdx) is picked.
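
To make the order dependence concrete, here is a minimal sketch of the
behaviour described in the two points above. It is not the DeleteDuplicates
code: the class SegmentOrderDemo, the Doc type and the dedup() helper are
hypothetical names made up for illustration, and a plain list of segment names
stands in for the listing returned by the file system.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SegmentOrderDemo {
    // Hypothetical stand-in for one indexed document.
    static final class Doc {
        final String segment;  // name of the segment dir the doc lives in
        final String hash;     // URL hash (or content hash)
        final int docIdx;      // position of the doc inside its segment

        Doc(String segment, String hash, int docIdx) {
            this.segment = segment;
            this.hash = hash;
            this.docIdx = docIdx;
        }
    }

    // Sorts by (hash, segmentIdx, docIdx) and keeps the first doc per hash.
    // segmentIdx is simply the position of the segment in the listing.
    static List<Doc> dedup(List<String> listing, List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort((a, b) -> {
            int c = a.hash.compareTo(b.hash);
            if (c != 0) return c;
            c = Integer.compare(listing.indexOf(a.segment),
                                listing.indexOf(b.segment));
            if (c != 0) return c;
            return Integer.compare(a.docIdx, b.docIdx);
        });
        List<Doc> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (Doc d : sorted) {
            if (seen.add(d.hash)) kept.add(d);  // first entry per hash wins
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(new Doc("segA", "h1", 0),
                                 new Doc("segB", "h1", 0));
        // Same documents, different listing order, different survivor.
        System.out.println(dedup(List.of("segA", "segB"), docs).get(0).segment); // segA
        System.out.println(dedup(List.of("segB", "segA"), docs).get(0).segment); // segB
    }
}
```

Nothing in the sort key says which copy is newer, so the survivor is decided
purely by where its segment happens to appear in the listing.
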
> Solution: use the fetched date from the first record in each segment to
> determine the order of segments. Alternatively, modify DeleteDuplicates to
> use the newer algorithm from SegmentMergeTool. This algorithm works by
> sorting records using tuples of (urlHash, contentHash, fetchDate, score,
> urlLength). Then:
> 1. If urlHash is the same, keep the doc with the highest fetchDate (the
> latest version, as recorded by Fetcher).
> 2. If contentHash is the same, keep the doc with the highest score, and then
> if the scores are the same, keep the doc with the shortest url.
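
The two rules above could be sketched roughly as follows. This is only an
illustration under stated assumptions, not the SegmentMergeTool code: the
class FetchDateDedupSketch, the Doc type and both helper methods are
hypothetical, the URL string is used directly in place of urlHash, and two
map passes replace the tuple sort to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FetchDateDedupSketch {
    // Hypothetical stand-in for one indexed document.
    static final class Doc {
        final String url;
        final String contentHash;
        final long fetchDate;  // as recorded at fetch time
        final float score;

        Doc(String url, String contentHash, long fetchDate, float score) {
            this.url = url;
            this.contentHash = contentHash;
            this.fetchDate = fetchDate;
            this.score = score;
        }
    }

    // Rule 2 tie-break: higher score wins, then the shorter URL.
    // If both are equal, the doc seen first is kept.
    static Doc better(Doc a, Doc b) {
        if (a.score != b.score) return a.score > b.score ? a : b;
        return b.url.length() < a.url.length() ? b : a;
    }

    static List<Doc> dedup(List<Doc> docs) {
        // Rule 1: for the same URL keep the doc with the highest fetchDate.
        Map<String, Doc> byUrl = new LinkedHashMap<>();
        for (Doc d : docs) {
            byUrl.merge(d.url, d, (a, b) -> b.fetchDate > a.fetchDate ? b : a);
        }
        // Rule 2: for the same content keep the best doc per better().
        Map<String, Doc> byContent = new LinkedHashMap<>();
        for (Doc d : byUrl.values()) {
            byContent.merge(d.contentHash, d, FetchDateDedupSketch::better);
        }
        return new ArrayList<>(byContent.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc("http://a/", "c1", 100L, 1.0f),
            new Doc("http://a/", "c2", 200L, 1.0f),       // newer fetch of the same URL
            new Doc("http://b/long", "c2", 150L, 1.0f),   // same content, longer URL
            new Doc("http://c/", "c3", 50L, 2.0f));
        for (Doc d : dedup(docs)) {
            System.out.println(d.url + " " + d.contentHash + " " + d.fetchDate);
        }
    }
}
```

Because the decision uses only properties stored with each document, the
surviving set no longer depends on the position of a segment in the listing.
Exact ties (same fetchDate, or same score and URL length) still fall back to
input order in this sketch.
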
> An initial fix will be prepared for trunk/ and then backported to the
> release branch.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.