[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments

Andrzej Bialecki (JIRA) Thu, 17 Jan 2008 12:32:56 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560076#action_12560076 ]


Andrzej Bialecki  commented on NUTCH-95:
----------------------------------------

See NUTCH-371.

> DeleteDuplicates depends on the order of input segments
> -------------------------------------------------------
>
>                 Key: NUTCH-95
>                 URL: https://issues.apache.org/jira/browse/NUTCH-95
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.6, 0.7, 0.8
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>
> DeleteDuplicates depends on the order in which the input segments are
> processed, which in turn depends on the order of segment dirs returned from
> NutchFileSystem.listFiles(File). In most cases this is undesired and may lead
> to deleting the wrong records from indexes. The implicit assumption that
> segments at the end of the listing are more recent is not always true.
> Here's the explanation:
> * Dedup first deletes the URL duplicates by computing MD5 hashes for each
> URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx
> is just an int index into the array of open IndexReaders, so if segment dirs
> are moved/copied/renamed, entries in that array may change their order.
> Then, for all records with equal hashes, Dedup keeps just the first entry.
> Naturally, if segmentIdx changes due to dir renaming, a different record
> will be kept and different ones will be deleted...
> * Then Dedup deletes content duplicates, again by computing a hash of each
> document's content, and then sorting records by (hash, segmentIdx, docIdx).
> However, by now we already have a different set of undeleted docs depending
> on the order of input segments. On top of that, the same factor acts here,
> i.e. segmentIdx changes when you re-shuffle the input segment dirs, so again,
> when identical entries are compared, the one with the lowest
> (segmentIdx, docIdx) is picked.
> Solution: use the fetched date from the first record in each segment to 
> determine the order of segments. Alternatively, modify DeleteDuplicates to 
> use the newer algorithm from SegmentMergeTool. This algorithm works by 
> sorting records using tuples of (urlHash, contentHash, fetchDate, score, 
> urlLength). Then:
> 1. If urlHash is the same, keep the doc with the highest fetchDate (the
> latest version, as recorded by Fetcher).
> 2. If contentHash is the same, keep the doc with the highest score, and then 
> if the scores are the same, keep the doc with the shortest url.
> An initial fix will be prepared for trunk/ and then backported to the
> release branch.
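
The two rules quoted above can be sketched roughly as follows. This is only an
illustration, not the SegmentMergeTool or DeleteDuplicates code: the Doc class,
its field names, and the DedupSketch class are hypothetical, two hash-map
passes are used here instead of a sort over the full tuple, and the tie-breaks
for cases the description does not cover (equal fetchDate, equal url length)
are assumptions made only to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DedupSketch {

    /** Hypothetical record; the field names are illustrative, not Nutch's. */
    static final class Doc {
        final String urlHash;
        final String contentHash;
        final String url;
        final long fetchDate;
        final float score;

        Doc(String urlHash, String contentHash, String url, long fetchDate, float score) {
            this.urlHash = urlHash;
            this.contentHash = contentHash;
            this.url = url;
            this.fetchDate = fetchDate;
            this.score = score;
        }
    }

    /** Rule 1: among docs with the same urlHash, prefer the highest fetchDate. */
    static Doc newer(Doc a, Doc b) {
        if (a.fetchDate != b.fetchDate) {
            return a.fetchDate > b.fetchDate ? a : b;
        }
        // Assumption: equal dates are not covered by the description;
        // fall back to contentHash order so the choice does not depend on input order.
        return a.contentHash.compareTo(b.contentHash) <= 0 ? a : b;
    }

    /** Rule 2: among docs with the same contentHash, prefer the highest score, then the shortest url. */
    static Doc preferred(Doc a, Doc b) {
        if (a.score != b.score) {
            return a.score > b.score ? a : b;
        }
        if (a.url.length() != b.url.length()) {
            return a.url.length() < b.url.length() ? a : b;
        }
        // Assumption: lexicographic url order as a last-resort tie-break.
        return a.url.compareTo(b.url) <= 0 ? a : b;
    }

    /** Applies rule 1 to all docs, then rule 2 to the survivors. */
    static List<Doc> dedup(List<Doc> docs) {
        Map<String, Doc> byUrl = new HashMap<>();
        for (Doc d : docs) {
            byUrl.merge(d.urlHash, d, DedupSketch::newer);
        }
        Map<String, Doc> byContent = new HashMap<>();
        for (Doc d : byUrl.values()) {
            byContent.merge(d.contentHash, d, DedupSketch::preferred);
        }
        return new ArrayList<>(byContent.values());
    }

    /** Returns the urls of the kept docs, sorted so results are easy to compare. */
    static List<String> keptUrls(List<Doc> docs) {
        List<String> urls = new ArrayList<>();
        for (Doc d : dedup(docs)) {
            urls.add(d.url);
        }
        Collections.sort(urls);
        return urls;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>(Arrays.asList(
            new Doc("u1", "c1", "http://a.example/page", 100L, 1.0f),
            new Doc("u1", "c2", "http://a.example/page", 200L, 1.0f),
            new Doc("u2", "c2", "http://b.example/p", 150L, 2.0f),
            new Doc("u3", "c3", "http://c.example/long-path", 50L, 1.0f),
            new Doc("u4", "c3", "http://c.example/x", 50L, 1.0f)));
        System.out.println(keptUrls(docs));
        Collections.reverse(docs);          // simulate a different segment order
        System.out.println(keptUrls(docs)); // same kept set for this input
    }
}
```

Because the choice between two candidate docs depends only on their own fields
(fetchDate, score, url) and never on a segment index, reversing the input list
in this example leaves the kept set unchanged, which is the property the
current (hash, segmentIdx, docIdx) ordering lacks.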

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
