[ http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12445619 ]

Andrzej Bialecki commented on NUTCH-95:
----------------------------------------
This issue is fixed in a recent version of DeleteDuplicates - see NUTCH-371.

> DeleteDuplicates depends on the order of input segments
> -------------------------------------------------------
>
>                 Key: NUTCH-95
>                 URL: http://issues.apache.org/jira/browse/NUTCH-95
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.7, 0.8, 0.6
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>
> DeleteDuplicates depends on what order the input segments are processed,
> which in turn depends on the order of segment dirs returned from
> NutchFileSystem.listFiles(File). In most cases this is undesired and may
> lead to deleting the wrong records from indexes. The silent assumption that
> segments at the end of the listing are more recent is not always true.
>
> Here's the explanation:
>
> * Dedup first deletes the URL duplicates by computing MD5 hashes for each
>   URL and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx
>   is just an int index into the array of open IndexReaders - and if segment
>   dirs are moved, copied, or renamed, then entries in that array may change
>   their order. For all equal triples, Dedup keeps just the first entry.
>   Naturally, if segmentIdx changes due to dir renaming, a different record
>   will be kept and different ones will be deleted.
>
> * Dedup then deletes content duplicates, again by computing hashes for each
>   document's content and sorting records by (hash, segmentIdx, docIdx).
>   However, by now we already have a different set of undeleted docs,
>   depending on the order of input segments. On top of that, the same factor
>   acts here: segmentIdx changes when you re-shuffle the input segment dirs,
>   so again, when identical entries are compared, the one with the lowest
>   (segmentIdx, docIdx) is kept.
>
> Solution: use the fetch date from the first record in each segment to
> determine the order of segments. Alternatively, modify DeleteDuplicates to
> use the newer algorithm from SegmentMergeTool.
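The order dependence described above can be illustrated with a minimal sketch. This is not Nutch code; the class and record names are hypothetical, and the records stand in for index entries:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch (hypothetical names, not the actual Nutch classes):
// records are sorted by (hash, segmentIdx, docIdx) and the first record of
// each hash group is kept, so the segment index decides which doc survives.
public class DedupOrderDemo {
    record Rec(String hash, int segmentIdx, int docIdx, String url) {}

    // Keep the first record per hash after sorting by (hash, segmentIdx, docIdx).
    static List<Rec> dedup(List<Rec> recs) {
        List<Rec> sorted = new ArrayList<>(recs);
        sorted.sort(Comparator.comparing(Rec::hash)
                .thenComparingInt(Rec::segmentIdx)
                .thenComparingInt(Rec::docIdx));
        List<Rec> kept = new ArrayList<>();
        String lastHash = null;
        for (Rec r : sorted) {
            if (!r.hash().equals(lastHash)) {
                kept.add(r);
                lastHash = r.hash();
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two duplicate docs (same URL hash); only the segment indexes differ.
        Rec oldDoc = new Rec("h1", 0, 5, "http://example.com/old");
        Rec newDoc = new Rec("h1", 1, 2, "http://example.com/new");
        System.out.println(dedup(List.of(oldDoc, newDoc)).get(0).url());

        // Re-listing the segment dirs swaps the indexes: a different doc is kept.
        Rec oldDoc2 = new Rec("h1", 1, 5, "http://example.com/old");
        Rec newDoc2 = new Rec("h1", 0, 2, "http://example.com/new");
        System.out.println(dedup(List.of(oldDoc2, newDoc2)).get(0).url());
    }
}
```

Because segmentIdx is assigned from the directory-listing order, merely renaming or re-shuffling segment dirs changes which of two identical entries survives, exactly as the bug report describes.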
> This algorithm works by sorting records using tuples of (urlHash,
> contentHash, fetchDate, score, urlLength). Then:
>
> 1. If urlHash is the same, keep the doc with the highest fetchDate (the
>    latest version, as recorded by Fetcher).
> 2. If contentHash is the same, keep the doc with the highest score, and
>    then, if the scores are the same, keep the doc with the shortest URL.
>
> An initial fix will be prepared for trunk/ and then backported to the
> release branch.

--
This message is automatically generated by JIRA.

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
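The SegmentMergeTool-style selection rules quoted above can be sketched as follows. This is an illustrative simplification under assumed names, not the actual Nutch API; note that segmentIdx plays no role, which is what removes the order dependence:

```java
// Illustrative sketch (hypothetical names, not the actual SegmentMergeTool code):
// given two docs already judged duplicates, pick the one to keep.
public class DedupPolicy {
    record Doc(String urlHash, String contentHash, long fetchDate, float score, String url) {}

    // Rule 1 - same urlHash: keep the most recently fetched version.
    static Doc keepByUrl(Doc a, Doc b) {
        return a.fetchDate() >= b.fetchDate() ? a : b;
    }

    // Rule 2 - same contentHash: keep the higher score;
    // break ties with the shorter URL.
    static Doc keepByContent(Doc a, Doc b) {
        int cmp = Float.compare(a.score(), b.score());
        if (cmp != 0) return cmp > 0 ? a : b;
        return a.url().length() <= b.url().length() ? a : b;
    }

    public static void main(String[] args) {
        Doc d1 = new Doc("u1", "c1", 100L, 1.0f, "http://example.com/page");
        Doc d2 = new Doc("u1", "c2", 200L, 1.0f, "http://example.com/page?x");
        // Same URL hash: the newer fetch (fetchDate 200) wins.
        System.out.println(keepByUrl(d1, d2).fetchDate());

        Doc d3 = new Doc("u2", "c3", 100L, 1.0f, "http://example.com/a");
        Doc d4 = new Doc("u3", "c3", 100L, 1.0f, "http://example.com/a/long");
        // Same content hash, equal scores: the shorter URL wins.
        System.out.println(keepByContent(d3, d4).url());
    }
}
```

Unlike the (hash, segmentIdx, docIdx) sort, these rules depend only on properties of the documents themselves (fetch date, score, URL length), so the outcome is the same no matter how the segment dirs are listed.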
