[Nutch-dev] [jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments

Andrzej Bialecki (JIRA) Sat, 28 Jan 2006 18:07:02 -0800

    [ 
http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12364355 ]


Andrzej Bialecki  commented on NUTCH-95:
----------------------------------------

Yes, it should. SegmentMergeTool should handle this correctly in 0.7. For 0.8 
it is not (yet) supported...

> DeleteDuplicates depends on the order of input segments
> -------------------------------------------------------
>
>          Key: NUTCH-95
>          URL: http://issues.apache.org/jira/browse/NUTCH-95
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>     Versions: 0.8-dev, 0.6, 0.7
>     Reporter: Andrzej Bialecki 
>     Assignee: Andrzej Bialecki 

>
> DeleteDuplicates depends on the order in which the input segments are
> processed, which in turn depends on the order of segment dirs returned from
> NutchFileSystem.listFiles(File). In most cases this is undesired and may lead
> to deleting the wrong records from indexes. The silent assumption that
> segments at the end of the listing are more recent is not always true.
> Here's the explanation:
> * Dedup first deletes the URL duplicates by computing an MD5 hash of each
> URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx
> is just an int index into the array of open IndexReaders, and if segment dirs
> are moved/copied/renamed then entries in that array may change their order.
> Then, for all records with an equal hash, Dedup keeps just the first entry.
> Naturally, if segmentIdx changes due to dir renaming, a different record will
> be kept and different ones will be deleted...
> * Then Dedup deletes content duplicates, again by computing a hash of each
> document's content, and then sorting records by (hash, segmentIdx, docIdx).
> However, by now we already have a different set of undeleted docs depending
> on the order of input segments. On top of that, the same factor acts here,
> i.e. segmentIdx changes when you re-shuffle the input segment dirs, so again,
> when identical entries are compared the one with the lowest (segmentIdx,
> docIdx) is picked.
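
To make the order dependence concrete, here is a minimal sketch in plain Java. It is not Nutch code: the Rec class, the dedup method and the segmentOrder list are hypothetical stand-ins, hashes are passed in precomputed, and segmentIdx is modelled as the position of a segment name in the directory listing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the index-order-based dedup described above; not Nutch code.
class OrderDependentDedup {
    static final class Rec {
        final String hash;     // URL or content hash, precomputed for brevity
        final String segment;  // segment directory name
        final int docIdx;      // document number within that segment's index

        Rec(String hash, String segment, int docIdx) {
            this.hash = hash;
            this.segment = segment;
            this.docIdx = docIdx;
        }
    }

    // Sorts by (hash, segmentIdx, docIdx) and keeps the first record per hash.
    // segmentIdx is the position of the record's segment in segmentOrder,
    // i.e. it depends purely on how the segment dirs happened to be listed.
    static List<Rec> dedup(List<Rec> records, List<String> segmentOrder) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(Comparator
                .comparing((Rec r) -> r.hash)
                .thenComparingInt(r -> segmentOrder.indexOf(r.segment))
                .thenComparingInt(r -> r.docIdx));
        Map<String, Rec> kept = new LinkedHashMap<>();
        for (Rec r : sorted) {
            kept.putIfAbsent(r.hash, r); // first entry per hash survives
        }
        return new ArrayList<>(kept.values());
    }
}
```

With two records sharing a hash, one in "segA" and one in "segB", listing the segments as (segA, segB) keeps the segA record, while listing them as (segB, segA) keeps the segB record, even though the data itself is unchanged.
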
> Solution: use the fetched date from the first record in each segment to 
> determine the order of segments. Alternatively, modify DeleteDuplicates to 
> use the newer algorithm from SegmentMergeTool. This algorithm works by 
> sorting records using tuples of (urlHash, contentHash, fetchDate, score, 
> urlLength). Then:
> 1. If urlHash is the same, keep the doc with the highest fetchDate (the
> latest version, as recorded by Fetcher).
> 2. If contentHash is the same, keep the doc with the highest score, and then
> if the scores are the same, keep the doc with the shortest url.
> An initial fix will be prepared for trunk/ and then backported to the
> release branch.
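
The two rules above can be sketched roughly as follows. This is an illustration only, not the SegmentMergeTool implementation: the Doc class, keepBest and dedup are hypothetical names, hashes are passed in precomputed, and the single sort over the full tuple is simplified into two grouping passes. Full ties (same fetchDate, or same score and URL length) are not covered by the description; this sketch simply keeps the record seen first.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the two dedup rules; not the SegmentMergeTool code.
class FetchDateDedup {
    static final class Doc {
        final String url;
        final String urlHash;
        final String contentHash;
        final long fetchDate;
        final double score;

        Doc(String url, String urlHash, String contentHash, long fetchDate, double score) {
            this.url = url;
            this.urlHash = urlHash;
            this.contentHash = contentHash;
            this.fetchDate = fetchDate;
            this.score = score;
        }
    }

    // Rule 1: among docs with the same urlHash, the highest fetchDate is better.
    static final Comparator<Doc> URL_RULE =
            Comparator.comparingLong((Doc d) -> d.fetchDate);

    // Rule 2: among docs with the same contentHash, the highest score is
    // better; on equal scores the shorter URL is better.
    static final Comparator<Doc> CONTENT_RULE = Comparator
            .comparingDouble((Doc d) -> d.score)
            .thenComparing(Comparator.comparingInt((Doc d) -> d.url.length()).reversed());

    // Keeps one doc per key: the one that 'better' ranks highest.
    // On a full tie the doc encountered first is kept (assumption).
    static Collection<Doc> keepBest(Collection<Doc> docs,
                                    Function<Doc, String> key,
                                    Comparator<Doc> better) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : docs) {
            best.merge(key.apply(d), d,
                    (old, cur) -> better.compare(cur, old) > 0 ? cur : old);
        }
        return best.values();
    }

    static List<Doc> dedup(Collection<Doc> docs) {
        Collection<Doc> uniqueUrls = keepBest(docs, d -> d.urlHash, URL_RULE);
        return new ArrayList<>(keepBest(uniqueUrls, d -> d.contentHash, CONTENT_RULE));
    }
}
```

Because the choice depends only on values stored with each record (fetchDate, score, URL length) and not on a segment index, reordering the input segments does not change which document survives, except for the full ties noted above.
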




