DeleteDuplicates depends on the order of input segments
-------------------------------------------------------

         Key: NUTCH-95
         URL: http://issues.apache.org/jira/browse/NUTCH-95
     Project: Nutch
        Type: Bug
  Components: indexer  
    Versions: 0.8-dev, 0.6, 0.7    
    Reporter: Andrzej Bialecki 
 Assigned to: Andrzej Bialecki  


DeleteDuplicates depends on the order in which the input segments are
processed, which in turn depends on the order of segment dirs returned from 
NutchFileSystem.listFiles(File). In most cases this is undesired and may lead 
to deleting the wrong records from indexes. The silent assumption that segments 
at the end of the listing are more recent is not always true.

Here's the explanation:

* Dedup first deletes the URL duplicates by computing an MD5 hash for each URL 
and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx is just 
an int index into the array of open IndexReaders, and if segment dirs are 
moved/copied/renamed then entries in that array may change their order. Then, 
for all records with equal hashes, Dedup keeps just the first entry. Naturally, 
if segmentIdx changes due to dir renaming, a different record will be kept and 
different ones will be deleted.

* Then Dedup deletes content duplicates, again by computing a hash of each 
document's content and then sorting records by (hash, segmentIdx, docIdx). 
However, by now we already have a different set of undeleted docs depending on 
the order of input segments. On top of that, the same factor acts here, i.e. 
segmentIdx changes when you re-shuffle the input segment dirs, so again, when 
records with identical hashes are compared, the one with the lowest 
(segmentIdx, docIdx) is picked.
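
The order dependence described above can be reproduced with a small 
stand-alone sketch. This is not the DeleteDuplicates code: the class Rec, the 
method keepFirstPerHash and the segment names are hypothetical, and segmentIdx 
is modelled simply as the position of a segment name in the directory listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DedupOrderDemo {
    /** Hypothetical stand-in for one indexed document. */
    static final class Rec {
        final String hash;     // URL or content hash (a short literal here)
        final String segment;  // name of the segment dir holding the doc
        final int docIdx;      // document number inside that segment

        Rec(String hash, String segment, int docIdx) {
            this.hash = hash;
            this.segment = segment;
            this.docIdx = docIdx;
        }
    }

    /**
     * Sorts by (hash, segmentIdx, docIdx) and keeps the first record per hash.
     * segmentIdx is assumed to be the position of the segment in the listing.
     */
    static List<Rec> keepFirstPerHash(List<Rec> recs, List<String> listing) {
        List<Rec> sorted = new ArrayList<>(recs);
        sorted.sort(Comparator.comparing((Rec r) -> r.hash)
                .thenComparingInt(r -> listing.indexOf(r.segment))
                .thenComparingInt(r -> r.docIdx));
        List<Rec> kept = new ArrayList<>();
        String previousHash = null;
        for (Rec r : sorted) {
            if (!r.hash.equals(previousHash)) {
                kept.add(r);  // first record of a new hash group survives
            }
            previousHash = r.hash;
        }
        return kept;
    }

    public static void main(String[] args) {
        // The same two duplicate records, seen through two different listings.
        List<Rec> recs = Arrays.asList(
                new Rec("h1", "segA", 0),
                new Rec("h1", "segB", 0));
        System.out.println(
                keepFirstPerHash(recs, Arrays.asList("segA", "segB")).get(0).segment);
        System.out.println(
                keepFirstPerHash(recs, Arrays.asList("segB", "segA")).get(0).segment);
    }
}
```

With identical input records, the surviving duplicate is the one from segA 
under the first listing and the one from segB under the second, which is the 
behaviour this issue reports.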

Solution: use the fetched date from the first record in each segment to 
determine the order of segments. Alternatively, modify DeleteDuplicates to use 
the newer algorithm from SegmentMergeTool. This algorithm works by sorting 
records using tuples of (urlHash, contentHash, fetchDate, score, urlLength). 
Then:

1. If urlHash is the same, keep the doc with the highest fetchDate (the latest 
version, as recorded by Fetcher).
2. If contentHash is the same, keep the doc with the highest score, and then if 
the scores are the same, keep the doc with the shortest URL.
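
The two rules above can be sketched as follows. This is only an illustration 
of the selection rules, not the SegmentMergeTool implementation: the class Doc 
and the method dedup are hypothetical, the URL string stands in for urlHash, 
and maps are used instead of a sort over tuples. The final comparison by URL 
string is an added assumption to make ties fully deterministic; the issue text 
does not specify it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FetchDateDedup {
    /** Hypothetical stand-in for one indexed document. */
    static final class Doc {
        final String url;
        final String contentHash;
        final long fetchDate;  // e.g. milliseconds since the epoch
        final float score;

        Doc(String url, String contentHash, long fetchDate, float score) {
            this.url = url;
            this.contentHash = contentHash;
            this.fetchDate = fetchDate;
            this.score = score;
        }
    }

    // Rule 2 preference: higher score first, then shorter URL.
    // The last comparison (URL string) is an assumption, not from the issue.
    static final Comparator<Doc> CONTENT_PREFERENCE =
            Comparator.comparingDouble((Doc d) -> d.score).reversed()
                    .thenComparingInt(d -> d.url.length())
                    .thenComparing(d -> d.url);

    static List<Doc> dedup(List<Doc> docs) {
        // Rule 1: for each URL keep the doc with the highest fetchDate.
        Map<String, Doc> byUrl = new HashMap<>();
        for (Doc d : docs) {
            byUrl.merge(d.url, d,
                    (old, cur) -> cur.fetchDate > old.fetchDate ? cur : old);
        }
        // Rule 2: for each content hash keep the preferred doc.
        Map<String, Doc> byContent = new HashMap<>();
        for (Doc d : byUrl.values()) {
            byContent.merge(d.contentHash, d,
                    (old, cur) -> CONTENT_PREFERENCE.compare(cur, old) < 0 ? cur : old);
        }
        List<Doc> kept = new ArrayList<>(byContent.values());
        kept.sort(Comparator.comparing((Doc d) -> d.url));  // stable output order
        return kept;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("http://a/x", "c1", 100L, 1.0f),
                new Doc("http://a/x", "c2", 200L, 1.0f),
                new Doc("http://b/long-path", "c2", 150L, 1.0f),
                new Doc("http://c/", "c3", 50L, 2.0f));
        for (Doc d : dedup(docs)) {
            System.out.println(d.url + " " + d.fetchDate);
        }
    }
}
```

In this example the later fetch of http://a/x wins rule 1, and it then wins 
rule 2 over http://b/long-path because the scores are equal and its URL is 
shorter. Since nothing depends on the position of a record in the input, 
shuffling the input list gives the same result, except when two records share 
both URL and fetchDate, a tie this sketch leaves unresolved.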

An initial fix will be prepared for trunk/ and then backported to the release 
branch.
