[ http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12445619 ]

Andrzej Bialecki commented on NUTCH-95:
----------------------------------------
This issue is fixed in a recent version of DeleteDuplicates - see NUTCH-371.

> DeleteDuplicates depends on the order of input segments
> -------------------------------------------------------
>
>                 Key: NUTCH-95
>                 URL: http://issues.apache.org/jira/browse/NUTCH-95
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.7, 0.8, 0.6
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>
> DeleteDuplicates depends on what order the input segments are processed,
> which in turn depends on the order of segment dirs returned from
> NutchFileSystem.listFiles(File). In most cases this is undesired and may
> lead to deleting the wrong records from indexes. The silent assumption that
> segments at the end of the listing are more recent is not always true.
>
> Here's the explanation:
>
> * Dedup first deletes the URL duplicates by computing MD5 hashes for each
>   URL and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx
>   is just an int index into the array of open IndexReaders - and if segment
>   dirs are moved, copied, or renamed, then entries in that array may change
>   their order. For all equal triples, Dedup keeps just the first entry.
>   Naturally, if segmentIdx changes due to dir renaming, a different record
>   will be kept and different ones will be deleted.
>
> * Dedup then deletes content duplicates, again by computing hashes for each
>   document's content and sorting records by (hash, segmentIdx, docIdx).
>   However, by now we already have a different set of undeleted docs,
>   depending on the order of input segments. On top of that, the same factor
>   acts here: segmentIdx changes when you re-shuffle the input segment dirs,
>   so again, when identical entries are compared, the one with the lowest
>   (segmentIdx, docIdx) is kept.
>
> Solution: use the fetch date from the first record in each segment to
> determine the order of segments. Alternatively, modify DeleteDuplicates to
> use the newer algorithm from SegmentMergeTool.
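The order dependence described above can be illustrated with a minimal sketch. This is not Nutch code; the class and record names are hypothetical, and the records stand in for index entries:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch (hypothetical names, not the actual Nutch classes):
// records are sorted by (hash, segmentIdx, docIdx) and the first record of
// each hash group is kept, so the segment index decides which doc survives.
public class DedupOrderDemo {
    record Rec(String hash, int segmentIdx, int docIdx, String url) {}

    // Keep the first record per hash after sorting by (hash, segmentIdx, docIdx).
    static List<Rec> dedup(List<Rec> recs) {
        List<Rec> sorted = new ArrayList<>(recs);
        sorted.sort(Comparator.comparing(Rec::hash)
                .thenComparingInt(Rec::segmentIdx)
                .thenComparingInt(Rec::docIdx));
        List<Rec> kept = new ArrayList<>();
        String lastHash = null;
        for (Rec r : sorted) {
            if (!r.hash().equals(lastHash)) {
                kept.add(r);
                lastHash = r.hash();
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two duplicate docs (same URL hash); only the segment indexes differ.
        Rec oldDoc = new Rec("h1", 0, 5, "http://example.com/old");
        Rec newDoc = new Rec("h1", 1, 2, "http://example.com/new");
        System.out.println(dedup(List.of(oldDoc, newDoc)).get(0).url());

        // Re-listing the segment dirs swaps the indexes: a different doc is kept.
        Rec oldDoc2 = new Rec("h1", 1, 5, "http://example.com/old");
        Rec newDoc2 = new Rec("h1", 0, 2, "http://example.com/new");
        System.out.println(dedup(List.of(oldDoc2, newDoc2)).get(0).url());
    }
}
```

Because segmentIdx is assigned from the directory-listing order, merely renaming or re-shuffling segment dirs changes which of two identical entries survives, exactly as the bug report describes.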
> This algorithm works by sorting records using tuples of (urlHash,
> contentHash, fetchDate, score, urlLength). Then:
>
> 1. If urlHash is the same, keep the doc with the highest fetchDate (the
>    latest version, as recorded by Fetcher).
> 2. If contentHash is the same, keep the doc with the highest score, and
>    then, if the scores are the same, keep the doc with the shortest URL.
>
> An initial fix will be prepared for trunk/ and then backported to the
> release branch.

--
This message is automatically generated by JIRA.

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
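The SegmentMergeTool-style selection rules quoted above can be sketched as follows. This is an illustrative simplification under assumed names, not the actual Nutch API; note that segmentIdx plays no role, which is what removes the order dependence:

```java
// Illustrative sketch (hypothetical names, not the actual SegmentMergeTool code):
// given two docs already judged duplicates, pick the one to keep.
public class DedupPolicy {
    record Doc(String urlHash, String contentHash, long fetchDate, float score, String url) {}

    // Rule 1 - same urlHash: keep the most recently fetched version.
    static Doc keepByUrl(Doc a, Doc b) {
        return a.fetchDate() >= b.fetchDate() ? a : b;
    }

    // Rule 2 - same contentHash: keep the higher score;
    // break ties with the shorter URL.
    static Doc keepByContent(Doc a, Doc b) {
        int cmp = Float.compare(a.score(), b.score());
        if (cmp != 0) return cmp > 0 ? a : b;
        return a.url().length() <= b.url().length() ? a : b;
    }

    public static void main(String[] args) {
        Doc d1 = new Doc("u1", "c1", 100L, 1.0f, "http://example.com/page");
        Doc d2 = new Doc("u1", "c2", 200L, 1.0f, "http://example.com/page?x");
        // Same URL hash: the newer fetch (fetchDate 200) wins.
        System.out.println(keepByUrl(d1, d2).fetchDate());

        Doc d3 = new Doc("u2", "c3", 100L, 1.0f, "http://example.com/a");
        Doc d4 = new Doc("u3", "c3", 100L, 1.0f, "http://example.com/a/long");
        // Same content hash, equal scores: the shorter URL wins.
        System.out.println(keepByContent(d3, d4).url());
    }
}
```

Unlike the (hash, segmentIdx, docIdx) sort, these rules depend only on properties of the documents themselves (fetch date, score, URL length), so the outcome is the same no matter how the segment dirs are listed.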
