[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463056 ]
Dogacan Güney commented on NUTCH-420:
-------------------------------------

I thought I would attach an index that exhibits this bug. If you run dedup on the attached file, you can see that neither dup.html nor original.html is removed from the index, even though they have the same digest.

> DeleteDuplicates.HashPartitioner depends on the order of IndexDocs
> ------------------------------------------------------------------
>
>                 Key: NUTCH-420
>                 URL: https://issues.apache.org/jira/browse/NUTCH-420
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Dogacan Güney
>            Priority: Minor
>         Attachments: dedup-v2.patch, dedup.patch, index.tar.gz
>
>
> DeleteDuplicates.HashPartitioner.reduce():
>   // byScore case
>   if (value.score > highest.score) {
>     highest.keep = false;
>     LOG.debug("-discard " + highest + ", keep " + value);
>     output.collect(highest.url, highest); // delete highest
>     highest = value;
>   }
>   // !byScore is also similar
> So assume two docs with the same hash are in values. If the first has a lower
> score than the second, then the first doc will be deleted. But if the first
> has a higher score than the second, then neither will be deleted. AFAICS,
> there should be an else condition to delete value and keep highest as it is.
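For illustration, the else branch suggested in the issue could look roughly like the sketch below. This is not the attached patch: the Doc class is a simplified, hypothetical stand-in for IndexDoc, discardDuplicates is a made-up name, a plain list replaces output.collect(), and the scores are invented. It assumes highest starts out as the first doc in the group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DedupSketch {
    /** Hypothetical, simplified stand-in for IndexDoc (only url, score, keep). */
    static class Doc {
        final String url;
        final float score;
        boolean keep = true;

        Doc(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Takes docs that share one hash and returns the ones to delete.
     * The returned list plays the role of output.collect() in the quoted snippet.
     */
    static List<Doc> discardDuplicates(List<Doc> sameHash) {
        List<Doc> discarded = new ArrayList<>();
        Doc highest = null;
        for (Doc value : sameHash) {
            if (highest == null) {      // assumption: first doc seeds "highest"
                highest = value;
                continue;
            }
            if (value.score > highest.score) {
                highest.keep = false;   // same as the quoted byScore case
                discarded.add(highest);
                highest = value;
            } else {
                value.keep = false;     // the proposed else: delete value, keep highest
                discarded.add(value);
            }
        }
        return discarded;
    }

    public static void main(String[] args) {
        // Scores are invented for the example; both docs share a digest.
        List<Doc> a = discardDuplicates(Arrays.asList(
                new Doc("original.html", 2.0f), new Doc("dup.html", 1.0f)));
        List<Doc> b = discardDuplicates(Arrays.asList(
                new Doc("dup.html", 1.0f), new Doc("original.html", 2.0f)));
        System.out.println(a.get(0).url); // prints "dup.html"
        System.out.println(b.get(0).url); // prints "dup.html"
    }
}
```

With the else branch in place, the result no longer depends on the order in which the two docs arrive. Docs with equal scores fall into the else branch here, so the earlier one is kept; that tie-breaking choice is an assumption of this sketch.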