DeleteDuplicates.HashPartitioner depends on the order of IndexDocs
------------------------------------------------------------------

                 Key: NUTCH-420
                 URL: http://issues.apache.org/jira/browse/NUTCH-420
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 0.9.0
            Reporter: Dogacan Güney
            Priority: Minor


DeleteDuplicates.HashPartitioner.reduce():

// byScore case
if (value.score > highest.score) {
  highest.keep = false;
  LOG.debug("-discard " + highest + ", keep " + value);
  output.collect(highest.url, highest);     // delete highest
  highest = value;
}
// !byScore is also similar

So assume two docs with the same hash are in values. If the second has a higher
score than the first, then the first doc will be deleted. But if the second has
a lower score than the first, then neither will be deleted. AFAICS, there should
be an else branch to delete value and keep highest as it is.
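
A minimal sketch of the suggested else branch, for illustration only. It is not the actual Nutch source: Doc is a hypothetical stand-in for IndexDoc with just the fields used above, the returned list stands in for output.collect(), and the class and method names are made up. Only the byScore case is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class DedupSketch {

  /** Hypothetical stand-in for IndexDoc (assumption: only url, score, keep). */
  static final class Doc {
    final String url;
    final float score;
    boolean keep = true;

    Doc(String url, float score) {
      this.url = url;
      this.score = score;
    }
  }

  /**
   * byScore case with the proposed else branch.
   * Returns the docs marked for deletion; the list stands in for
   * output.collect(doc.url, doc) in the real reducer.
   */
  static List<Doc> reduceByScore(Iterator<Doc> values) {
    List<Doc> discarded = new ArrayList<>();
    Doc highest = null;
    while (values.hasNext()) {
      Doc value = values.next();
      if (highest == null) {        // first doc seen for this hash
        highest = value;
        continue;
      }
      if (value.score > highest.score) {
        highest.keep = false;       // new doc wins: delete old highest
        discarded.add(highest);
        highest = value;
      } else {
        value.keep = false;         // old highest wins: delete new doc
        discarded.add(value);
      }
    }
    return discarded;
  }

  public static void main(String[] args) {
    // Same two docs in both orders; the same doc should be discarded each time.
    for (Doc d : reduceByScore(Arrays.asList(new Doc("a", 1.0f), new Doc("b", 2.0f)).iterator())) {
      System.out.println("-discard " + d.url);
    }
    for (Doc d : reduceByScore(Arrays.asList(new Doc("b", 2.0f), new Doc("a", 1.0f)).iterator())) {
      System.out.println("-discard " + d.url);
    }
  }
}
```

With the else branch, exactly one doc per hash survives regardless of the order in which the values arrive. On equal scores this sketch keeps the doc seen first, which is an arbitrary choice.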


       
