document deduplication (exact duplicates) failed using MD5Signature
-------------------------------------------------------------------

                 Key: NUTCH-835
                 URL: https://issues.apache.org/jira/browse/NUTCH-835
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.1, 1.0.0
         Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
            Reporter: Sebastian Nagel


The MD5Signature class calculates different signatures for identical documents.

The reason is that
  byte[] data = content.getContent();
  ... StringBuilder().append(data) ...
resolves to StringBuilder.append(Object), which uses java.lang.Object.toString()
to get a string representation of the (binary) content. The resulting string is
built from the array's identity hash code (e.g., [...@30dc9065), not from its
bytes, so it differs even for two byte arrays with identical content.
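
The behaviour can be reproduced outside Nutch with a minimal sketch that uses
only the JDK. The class and method names below are hypothetical and the arrays
stand in for content.getContent(). Identity hash codes are not strictly
guaranteed to be distinct, so the two strings will almost always differ rather
than always.

```java
import java.util.Arrays;

public class ByteArrayToStringDemo {

    // Mirrors the failing pattern: there is no append(byte[]) overload, so this
    // call binds to append(Object) and stores Object.toString() of the array.
    static String appended(byte[] data) {
        return new StringBuilder().append(data).toString();
    }

    public static void main(String[] args) {
        byte[] first = {1, 2, 3};
        byte[] second = {1, 2, 3};

        // Same bytes, so the documents are exact duplicates.
        System.out.println("equal content: " + Arrays.equals(first, second));

        // Strings of the form "[B@<hex identity hash>", unrelated to the bytes.
        System.out.println("first:  " + appended(first));
        System.out.println("second: " + appended(second));
    }
}
```

Any hash computed over these strings depends on object identity, which is why
exact duplicates end up with different signatures.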

A solution would be to take the MD5 sum of the binary content as the first part
of the final signature calculation (the parsed content is the second part):
  ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
Of course, there are many other solutions...
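
As an illustration only, the proposed approach can be sketched with plain JDK
classes. This is not the Nutch patch: java.security.MessageDigest is swapped in
for MD5Hash and StringUtil, the class and method names are hypothetical, and
hashing the combined string as UTF-8 for the final signature is an assumption.
StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentSignatureSketch {

    // Lower-case hex encoding of a byte array (stand-in for StringUtil.toHexString).
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // MD5 digest of the given bytes (stand-in for MD5Hash.digest(...).getDigest()).
    static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // First part: hex MD5 of the raw bytes. Second part: the parsed text.
    // Assumption: the final signature is the MD5 of the combined string.
    static String signature(byte[] rawContent, String parsedText) {
        String combined = new StringBuilder()
                .append(toHex(md5(rawContent)))
                .append(parsedText)
                .toString();
        return toHex(md5(combined.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        byte[] first = "same page".getBytes(StandardCharsets.UTF_8);
        byte[] second = "same page".getBytes(StandardCharsets.UTF_8);
        // Distinct array objects with identical bytes now give identical signatures.
        System.out.println(signature(first, "text").equals(signature(second, "text")));
    }
}
```

Because the digest is computed from the bytes themselves, the signature no
longer depends on which array object happens to hold the content.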

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
