document deduplication (exact duplicates) failed using MD5Signature

                 Key: NUTCH-835
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.1, 1.0.0
         Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
            Reporter: Sebastian Nagel

The MD5Signature class calculates different signatures for identical documents.

The reason is that
  byte[] data = content.getContent();
  ... StringBuilder().append(data) ...
uses java.lang.Object.toString() to get a string representation of the (binary)
content. This yields an identity-based hash code (e.g., [...@30dc9065) that
differs even for two byte arrays with identical content.
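
The behaviour described above can be reproduced outside Nutch with a minimal sketch. The class and method names (ByteArrayAppendDemo, appendRaw) are hypothetical and only mimic the append pattern quoted in this report; they are not taken from the Nutch source.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteArrayAppendDemo {

    /**
     * Mimics the pattern from the report. StringBuilder has no
     * append(byte[]) overload, so the call resolves to append(Object),
     * which appends Object.toString() of the array: its type tag plus
     * an identity hash code, not the bytes themselves.
     */
    static String appendRaw(byte[] data) {
        return new StringBuilder().append(data).toString();
    }

    public static void main(String[] args) {
        // Two distinct arrays holding exactly the same bytes.
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);

        // Each line prints "[B@" followed by a per-object hash code.
        System.out.println(appendRaw(a));
        System.out.println(appendRaw(b));

        // The contents themselves are equal.
        System.out.println(Arrays.equals(a, b));
    }
}
```

Any hash computed over such a string depends on object identity rather than on the document bytes, which is consistent with identical documents receiving different signatures.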

A solution would be to take the MD5 sum of the binary content as the first part
of the final signature calculation (the parsed content is the second part).
Of course, there are many other solutions...
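
One way to picture the proposed approach is the following sketch, which uses only the JDK's java.security.MessageDigest. It is not the actual Nutch patch: the class ContentSignatureSketch, the method signature(), and the choice of UTF-8 for the parsed text are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentSignatureSketch {

    /**
     * Hypothetical helper: digests the raw bytes first, then feeds that
     * digest and the parsed text into the final MD5 computation.
     */
    static byte[] signature(byte[] rawContent, String parsedText) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // First part: MD5 of the binary content itself.
            // digest(byte[]) also resets the MessageDigest for reuse.
            byte[] rawDigest = md5.digest(rawContent);
            // Final signature: raw-content digest followed by parsed text.
            md5.update(rawDigest);
            md5.update(parsedText.getBytes(StandardCharsets.UTF_8));
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a standard algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Formats a digest as lowercase hexadecimal. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] a = "<html>same</html>".getBytes(StandardCharsets.UTF_8);
        byte[] b = "<html>same</html>".getBytes(StandardCharsets.UTF_8);
        // Distinct arrays with equal bytes now give the same signature.
        System.out.println(toHex(signature(a, "same")));
        System.out.println(toHex(signature(b, "same")));
    }
}
```

Because the digest is computed over the bytes rather than over the array's string representation, two fetches of an identical document map to the same signature under these assumptions.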
