document deduplication (exact duplicates) failed using MD5Signature
-------------------------------------------------------------------
Key: NUTCH-835
URL: https://issues.apache.org/jira/browse/NUTCH-835
Project: Nutch
Issue Type: Bug
Affects Versions: 1.1, 1.0.0
Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
Reporter: Sebastian Nagel
The MD5Signature class calculates different signatures for identical documents.
The reason is that
byte[] data = content.getContent();
... StringBuilder().append(data) ...
resolves to StringBuilder.append(Object) and therefore uses
java.lang.Object.toString() to get a string representation of the (binary)
content. That string contains the array's identity hash code
(e.g., [...@30dc9065), which differs even for two byte arrays with
identical content.
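The behaviour described above can be reproduced outside Nutch. The following
is a minimal sketch, not the MD5Signature code itself: the class name
ByteArrayAppendDemo and the helper appendRaw are made up for illustration,
and only JDK classes are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteArrayAppendDemo {

    // Hypothetical helper mirroring the pattern from the report:
    // a byte[] is appended to a StringBuilder without any decoding.
    public static String appendRaw(byte[] data) {
        // No append(byte[]) overload exists, so this resolves to
        // append(Object), which falls back to Object.toString().
        return new StringBuilder().append(data).toString();
    }

    public static void main(String[] args) {
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);

        // The two arrays hold identical bytes.
        System.out.println("equal bytes:   " + Arrays.equals(a, b));

        // Each string is the type tag "[B", an "@", and the identity hash
        // code in hex, so the two strings normally differ.
        System.out.println("appended a:    " + appendRaw(a));
        System.out.println("appended b:    " + appendRaw(b));
        System.out.println("equal strings: " + appendRaw(a).equals(appendRaw(b)));
    }
}
```

Because the appended string depends on object identity rather than on the
bytes, any hash computed over it will usually change from one fetch of the
same document to the next.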
A solution would be to use the MD5 sum of the binary content as the first
part of the final signature calculation (the parsed content is the second
part):
...
.append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
Of course, there are many other solutions...
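As a rough illustration of the proposed approach, the sketch below hashes the
raw bytes first and then combines the hex digest with the parsed text. It is
written under assumptions: it uses java.security.MessageDigest from the JDK
in place of Nutch's MD5Hash and StringUtil, and the names
ContentSignatureSketch, md5, toHex and signature are hypothetical, not part
of the Nutch API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentSignatureSketch {

    // MD5 digest of a byte array, using the JDK instead of Nutch's MD5Hash.
    public static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Lower-case hex encoding, standing in for StringUtil.toHexString.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hypothetical signature: first part is the hex MD5 of the raw content,
    // second part is the parsed text; the combination is hashed again.
    public static byte[] signature(byte[] rawContent, String parsedText) {
        String combined = new StringBuilder()
                .append(toHex(md5(rawContent)))
                .append(parsedText)
                .toString();
        return md5(combined.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] a = "<html>doc</html>".getBytes(StandardCharsets.UTF_8);
        byte[] b = "<html>doc</html>".getBytes(StandardCharsets.UTF_8);
        // Identical bytes and identical text now give identical signatures.
        System.out.println(toHex(signature(a, "doc")));
        System.out.println(toHex(signature(b, "doc")));
    }
}
```

With this shape the signature depends only on the bytes and the parsed text,
so two fetches of an unchanged document compare as exact duplicates.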