[
https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884630#action_12884630
]
Andrzej Bialecki commented on NUTCH-835:
-----------------------------------------
Sorry, I should've been more precise - I committed this to branch-1.2 as well
(r95963).
> document deduplication (exact duplicates) failed using MD5Signature
> -------------------------------------------------------------------
>
> Key: NUTCH-835
> URL: https://issues.apache.org/jira/browse/NUTCH-835
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.0.0, 1.1
> Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
> Reporter: Sebastian Nagel
> Assignee: Andrzej Bialecki
> Fix For: 1.2, 2.0
>
>
> The MD5Signature class calculates different signatures for identical documents.
> The reason is that
>   byte[] data = content.getContent();
>   ... StringBuilder().append(data) ...
> uses java.lang.Object.toString() to get a string representation of the
> (binary) content, which results in unique hash codes (e.g., [...@30dc9065)
> even for two byte arrays with identical content.
> A solution would be to take the MD5 sum of the binary content as the first
> part of the final signature calculation (the parsed content is the second part):
>   ...
>   .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
> Of course, there are many other solutions...
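
For illustration, the behaviour described in the issue can be reproduced outside Nutch with a minimal sketch. It assumes java.security.MessageDigest as a stand-in for Hadoop's MD5Hash and a local toHex helper in place of StringUtil.toHexString; the class and method names (SignatureSketch, buggySignature, fixedSignature) are hypothetical and are not the code committed for NUTCH-835.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not the Nutch MD5Signature implementation.
class SignatureSketch {

    // Hex-encode a byte array (stand-in for StringUtil.toHexString).
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // MD5 digest via the JDK (stand-in for MD5Hash.digest(data).getDigest()).
    static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Buggy variant: append(byte[]) resolves to append(Object), so the
    // array's identity-based toString() is hashed, not its content.
    static String buggySignature(byte[] data, String text) {
        String s = new StringBuilder().append(data).append(text).toString();
        return toHex(md5(s.getBytes(StandardCharsets.UTF_8)));
    }

    // Suggested variant: hash the binary content first, then append the
    // hex digest and the parsed text before computing the final MD5.
    static String fixedSignature(byte[] data, String text) {
        String s = new StringBuilder().append(toHex(md5(data))).append(text).toString();
        return toHex(md5(s.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);
        System.out.println("buggy a: " + buggySignature(a, "parsed text"));
        System.out.println("buggy b: " + buggySignature(b, "parsed text"));
        System.out.println("fixed a: " + fixedSignature(a, "parsed text"));
        System.out.println("fixed b: " + fixedSignature(b, "parsed text"));
    }
}
```

With the fixed variant, two arrays holding the same bytes always yield the same signature. The buggy variant depends on each array's identity hash code, so equal content will normally produce different signatures.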