[ https://issues.apache.org/jira/browse/NUTCH-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2391.
------------------------------------
Resolution: Fixed
Committed to 1.x,
[d35b433|https://github.com/apache/nutch/commit/d35b433c397c03e78245c3e262ecaa31c78a564e].
Thanks, [~kakrofoon]!
> Spurious Duplications for MD5
> -----------------------------
>
> Key: NUTCH-2391
> URL: https://issues.apache.org/jira/browse/NUTCH-2391
> Project: Nutch
> Issue Type: Bug
> Components: commoncrawl
> Affects Versions: 1.11
> Reporter: David Johnson
> Priority: Minor
> Fix For: 1.14
>
>
> We're seeing a large number of documents incorrectly marked as duplicates
> in our crawl.
> We traced this back to one of the crawl plugins returning an empty array for
> the content field. Every empty array hashes to the same MD5 digest, so all
> such documents end up with an identical signature.
> We'd like to propose changing the MD5 signature generation from:
> {code}
>   public byte[] calculate(Content content, Parse parse) {
>     byte[] data = content.getContent();
>     if (data == null)
>       data = content.getUrl().getBytes();
>     return MD5Hash.digest(data).getDigest();
>   }
> {code}
> to:
> {code}
>   public byte[] calculate(Content content, Parse parse) {
>     byte[] data = content.getContent();
>     if ((data == null) || (data.length == 0))
>       data = content.getUrl().getBytes();
>     return MD5Hash.digest(data).getDigest();
>   }
> {code}
> to address the issue.
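
For illustration, here is a minimal standalone sketch of the effect described in the issue. It is not the Nutch code: the class name SignatureSketch and its helper methods are hypothetical, the JDK's MessageDigest stands in for Nutch's MD5Hash, and the URL is encoded as UTF-8, which is an assumption (the snippets in the issue call getBytes() with the platform default charset).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the signature logic discussed in NUTCH-2391.
final class SignatureSketch {

    // Plain MD5 over the given bytes, using the JDK instead of Nutch's MD5Hash.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // Mirrors the proposed check: fall back to the URL when the content
    // is null OR empty. UTF-8 is an assumption made for this sketch.
    static byte[] calculate(byte[] content, String url) {
        byte[] data = content;
        if (data == null || data.length == 0) {
            data = url.getBytes(StandardCharsets.UTF_8);
        }
        return md5(data);
    }

    // Lowercase hex rendering of a digest, for printing and comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

With only the null check, every document whose content is an empty array is hashed as md5(new byte[0]), which is always d41d8cd98f00b204e9800998ecf8427e, so such documents all look like duplicates of one another. With the additional length check, two empty documents at different URLs receive different signatures.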