[
https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1335:
---------------------------------
Attachment: NUTCH-1335-1.6-1.patch
Patch for 1.5. The reducer now only collects records whose timestamp is equal to
or higher than the mostRecent timestamp. This can still leave duplicates in the
aggregated collection, but not a significant number.
The patch seems to work: the troubled reducer finished nicely. I'll test it with
a few more runs, each with a very large number of input records that also
contain duplicates.
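For illustration, the filtering rule described above could look roughly like the sketch below. This is not the code from the attached patch: the class name OutlinkDedupSketch, the Link record type and the collect method are hypothetical stand-ins, and the single-pass structure is an assumption. Under that assumption, a record seen before a newer one arrives stays in the list, which would be one way some duplicates could survive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and structure are assumptions, not Nutch code.
class OutlinkDedupSketch {

    /** Hypothetical stand-in for an outlink record: target URL plus timestamp. */
    static final class Link {
        final String url;
        final long timestamp;

        Link(String url, long timestamp) {
            this.url = url;
            this.timestamp = timestamp;
        }
    }

    /**
     * Single pass over the values for one key: a record is kept only if its
     * timestamp is equal to or higher than the most recent timestamp seen so far.
     */
    static List<Link> collect(Iterable<Link> values) {
        long mostRecent = 0L;
        List<Link> kept = new ArrayList<>();
        for (Link link : values) {
            // Track the newest timestamp encountered up to this point.
            if (link.timestamp > mostRecent) {
                mostRecent = link.timestamp;
            }
            // Older records are skipped; equal or newer ones are collected.
            if (link.timestamp >= mostRecent) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

With timestamps arriving in the order 10, 5, 20, 20 for the same URL, this sketch skips the record stamped 5 but keeps the one stamped 10 (it was the newest when it was seen) and both records stamped 20.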
> OutlinkDB to collect unique URL's only
> --------------------------------------
>
> Key: NUTCH-1335
> URL: https://issues.apache.org/jira/browse/NUTCH-1335
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1335-1.6-1.patch
>
>
> The aggregating code in the Outlink reducer does not take care of incoming
> duplicates. When the input segments contain duplicates of a single URL they
> are collected.