[ https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601777#comment-14601777 ]
Sebastian Nagel commented on NUTCH-1335:
----------------------------------------
Reasonable. But wouldn't it be more consistent to take only one (the first, the
last, or the most recent)? In the worst case, if the links happen to be sorted
from old to new, all of them are still taken.
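
To make the "take only one" option concrete, here is a minimal sketch of the
"most recent" variant. It is not the code from the attached patches and does
not use Nutch or Hadoop classes: the Link class, its url and timestamp fields,
and the keepMostRecent method are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutlinkDedup {

    // Hypothetical stand-in for a link record; NOT a Nutch class.
    public static final class Link {
        public final String url;
        public final long timestamp;

        public Link(String url, long timestamp) {
            this.url = url;
            this.timestamp = timestamp;
        }
    }

    // Keeps exactly one Link per URL: the one with the largest timestamp.
    // On equal timestamps the entry seen first is kept. The output order
    // follows the first appearance of each URL in the input.
    public static List<Link> keepMostRecent(Iterable<Link> links) {
        Map<String, Link> newest = new LinkedHashMap<>();
        for (Link link : links) {
            Link seen = newest.get(link.url);
            if (seen == null || link.timestamp > seen.timestamp) {
                // Replacing the value of an existing key does not change
                // its position in a LinkedHashMap.
                newest.put(link.url, link);
            }
        }
        return new ArrayList<>(newest.values());
    }

    public static void main(String[] args) {
        // Input deliberately sorted from old to new for the duplicated URL.
        List<Link> input = Arrays.asList(
            new Link("http://example.org/a", 1L),
            new Link("http://example.org/a", 2L),
            new Link("http://example.org/b", 1L),
            new Link("http://example.org/a", 3L));
        for (Link link : keepMostRecent(input)) {
            System.out.println(link.url + " " + link.timestamp);
        }
    }
}
```

Because each newer duplicate replaces the stored entry instead of being
appended, an input sorted from old to new still yields a single entry per URL.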
> OutlinkDB to collect unique URL's only
> --------------------------------------
>
> Key: NUTCH-1335
> URL: https://issues.apache.org/jira/browse/NUTCH-1335
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1335-1.6-1.patch, NUTCH-1335.patch
>
>
> The aggregating code in the Outlink reducer does not handle incoming
> duplicates. When the input segments contain duplicates of a single URL, all
> of them are collected.