[ https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1335:
---------------------------------
Attachment: NUTCH-1335.patch
Updated as well. This increases performance on very large crawls where
segments contain duplicates.
> OutlinkDB to collect unique URL's only
> --------------------------------------
>
> Key: NUTCH-1335
> URL: https://issues.apache.org/jira/browse/NUTCH-1335
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1335-1.6-1.patch, NUTCH-1335.patch
>
>
> The aggregating code in the Outlink reducer does not handle incoming
> duplicates. When the input segments contain duplicates of a single URL, they
> are all collected instead of being collapsed into one entry.
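
For illustration only, here is a minimal sketch of the deduplication idea described in the issue. It is not the code from NUTCH-1335.patch and does not use the Nutch or Hadoop APIs; the class name OutlinkDedupSketch and the method uniqueOutlinks are hypothetical, and outlinks are assumed to be plain URL strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: NOT the actual NUTCH-1335 patch and not a Nutch class.
class OutlinkDedupSketch {

    // Collects each incoming outlink URL once, preserving first-seen order.
    static List<String> uniqueOutlinks(Iterable<String> incoming) {
        Set<String> seen = new LinkedHashSet<>();
        for (String url : incoming) {
            // add() leaves the set unchanged when the URL is already present
            seen.add(url);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // The same URL arriving from several segments is kept only once.
        List<String> values = Arrays.asList(
                "http://example.org/a",
                "http://example.org/b",
                "http://example.org/a");
        System.out.println(uniqueOutlinks(values));
    }
}
```

A set keeps the reducer's output proportional to the number of distinct URLs rather than the number of incoming values, which is presumably where the gain on large crawls with duplicate-heavy segments comes from.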