[ 
https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1335:
---------------------------------
    Attachment: NUTCH-1335.patch

Updated as well. This increases performance on very large crawls where 
segments contain duplicates.

> OutlinkDB to collect unique URL's only
> --------------------------------------
>
>                 Key: NUTCH-1335
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1335
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1335-1.6-1.patch, NUTCH-1335.patch
>
>
> The aggregating code in the Outlink reducer does not handle incoming 
> duplicates. When the input segments contain duplicates of a single URL, 
> every copy is collected.
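
As a rough illustration of the behaviour described above, the following is a minimal sketch of collecting only unique URLs from the values a reducer receives. It is not taken from the attached patch: the class name UniqueOutlinks and the method collectUnique are hypothetical, outlinks are simplified to plain URL strings, and no Hadoop or Nutch types are used.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the NUTCH-1335 patch: outlinks are modelled as
// plain URL strings instead of Nutch's own outlink types.
class UniqueOutlinks {

    // Collects each incoming URL once, keeping first-seen order.
    static List<String> collectUnique(Iterable<String> incomingUrls) {
        Set<String> seen = new LinkedHashSet<>();
        for (String url : incomingUrls) {
            seen.add(url); // add() leaves the set unchanged for a URL already present
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // The same URL arriving from two segments is collected only once.
        List<String> values = List.of(
                "http://a.example/", "http://b.example/", "http://a.example/");
        System.out.println(collectUnique(values));
        // prints [http://a.example/, http://b.example/]
    }
}
```

A set keyed on the URL keeps the per-key work linear in the number of incoming values, at the cost of holding the unique URLs for that key in memory.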



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
