[ 
https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1335:
---------------------------------

    Attachment: NUTCH-1335-1.6-1.patch

Patch for 1.5. The reducer now only collects records whose timestamp is equal 
to or higher than the mostRecent timestamp. This can still leave duplicates in 
the aggregated collection, but not a significant number.
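
A minimal sketch of the filtering idea described above, not the code from the 
attached patch. The class, the LinkRecord type and the collectRecent method are 
hypothetical names chosen for illustration; only the mostRecent comparison 
comes from the description. It assumes records arrive in arbitrary order and 
that mostRecent is tracked while iterating, which is why an older record seen 
before a newer one for the same URL can still end up in the collection.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkFilterSketch {

    // Hypothetical stand-in for a link record; not a Nutch class.
    static final class LinkRecord {
        final String url;
        final long timestamp;

        LinkRecord(String url, long timestamp) {
            this.url = url;
            this.timestamp = timestamp;
        }
    }

    // Collects only records whose timestamp is equal to or higher than
    // the most recent timestamp seen so far in this pass.
    static List<LinkRecord> collectRecent(Iterable<LinkRecord> values) {
        List<LinkRecord> collected = new ArrayList<>();
        long mostRecent = Long.MIN_VALUE;

        for (LinkRecord record : values) {
            if (record.timestamp >= mostRecent) {
                mostRecent = record.timestamp;
                collected.add(record);
            }
            // Older records are skipped. Records collected earlier are
            // not removed, so some duplicates can remain.
        }
        return collected;
    }
}
```

With three records for the same URL arriving with timestamps 2, 1 and 3, this 
sketch skips the record with timestamp 1 but keeps both 2 and 3, which matches 
the remark that a small number of duplicates can remain.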

This patch seems to work, as the troubled reducer finished nicely. I'll test 
with a few more runs, each with a very large number of input records that also 
contain duplicates.
                
> OutlinkDB to collect unique URL's only
> --------------------------------------
>
>                 Key: NUTCH-1335
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1335
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1335-1.6-1.patch
>
>
> The aggregating code in the Outlink reducer does not handle incoming 
> duplicates. When the input segments contain duplicates of a single URL, all 
> of them are collected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
