[ https://issues.apache.org/jira/browse/NUTCH-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1608:
----------------------------------------
Fix Version/s: 2.3
> SolrDeleteDuplicates bug: choosing preferred page when duplicates does not
> work
> -------------------------------------------------------------------------------
>
> Key: NUTCH-1608
> URL: https://issues.apache.org/jira/browse/NUTCH-1608
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 2.1, 2.2.1
> Environment: all
> Reporter: Brian
> Priority: Minor
> Labels: patch
> Fix For: 2.3
>
> Attachments: NUTCH-1608.patch
>
>
> There is a bug in the code that decides which version of a page to keep when
> duplicates are found. The bug is in the reduce function and is a common
> pitfall when using Hadoop MapReduce, as explained here:
>
> http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
> The issue is that, in the reduce function, advancing the values iterator does
> not return a reference to a new object. It only overwrites the content of a
> single reused object and returns that same reference each time. It is
> therefore not correct to compare the current value with a previously stored
> reference: both point to the same object and will always be equal. Instead,
> it is necessary to copy the object to preserve it for later comparison.
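
The pitfall described above can be reproduced without Hadoop. The following
is only a minimal sketch, not the code from the patch: Record, reusingIterator,
pickBuggy and pickFixed are hypothetical names, and the iterator merely imitates
the way a reducer's values iterator refills one shared object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class ReuseDemo {
    /** Hypothetical stand-in for a Writable record; not the actual Nutch class. */
    static final class Record {
        String id;
        float boost;
        Record() {}
        Record(String id, float boost) { this.id = id; this.boost = boost; }
        Record(Record other) { this(other.id, other.boost); } // copy constructor
        void set(String id, float boost) { this.id = id; this.boost = boost; }
    }

    /** Imitates a reducer's values iterator: next() refills and returns ONE shared instance. */
    static Iterator<Record> reusingIterator(List<Record> source) {
        final Iterator<Record> inner = source.iterator();
        final Record shared = new Record();
        return new Iterator<Record>() {
            @Override public boolean hasNext() { return inner.hasNext(); }
            @Override public Record next() {
                Record r = inner.next();
                shared.set(r.id, r.boost); // overwrite contents in place
                return shared;             // same reference every time
            }
        };
    }

    /** Buggy: keeps the reference, so 'best' always aliases the current value. */
    static String pickBuggy(Iterator<Record> values) {
        Record best = values.next();
        while (values.hasNext()) {
            Record current = values.next();
            if (current.boost > best.boost) { // never true: current and best are the same object
                best = current;
            }
        }
        return best.id; // ends up being whatever value was read last
    }

    /** Fixed: copies the value before keeping it for later comparison. */
    static String pickFixed(Iterator<Record> values) {
        Record best = new Record(values.next());
        while (values.hasNext()) {
            Record current = values.next();
            if (current.boost > best.boost) {
                best = new Record(current);
            }
        }
        return best.id;
    }

    public static void main(String[] args) {
        List<Record> docs = new ArrayList<>();
        docs.add(new Record("a", 2.0f)); // highest boost, should win
        docs.add(new Record("b", 1.0f));
        System.out.println(pickBuggy(reusingIterator(docs))); // prints "b"
        System.out.println(pickFixed(reusingIterator(docs))); // prints "a"
    }
}
```

In a real reducer the copy would be made with whatever copy or clone facility
the value type provides; the copy constructor here is just the simplest way to
show the idea.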
> The attached patch also encodes additional preferences between URLs. After
> comparing the boost values, it compares the extension (preferring either no
> extension or a .htm or .html extension), then the length (preferring shorter
> URLs), and finally the timestamp. This can be modified as desired by changing
> the contents of the "isPreferredOver" method.
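
The preference order described above could look roughly like the following.
This is a sketch under stated assumptions, not the contents of
NUTCH-1608.patch: the Doc class and hasPreferredExtension helper are
hypothetical, and the direction of the boost and timestamp comparisons (higher
boost wins, newer timestamp wins) is assumed, since the description does not
state it.

```java
import java.net.URI;
import java.util.Locale;

final class UrlPreference {
    /** Hypothetical record holding the fields the description mentions. */
    static final class Doc {
        final String url;
        final float boost;
        final long tstamp;
        Doc(String url, float boost, long tstamp) {
            this.url = url;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    /** True if the last path segment has no extension, or ends in .htm or .html. */
    static boolean hasPreferredExtension(String url) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false; // assumption: unparsable URLs are never preferred
        }
        if (path == null) {
            path = ""; // opaque URIs have no path
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        if (dot < 0) {
            return true; // no extension
        }
        String ext = lastSegment.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ext.equals("htm") || ext.equals("html");
    }

    /** Returns true if 'a' should be kept in preference to 'b'. */
    static boolean isPreferredOver(Doc a, Doc b) {
        // 1. Boost (assumption: higher boost wins).
        if (a.boost != b.boost) {
            return a.boost > b.boost;
        }
        // 2. Extension: none, .htm or .html is preferred.
        boolean extA = hasPreferredExtension(a.url);
        boolean extB = hasPreferredExtension(b.url);
        if (extA != extB) {
            return extA;
        }
        // 3. Length: shorter URLs are preferred.
        if (a.url.length() != b.url.length()) {
            return a.url.length() < b.url.length();
        }
        // 4. Timestamp (assumption: newer wins).
        return a.tstamp > b.tstamp;
    }

    public static void main(String[] args) {
        Doc html = new Doc("http://example.com/page.html", 1.0f, 10L);
        Doc php = new Doc("http://example.com/page.php", 1.0f, 10L);
        System.out.println(isPreferredOver(html, php)); // prints "true"
    }
}
```

Each criterion is consulted only when all earlier ones tie, so changing the
order of the checks changes which duplicate survives.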