[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064618#comment-13064618 ]
Julien Nioche commented on NUTCH-1037:
--------------------------------------
* indentation : not that bad indeed - must be my eyes getting tired :-)
* lowercasing when deduplicating (but not when sending) => you lowercase the
anchors before checking whether they have already been found and, if not, send
them as they are to SOLR (which is the right thing to do, as it's up to SOLR to
take care of the analysis)
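
For illustration only, the behaviour described in the second point can be
sketched roughly as below. This is not the code from the attached patches: the
class name AnchorDedup and the method dedupe are hypothetical, and the sketch
assumes the anchors arrive as a non-null String array with no null entries.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not taken from the NUTCH-1037 patches.
final class AnchorDedup {

    /**
     * Removes anchors that differ only by case, keeping the first spelling seen.
     * Assumes the array and its entries are non-null.
     */
    static List<String> dedupe(String[] anchors) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String anchor : anchors) {
            // Lowercase only the lookup key used for the duplicate check.
            String key = anchor.toLowerCase(Locale.ROOT);
            // Set.add returns false if the key was already present.
            if (seen.add(key)) {
                // Keep the anchor exactly as found; analysis is left to the indexer.
                unique.add(anchor);
            }
        }
        return unique;
    }
}
```

Called with "Home", "home", "HOME" and "About", this sketch would keep "Home"
and "About" in that order, so the values passed on for indexing retain their
original case.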
> Deduplicate anchors before indexing
> -----------------------------------
>
> Key: NUTCH-1037
> URL: https://issues.apache.org/jira/browse/NUTCH-1037
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1037-1.4-1.patch, NUTCH-1037-1.4-2.patch,
> NUTCH-1037-2.0-1.patch, NUTCH-1037-2.0-2.patch
>
>
> Anchors are not deduplicated before indexing. This can result in a very high
> number of similar and identical anchors being indexed. Before indexing,
> anchors must be deduplicated at least on case.
> Should this be implemented as a fix or as a new feature that needs to be
> configured?