[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing

Julien Nioche (JIRA) Mon, 11 Jul 2011 08:02:26 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063380#comment-13063380
 ]


Julien Nioche commented on NUTCH-1037:
--------------------------------------

IIRC it stores a byte array for each term, and there are utilities to simplify 
encoding values into that byte array.
We could distribute the occurrences for the whole anchor to each individual 
term it contains. Have done that for an ex-client, worked a treat. 
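
A rough sketch of that idea, not the code used for the client: each term is 
credited with the occurrence count of every anchor that contains it, and the 
total is packed into a small byte array. The class name, the method names and 
the 4-byte int encoding are assumptions made for illustration; the comment 
above does not name the storage mechanism or the encoding utilities.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: names and encoding are illustrative assumptions.
public class AnchorTermWeights {

    /**
     * Credits each term with the occurrence count of every anchor containing it.
     * Terms are lowercased and split on whitespace; a term repeated inside one
     * anchor is credited once per repetition.
     */
    public static Map<String, Integer> distribute(Map<String, Integer> anchorCounts) {
        Map<String, Integer> termCounts = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : anchorCounts.entrySet()) {
            String normalized = e.getKey().toLowerCase(Locale.ROOT).trim();
            for (String term : normalized.split("\\s+")) {
                if (!term.isEmpty()) {
                    termCounts.merge(term, e.getValue(), Integer::sum);
                }
            }
        }
        return termCounts;
    }

    /** Packs a count into a 4-byte big-endian array (assumed per-term format). */
    public static byte[] encode(int count) {
        return ByteBuffer.allocate(4).putInt(count).array();
    }

    /** Reads back a count written by encode(). */
    public static int decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt();
    }

    public static void main(String[] args) {
        Map<String, Integer> anchors = new LinkedHashMap<>();
        anchors.put("Apache Nutch", 3);   // anchor text seen 3 times
        anchors.put("nutch crawler", 2);  // anchor text seen 2 times
        // prints {apache=3, nutch=5, crawler=2}
        System.out.println(distribute(anchors));
    }
}
```

With this approach the anchor text can be indexed once while the per-term 
counts still carry the weight of the repeated anchors.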

The only trouble, as you pointed out, is that the search side of it is not 
supported by default in SOLR (yet), so this is not an option for now. It could 
be used later instead of the deduplication you suggested, which would have too 
much impact on the scores. 

Of course, if the deduplication is optional then that's fine, and people can 
choose between having smaller docs and more relevant search on the anchors.
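
For illustration only, an optional case-insensitive deduplication could look 
something like the sketch below. It is not taken from the attached patches; 
the class name and the boolean flag are hypothetical, and the flag simply 
stands in for whatever configuration switch would control the behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: not the NUTCH-1037 patch, just a sketch of the idea.
public class AnchorDeduplicator {

    /**
     * Returns the anchors to index. When deduplicate is true, anchors that
     * differ only by case are collapsed and the first spelling seen is kept;
     * when false, the list is returned unchanged (as a copy).
     */
    public static List<String> filter(List<String> anchors, boolean deduplicate) {
        if (!deduplicate) {
            return new ArrayList<>(anchors);
        }
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String anchor : anchors) {
            // Set.add returns false when the lowercased form was already seen.
            if (seen.add(anchor.toLowerCase(Locale.ROOT))) {
                unique.add(anchor);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> anchors = Arrays.asList("Home", "home", "HOME", "About us");
        System.out.println(filter(anchors, true));   // prints [Home, About us]
        System.out.println(filter(anchors, false));  // prints [Home, home, HOME, About us]
    }
}
```

Leaving the flag off keeps every anchor and therefore the current scoring; 
turning it on gives the smaller documents.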

> Deduplicate anchors before indexing
> -----------------------------------
>
>                 Key: NUTCH-1037
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1037
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1037-1.4-1.patch, NUTCH-1037-2.0-1.patch
>
>
> Anchors are not deduplicated before indexing. This can result in a very high 
> number of similar and identical anchors being indexed. Before indexing, 
> anchors must be deduplicated at least on case.
> Should this be implemented as a fix or as a new feature that needs to be 
> configured?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
