[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063380#comment-13063380 ]
Julien Nioche commented on NUTCH-1037:
--------------------------------------
IIRC it stores a byte array for each term, and there are utilities to simplify
encoding values into that byte array.
We could distribute the occurrences for the whole anchor to each individual
term it contains. Have done that for an ex-client, worked a treat.
The only trouble, as you pointed out, is that the search part of it is not
supported by default in Solr (yet), so this is not an option for now. It could
be used later instead of the deduplication you suggested, which would have too
much impact on the scores.
Of course, if the deduplication is optional then that's fine, and people can
choose between having smaller docs and more relevant search on the anchors.
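A minimal sketch of the distribution idea above, for illustration only. It
assumes anchors arrive as a map of anchor text to occurrence count, that terms
are split on whitespace after lowercasing, that a term repeated within one
anchor is counted once for that anchor, and that the per-term value is stored
as a 4-byte big-endian int. The class and method names (AnchorTermCounts,
distribute, encode, decode) are hypothetical and not part of Nutch or Solr.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: not an existing Nutch or Solr class.
public class AnchorTermCounts {

  /** Adds each anchor's occurrence count to every distinct term it contains. */
  public static Map<String, Integer> distribute(Map<String, Integer> anchorCounts) {
    Map<String, Integer> termCounts = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : anchorCounts.entrySet()) {
      // Assumption: case-insensitive terms, whitespace tokenization.
      String normalized = e.getKey().trim().toLowerCase(Locale.ROOT);
      if (normalized.isEmpty()) {
        continue;
      }
      // Distinct terms only, so "foo foo" credits "foo" once per anchor.
      Set<String> terms = new LinkedHashSet<>(Arrays.asList(normalized.split("\\s+")));
      for (String term : terms) {
        termCounts.merge(term, e.getValue(), Integer::sum);
      }
    }
    return termCounts;
  }

  /** Encodes a count as a 4-byte big-endian array (assumed storage format). */
  public static byte[] encode(int count) {
    return ByteBuffer.allocate(4).putInt(count).array();
  }

  /** Decodes a count written by encode(int). */
  public static int decode(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getInt();
  }
}
```

With the anchors "Apache Nutch" seen 3 times and "nutch" seen 2 times, this
gives "apache" a count of 3 and "nutch" a count of 5, so the duplicate anchors
still contribute to the per-term weight without being indexed repeatedly.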
> Deduplicate anchors before indexing
> -----------------------------------
>
> Key: NUTCH-1037
> URL: https://issues.apache.org/jira/browse/NUTCH-1037
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1037-1.4-1.patch, NUTCH-1037-2.0-1.patch
>
>
> Anchors are not deduplicated before indexing. This can result in a very high
> number of similar and identical anchors being indexed. Before indexing,
> anchors must be deduplicated at least on case.
> Should this be implemented as a fix or as a new feature that needs to be
> configured?
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira