[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193150#comment-15193150 ]
Alessandro Benedetti commented on LUCENE-6336:
----------------------------------------------
Initially I liked the idea of adding a component responsible for the
de-duplication.
But I would like to raise a concern: what about the number of
suggestions?
At the moment the requested number of suggestions is the upper bound on the
search in the auxiliary Lucene index
(see org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.java:591).
This means that asking for a maximum of 5 suggestions could return 5
duplicates (leaving the other distinct values further down in the results).
The dedupe wrapper would then collapse them and return only 1 suggestion (we
lose the other 4 good suggestions that ranked lower).
We risk not filling the top N that was requested in the configuration.
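To make the shortfall concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: `TopNShortfallDemo`, `Hit`, `search` and `dedupeAfterSearch` are hypothetical names that only mimic "search capped at num, then dedupe afterwards", and the sample terms and weights are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; none of these names exist in Lucene.
final class TopNShortfallDemo {

    /** A (term, weight) pair standing in for one lookup result. */
    static final class Hit {
        final String term;
        final long weight;

        Hit(String term, long weight) {
            this.term = term;
            this.weight = weight;
        }
    }

    /** Mimics the index search: sort by weight descending, cut off at num. */
    static List<Hit> search(List<Hit> index, int num) {
        List<Hit> sorted = new ArrayList<>(index);
        sorted.sort((a, b) -> Long.compare(b.weight, a.weight));
        return new ArrayList<>(sorted.subList(0, Math.min(num, sorted.size())));
    }

    /** Mimics a dedupe wrapper that runs after the search: first hit per term wins. */
    static List<String> dedupeAfterSearch(List<Hit> hits) {
        Set<String> seen = new LinkedHashSet<>();
        for (Hit h : hits) {
            seen.add(h.term);
        }
        return new ArrayList<>(seen);
    }

    /** Made-up data: five duplicates of one term outrank two other distinct terms. */
    static List<Hit> sampleIndex() {
        return Arrays.asList(
            new Hit("English", 100), new Hit("English", 99), new Hit("English", 98),
            new Hit("English", 97), new Hit("English", 96),
            new Hit("England", 50), new Hit("Englewood", 40));
    }

    public static void main(String[] args) {
        // 5 suggestions requested, but the cutoff happens before deduplication.
        List<String> result = dedupeAfterSearch(search(sampleIndex(), 5));
        System.out.println(result); // prints [English]
    }
}
```

With num=5 the two lower-ranked distinct terms never reach the wrapper, so the caller gets 1 suggestion instead of the 3 distinct ones available.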
I was thinking we should solve this on the Lucene side, building a better
query using field collapsing.
In particular I think we should add a couple of parameters (unique=true and
weightCalculus=max|min|avg etc.) and play with something similar to:
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results .
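A rough sketch of the behaviour I have in mind, again in plain Java over an in-memory list rather than a real Lucene query. The `WeightCalculus` enum mirrors the proposed weightCalculus=max|min|avg parameter; `CollapseThenTopNDemo`, `Hit` and `collapseTopN` are hypothetical names, the data is made up, and this is not how Solr's collapsing is implemented. The only point is the order of operations: collapse per term first, apply the top N cutoff second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names exist in Lucene or Solr.
final class CollapseThenTopNDemo {

    /** Mirrors the proposed weightCalculus=max|min|avg parameter. */
    enum WeightCalculus { MAX, MIN, AVG }

    /** A (term, weight) pair standing in for one suggestion. */
    static final class Hit {
        final String term;
        final long weight;

        Hit(String term, long weight) {
            this.term = term;
            this.weight = weight;
        }

        @Override
        public String toString() {
            return term + "/" + weight;
        }
    }

    /** Collapses duplicates per term, then sorts by weight and cuts off at num. */
    static List<Hit> collapseTopN(List<Hit> index, int num, WeightCalculus mode) {
        // Group every weight by its term before any cutoff is applied.
        Map<String, List<Long>> byTerm = new LinkedHashMap<>();
        for (Hit h : index) {
            byTerm.computeIfAbsent(h.term, k -> new ArrayList<>()).add(h.weight);
        }
        // One collapsed hit per distinct term.
        List<Hit> collapsed = new ArrayList<>();
        for (Map.Entry<String, List<Long>> e : byTerm.entrySet()) {
            collapsed.add(new Hit(e.getKey(), aggregate(e.getValue(), mode)));
        }
        collapsed.sort((a, b) -> Long.compare(b.weight, a.weight));
        return new ArrayList<>(collapsed.subList(0, Math.min(num, collapsed.size())));
    }

    /** Combines the weights of one term; AVG uses integer division. */
    static long aggregate(List<Long> weights, WeightCalculus mode) {
        long max = Long.MIN_VALUE;
        long min = Long.MAX_VALUE;
        long sum = 0;
        for (long w : weights) {
            max = Math.max(max, w);
            min = Math.min(min, w);
            sum += w;
        }
        switch (mode) {
            case MAX: return max;
            case MIN: return min;
            default:  return sum / weights.size();
        }
    }

    /** Made-up data with duplicated terms at different weights. */
    static List<Hit> sampleIndex() {
        return Arrays.asList(
            new Hit("English", 100), new Hit("English", 99), new Hit("English", 98),
            new Hit("English", 97), new Hit("English", 96),
            new Hit("England", 50), new Hit("England", 20),
            new Hit("Englewood", 40));
    }

    public static void main(String[] args) {
        System.out.println(collapseTopN(sampleIndex(), 3, WeightCalculus.MAX));
        System.out.println(collapseTopN(sampleIndex(), 3, WeightCalculus.AVG));
    }
}
```

Because the cutoff runs on collapsed hits, asking for 3 suggestions yields 3 distinct terms whenever at least 3 exist, and the choice of max, min or avg only changes their weights and order.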
What do you think [~janhoy], [~mikemccand]? I think with field collapsing we
could be more consistent.
I will study this more; please let me know if my reasoning is missing some
important assumption :)
> AnalyzingInfixSuggester needs duplicate handling
> ------------------------------------------------
>
> Key: LUCENE-6336
> URL: https://issues.apache.org/jira/browse/LUCENE-6336
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 4.10.3, 5.0
> Reporter: Jan Høydahl
> Fix For: 5.2, master
>
> Attachments: LUCENE-6336.patch
>
>
> Spinoff from LUCENE-5833 but otherwise unrelated.
> This concerns {{AnalyzingInfixSuggester}}, which is backed by a Lucene index
> and stores payload and score together with the suggest text.
> I did some testing with Solr, producing the DocumentDictionary from an index
> with multiple documents containing the same text, but with random weights
> between 0-100. Then I got duplicate identical suggestions sorted by weight:
> {code}
> {
>   "suggest":{"languages":{
>     "engl":{
>       "numFound":101,
>       "suggestions":[{
>           "term":"<b>Engl</b>ish",
>           "weight":100,
>           "payload":"0"},
>         {
>           "term":"<b>Engl</b>ish",
>           "weight":99,
>           "payload":"0"},
>         {
>           "term":"<b>Engl</b>ish",
>           "weight":98,
>           "payload":"0"},
> ---etc all the way down to 0---
> {code}
> I also reproduced the same behavior in AnalyzingInfixSuggester directly. So
> there is a need for some duplicate removal here, either while building the
> local suggest index or during lookup. Only the highest weight suggestion for
> a given term should be returned.