[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193150#comment-15193150
 ] 

Alessandro Benedetti commented on LUCENE-6336:
----------------------------------------------

Initially I liked the idea of adding a component responsible for the 
de-duplication.
However, I would like to raise one concern: what about the number of 
suggestions?

At the moment the requested number of suggestions is the upper bound on the 
search in the auxiliary Lucene index 
(see 
org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.java:591).

This means that asking for a maximum of 5 suggestions could return 5 
duplicates (leaving the other distinct values further down in the results).
The dedupe wrapper would then collapse them and return only 1 suggestion (we 
lose the other 4 good suggestions that were ranked lower).
We risk not returning the top N suggestions requested in the configuration.
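
The concern can be illustrated with a minimal plain-Java sketch. It does not 
use any Lucene classes; `Suggestion`, `dedupe`, `topN` and the sample data are 
hypothetical names and values invented for the illustration. It only compares 
"truncate to N, then dedupe" (what a wrapper around the current lookup would 
do) with "dedupe, then truncate to N".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupeAfterTruncation {

    // Hypothetical stand-in for a suggestion; not Lucene's own result type.
    static final class Suggestion {
        final String term;
        final long weight;
        Suggestion(String term, long weight) { this.term = term; this.weight = weight; }
    }

    // Keeps the first entry seen per term. The input is assumed to be sorted
    // by descending weight, so the first entry is the highest-weight one.
    static List<Suggestion> dedupe(List<Suggestion> ranked) {
        Map<String, Suggestion> firstPerTerm = new LinkedHashMap<>();
        for (Suggestion s : ranked) {
            firstPerTerm.putIfAbsent(s.term, s);
        }
        return new ArrayList<>(firstPerTerm.values());
    }

    // Returns at most the first n entries of the ranking.
    static List<Suggestion> topN(List<Suggestion> ranked, int n) {
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }

    // Invented sample: five copies of "English" outrank four other terms.
    static List<Suggestion> sampleRanking() {
        List<Suggestion> ranked = new ArrayList<>();
        for (long w = 100; w >= 96; w--) {
            ranked.add(new Suggestion("English", w));
        }
        ranked.add(new Suggestion("England", 50));
        ranked.add(new Suggestion("Englewood", 40));
        ranked.add(new Suggestion("Englefield", 30));
        ranked.add(new Suggestion("Englishman", 20));
        return ranked;
    }

    public static void main(String[] args) {
        List<Suggestion> ranked = sampleRanking();
        // Truncate first, then dedupe: only one distinct term survives.
        System.out.println(dedupe(topN(ranked, 5)).size()); // prints 1
        // Dedupe first, then truncate: five distinct terms are returned.
        System.out.println(topN(dedupe(ranked), 5).size()); // prints 5
    }
}
```

With this sample, the wrapper approach returns a single suggestion even though 
five distinct terms matched.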

I think we should solve this on the Lucene side, by building a better query 
that uses field collapsing.
In particular, I think we should add a couple of parameters (unique=true and 
weightCalculus=max|min|avg etc.) and experiment with something similar to: 
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results .
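
As a rough sketch of the intended semantics only: the plain-Java example below 
collapses suggestions by term and reduces each group to a single weight. It is 
not a Lucene or Solr API; the `WeightCalculus` enum simply mirrors the proposed 
(and so far hypothetical) weightCalculus parameter, and `Suggestion`, 
`collapse` and the sample data are invented for the illustration. Rounding the 
average to a long is also just an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;

class CollapseByTerm {

    // Mirrors the proposed weightCalculus=max|min|avg parameter (hypothetical).
    enum WeightCalculus { MAX, MIN, AVG }

    // Hypothetical stand-in for an indexed suggestion; not a Lucene type.
    static final class Suggestion {
        final String term;
        final long weight;
        Suggestion(String term, long weight) { this.term = term; this.weight = weight; }
    }

    // Groups suggestions by term and reduces each group to a single weight.
    // Terms keep the order in which they were first seen.
    static Map<String, Long> collapse(List<Suggestion> suggestions, WeightCalculus mode) {
        Map<String, LongSummaryStatistics> statsPerTerm = new LinkedHashMap<>();
        for (Suggestion s : suggestions) {
            statsPerTerm.computeIfAbsent(s.term, t -> new LongSummaryStatistics()).accept(s.weight);
        }
        Map<String, Long> collapsed = new LinkedHashMap<>();
        for (Map.Entry<String, LongSummaryStatistics> e : statsPerTerm.entrySet()) {
            LongSummaryStatistics stats = e.getValue();
            long weight;
            switch (mode) {
                case MAX: weight = stats.getMax(); break;
                case MIN: weight = stats.getMin(); break;
                default:  weight = Math.round(stats.getAverage()); break; // AVG
            }
            collapsed.put(e.getKey(), weight);
        }
        return collapsed;
    }

    // Invented sample data: two terms with several weights each.
    static List<Suggestion> sample() {
        List<Suggestion> list = new ArrayList<>();
        list.add(new Suggestion("English", 100));
        list.add(new Suggestion("English", 99));
        list.add(new Suggestion("English", 98));
        list.add(new Suggestion("England", 50));
        list.add(new Suggestion("England", 40));
        return list;
    }

    public static void main(String[] args) {
        System.out.println(collapse(sample(), WeightCalculus.MAX)); // {English=100, England=50}
        System.out.println(collapse(sample(), WeightCalculus.MIN)); // {English=98, England=40}
        System.out.println(collapse(sample(), WeightCalculus.AVG)); // {English=99, England=45}
    }
}
```

The point of collapsing inside the query is that the limit of N would then 
apply to distinct terms rather than to raw index hits.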

What do you think, [~janhoy], [~mikemccand]? I think field collapsing would 
give us more consistent results.
I will study this further; please let me know if my reasoning is missing some 
important assumption :)




> AnalyzingInfixSuggester needs duplicate handling
> ------------------------------------------------
>
>                 Key: LUCENE-6336
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6336
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.10.3, 5.0
>            Reporter: Jan Høydahl
>             Fix For: 5.2, master
>
>         Attachments: LUCENE-6336.patch
>
>
> Spinoff from LUCENE-5833 but otherwise unrelated.
> I am using {{AnalyzingInfixSuggester}}, which is backed by a Lucene index and 
> stores payload and score together with the suggest text.
> I did some testing with Solr, producing the DocumentDictionary from an index 
> with multiple documents containing the same text, but with random weights 
> between 0 and 100. I then got identical duplicate suggestions sorted by weight:
> {code}
> {
>   "suggest":{"languages":{
>       "engl":{
>         "numFound":101,
>         "suggestions":[{
>             "term":"<b>Engl</b>ish",
>             "weight":100,
>             "payload":"0"},
>           {
>             "term":"<b>Engl</b>ish",
>             "weight":99,
>             "payload":"0"},
>           {
>             "term":"<b>Engl</b>ish",
>             "weight":98,
>             "payload":"0"},
> ---etc all the way down to 0---
> {code}
> I also reproduced the same behavior in AnalyzingInfixSuggester directly. So 
> there is a need for some duplicate removal here, either while building the 
> local suggest index or during lookup. Only the highest weight suggestion for 
> a given term should be returned.



