[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2018-11-17 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690571#comment-16690571
 ] 

Mike Sokolov commented on LUCENE-6336:
--

I dug into this a bit. It seems that we already provide 
{{SortedInputIterator}} in Lucene-land, but it is not used by 
{{DocumentExpressionDictionary}} and its factory. It seems to me that the 
factory could expose an option for de-duping. Wouldn't want to make it the default, since 
your dictionary might already be unique and you wouldn't want to pay the 
penalty for sorting in that case. If we agree that is the solution, I think 
this issue should get moved over to Solr, and in that case the unit test in the 
patch isn't really pointing at the problem.

It's certainly possible to subclass {{DocumentExpressionDictionaryFactory}} 
and {{DocumentExpressionDictionary}}, overriding {{create(...)}} and 
{{getEntryIterator()}} to wrap the original iterator with 
{{SortedInputIterator}}, but this does require some Java programming.
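
As a rough illustration of the opt-in idea, here is a minimal plain-Java sketch. It deliberately does not use the Lucene or Solr classes: {{Entry}}, {{OptInDedupSketch}} and the {{dedup}} flag are hypothetical names chosen for this example. With the flag off, the source iterator is handed back untouched, so a dictionary that is already unique pays no sorting cost. With the flag on, entries are buffered, sorted, and collapsed to the highest weight per term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch in plain Java; none of these names are Lucene or Solr APIs.
class OptInDedupSketch {

  /** Stand-in for one dictionary entry: suggestion text plus weight. */
  static final class Entry {
    final String term;
    final long weight;
    Entry(String term, long weight) { this.term = term; this.weight = weight; }
    @Override public String toString() { return term + "/" + weight; }
  }

  /**
   * Returns the source iterator unchanged unless dedup is requested, so the
   * sorting cost is only paid by callers that opt in.
   */
  static Iterator<Entry> entryIterator(Iterator<Entry> source, boolean dedup) {
    if (!dedup) {
      return source;
    }
    // Buffer everything, then order by term and by descending weight.
    List<Entry> buffered = new ArrayList<>();
    while (source.hasNext()) {
      buffered.add(source.next());
    }
    buffered.sort((a, b) -> {
      int byTerm = a.term.compareTo(b.term);
      return byTerm != 0 ? byTerm : Long.compare(b.weight, a.weight);
    });
    // The first entry of each run of equal terms carries the highest weight.
    List<Entry> unique = new ArrayList<>();
    String previous = null;
    for (Entry e : buffered) {
      if (!e.term.equals(previous)) {
        unique.add(e);
        previous = e.term;
      }
    }
    return unique.iterator();
  }

  public static void main(String[] args) {
    List<Entry> docs = Arrays.asList(
        new Entry("English", 98), new Entry("English", 100), new Entry("French", 40));
    entryIterator(docs.iterator(), true).forEachRemaining(System.out::println);
  }
}
```

A real implementation would presumably sort offline rather than in memory; the in-memory list here only keeps the sketch short.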


> AnalyzingInfixSuggester needs duplicate handling
> 
>
> Key: LUCENE-6336
> URL: https://issues.apache.org/jira/browse/LUCENE-6336
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 4.10.3, 5.0
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: lookup, suggester
> Attachments: LUCENE-6336.patch
>
>
> Spinoff from LUCENE-5833 but else unrelated.
> Using {{AnalyzingInfixSuggester}} which is backed by a Lucene index and 
> stores payload and score together with the suggest text.
> I did some testing with Solr, producing the DocumentDictionary from an index 
> with multiple documents containing the same text, but with random weights 
> between 0-100. Then I got duplicate identical suggestions sorted by weight:
> {code}
> {
>   "suggest":{"languages":{
>   "engl":{
> "numFound":101,
> "suggestions":[{
> "term":"English",
> "weight":100,
> "payload":"0"},
>   {
> "term":"English",
> "weight":99,
> "payload":"0"},
>   {
> "term":"English",
> "weight":98,
> "payload":"0"},
> ---etc all the way down to 0---
> {code}
> I also reproduced the same behavior in AnalyzingInfixSuggester directly. So 
> there is a need for some duplicate removal here, either while building the 
> local suggest index or during lookup. Only the highest weight suggestion for 
> a given term should be returned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2018-11-06 Thread JIRA


[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676877#comment-16676877
 ] 

Samuel Solís commented on LUCENE-6336:
--

Hi,

I'm a new Solr user and this is my first comment on an issue. Sorry if my 
knowledge is not good enough to report an issue.

I've created a suggest system like the one described in the issue and the 
problem is exactly the same. I have configured a BlendedInfixLookupFactory with 
a multivalued field and DocumentExpressionDictionaryFactory as the 
dictionaryImpl. The problem is that the suggestions contain duplicates if the 
weights are different, which I think is bad behavior. The idea of removing 
duplicates using params like "_unique=true and weightCalculus=max|min|avg_" 
seems nice.

I know that the issue was filed against version 5.0, but I'm using 6.6 and the 
problem is still present and not resolved yet. How can I help? I'm not a Java 
developer (I'm a developer but I don't use Java), but I can test something if 
you want, or create tests or something. Or if somebody knows a better solution, 
we could discuss it.

Thanks!




[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2016-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337696#comment-15337696
 ] 

Michael McCandless commented on LUCENE-6336:


We could explore field collapsing / grouping, but that's maybe somewhat tricky 
to do with early termination (see LUCENE-7341) and it's somewhat wasteful ... 
it seems better to dedup once at indexing time?  And if it's a simple wrapper 
around the dictionary, other suggesters could just use that too

> AnalyzingInfixSuggester needs duplicate handling
> 
>
> Key: LUCENE-6336
> URL: https://issues.apache.org/jira/browse/LUCENE-6336
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 4.10.3, 5.0
>Reporter: Jan Høydahl
>  Labels: lookup, suggester
> Attachments: LUCENE-6336.patch
>
>



[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2016-03-14 Thread Alessandro Benedetti (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193150#comment-15193150
 ] 

Alessandro Benedetti commented on LUCENE-6336:
--

Initially I liked the idea of adding a component responsible for 
de-duplication. But I would like to raise a concern: what about the number of 
suggestions?

At the moment the requested number of suggestions is the upper bound for the 
search in the auxiliary Lucene index (see 
org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.java:591).

This means that retrieving a maximum of 5 suggestions could return 5 
duplicates (leaving other values in the remaining results). The dedupe wrapper 
would then dedupe and return only 1 suggestion (we lose the other 4 good 
suggestions that were lower in the ranking). We risk not covering the top N 
requested in the configuration.

I was thinking we should solve this on the Lucene side, building a better query 
using field collapsing. In particular I think we should add a couple of 
parameters (unique=true and weightCalculus=max|min|avg etc.) and play with 
something similar to: 
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results .

What do you think [~janhoy], [~mikemccand]? I think with field collapsing we 
could be more consistent. I will study this more; please tell me if my 
reasoning is missing some important assumption :)
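
To make the top-N concern concrete, here is a small plain-Java sketch. It is not Lucene code; the class and method names are hypothetical, and a ranked list of strings stands in for the hits coming back from the auxiliary index. It contrasts cutting off at N before de-duplicating with de-duplicating before the cut-off.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch in plain Java; the ranked list stands in for index hits.
class TopNDedupOrderSketch {

  /** Takes the top n hits first and removes duplicates afterwards. */
  static List<String> limitThenDedup(List<String> rankedTerms, int n) {
    return rankedTerms.stream().limit(n).distinct().collect(Collectors.toList());
  }

  /** Removes duplicates first and only then takes the top n. */
  static List<String> dedupThenLimit(List<String> rankedTerms, int n) {
    return rankedTerms.stream().distinct().limit(n).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Five copies of the same term outrank two other candidates.
    List<String> ranked = Arrays.asList(
        "English", "English", "English", "English", "English",
        "Engineering", "Engraving");
    System.out.println(limitThenDedup(ranked, 5)); // [English]
    System.out.println(dedupThenLimit(ranked, 5)); // [English, Engineering, Engraving]
  }
}
```

With the cut-off applied first, a request for 5 suggestions yields a single result even though two more distinct candidates exist lower in the ranking.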




> AnalyzingInfixSuggester needs duplicate handling
> 
>
> Key: LUCENE-6336
> URL: https://issues.apache.org/jira/browse/LUCENE-6336
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 4.10.3, 5.0
>Reporter: Jan Høydahl
> Fix For: 5.2, master
>
> Attachments: LUCENE-6336.patch
>
>



[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2015-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356985#comment-14356985
 ] 

Michael McCandless commented on LUCENE-6336:


I think whether a given suggester dedups is really up to each impl.

But separately I think it makes sense to enable dedup for AIS somehow.

Or maybe we add a DedupDictionaryWrapper, which does an offline sort to remove 
dups?  This way we can dedup for any suggester that doesn't handle it itself... 
and we keep the "simplicity in responsibility" for AIS.
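
The wrapper idea could look roughly like the following plain-Java sketch. This is not an existing Lucene class and not the proposed {{DedupDictionaryWrapper}} itself; {{Entry}} and {{DedupIterator}} are hypothetical names. It assumes the input has already been sorted by term (for example by an offline sort) and collapses each run of equal terms into one entry with the maximum weight, reading the source lazily.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch in plain Java; not an existing Lucene class.
class SortedRunDedupSketch {

  /** Stand-in for one dictionary entry: suggestion text plus weight. */
  static final class Entry {
    final String term;
    final long weight;
    Entry(String term, long weight) { this.term = term; this.weight = weight; }
    @Override public String toString() { return term + "/" + weight; }
  }

  /**
   * Wraps an iterator whose entries are already sorted by term and collapses
   * each run of equal terms into one entry carrying the maximum weight.
   */
  static final class DedupIterator implements Iterator<Entry> {
    private final Iterator<Entry> sorted;
    private Entry pending; // first entry of the next run, already read from the source

    DedupIterator(Iterator<Entry> sorted) {
      this.sorted = sorted;
      this.pending = sorted.hasNext() ? sorted.next() : null;
    }

    @Override public boolean hasNext() {
      return pending != null;
    }

    @Override public Entry next() {
      if (pending == null) {
        throw new NoSuchElementException();
      }
      String term = pending.term;
      long max = pending.weight;
      pending = null;
      // Consume the rest of the run, remembering the largest weight seen.
      while (sorted.hasNext()) {
        Entry e = sorted.next();
        if (!e.term.equals(term)) {
          pending = e; // start of the next run
          break;
        }
        max = Math.max(max, e.weight);
      }
      return new Entry(term, max);
    }
  }

  public static void main(String[] args) {
    List<Entry> sortedInput = Arrays.asList(
        new Entry("duplicate", 8), new Entry("duplicate", 12),
        new Entry("duplicate", 12), new Entry("other", 3));
    new DedupIterator(sortedInput.iterator()).forEachRemaining(System.out::println);
  }
}
```

Because the wrapper only looks at adjacent entries, it needs no extra memory beyond one pending entry, which is why the sort has to happen first.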

> AnalyzingInfixSuggester needs duplicate handling
> 
>
> Key: LUCENE-6336
> URL: https://issues.apache.org/jira/browse/LUCENE-6336
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 4.10.3, 5.0
>Reporter: Jan Høydahl
> Fix For: Trunk, 5.1
>
> Attachments: LUCENE-6336.patch
>
>



[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2015-03-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357689#comment-14357689
 ] 

Jan Høydahl commented on LUCENE-6336:
-

bq. Or maybe we add a DedupDictionaryWrapper, which does an offline sort to 
remove dups?...
This makes sense. It would even allow an {{AvgScoreDedupDictionaryWrapper}} 
which dedupes entries returning the average score instead of max.
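
A minimal plain-Java sketch of the averaging variant, under assumptions: {{Weighted}} and {{averageWeights}} are hypothetical names, nothing here is a Lucene API, and integer division is used for the average, which rounds toward zero. Whether a real wrapper should round differently is left open.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch in plain Java; not an existing Lucene class.
class AvgWeightDedupSketch {

  /** Stand-in for one dictionary entry: suggestion text plus weight. */
  static final class Weighted {
    final String term;
    final long weight;
    Weighted(String term, long weight) { this.term = term; this.weight = weight; }
  }

  /** Maps each distinct term to the average of its weights (integer division). */
  static Map<String, Long> averageWeights(List<Weighted> entries) {
    // Per term: index 0 holds the running sum, index 1 the count.
    Map<String, long[]> sumAndCount = new LinkedHashMap<>();
    for (Weighted w : entries) {
      long[] acc = sumAndCount.computeIfAbsent(w.term, k -> new long[2]);
      acc[0] += w.weight;
      acc[1]++;
    }
    Map<String, Long> averages = new LinkedHashMap<>();
    for (Map.Entry<String, long[]> e : sumAndCount.entrySet()) {
      averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
    }
    return averages;
  }

  public static void main(String[] args) {
    List<Weighted> input = Arrays.asList(
        new Weighted("duplicate", 8), new Weighted("duplicate", 12),
        new Weighted("duplicate", 12));
    System.out.println(averageWeights(input)); // {duplicate=10}
  }
}
```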




[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2015-03-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356754#comment-14356754
 ] 

Shai Erera commented on LUCENE-6336:


So I wrote these two simple tests:

+FuzzySuggester+
{code}
  public void testDuplicateInput() throws Exception {
    // Three entries with the same text; two of them share the top weight.
    Input[] keys = new Input[] {
      new Input("duplicate", 8),
      new Input("duplicate", 12),
      new Input("duplicate", 12),
    };

    Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.WHITESPACE,
        true, MockTokenFilter.ENGLISH_STOPSET);
    FuzzySuggester suggester = new FuzzySuggester(analyzer, analyzer,
        AnalyzingSuggester.EXACT_FIRST | AnalyzingSuggester.PRESERVE_SEP, 256,
        -1, false, FuzzySuggester.DEFAULT_MAX_EDITS,
        FuzzySuggester.DEFAULT_TRANSPOSITIONS,
        FuzzySuggester.DEFAULT_NON_FUZZY_PREFIX,
        FuzzySuggester.DEFAULT_MIN_FUZZY_LENGTH,
        FuzzySuggester.DEFAULT_UNICODE_AWARE);
    suggester.build(new InputArrayIterator(keys));

    List<LookupResult> results =
        suggester.lookup(TestUtil.stringToCharSequence("dup", random()), false, 1);
    System.out.println(results);

    analyzer.close();
  }
{code}

This prints:

{code}
[duplicate/12]
{code}

+AnalyzingInfixSuggester+
{code}
  public void testDuplicateInput() throws Exception {
    // Same input as the FuzzySuggester test above.
    Input[] keys = new Input[] {
      new Input("duplicate", 8),
      new Input("duplicate", 12),
      new Input("duplicate", 12),
    };

    Analyzer a = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, false);
    AnalyzingInfixSuggester suggester =
        new AnalyzingInfixSuggester(newDirectory(), a, a, 3, false);
    suggester.build(new InputArrayIterator(keys));

    List<LookupResult> results =
        suggester.lookup(TestUtil.stringToCharSequence("dup", random()), 10, true, true);
    System.out.println(results);

    suggester.close();
  }
{code}

Prints:

{code}
[duplicate/12, duplicate/12, duplicate/8]
{code}

Both tests use an {{InputArrayIterator}} and the same {{.build()}} API - the 
only thing that's different is the Suggester type. So if I think about a 
component in my software that gets a {{Lookup}} and uses the common API to 
populate values in it and lookup, it shouldn't care about the type of the 
Lookup instance (right?).

Would be good if we can be consistent IMO, but I know that there is a 
fundamental difference between the two suggesters -- Fuzzy builds an FST, which 
I think is the component that resolves the duplicates, while 
AnalyzingInfixSuggester builds an index. Perhaps in its createResult method it 
can add the results to a Set (instead of, or in addition to, the List) to 
resolve the duplicates at lookup time. Of course it would be better if it can 
detect the duplicates at build() or .add() time and avoid their matches in the 
first place.

Usually suggesters handle unique values, and the question is who should ensure 
the values they are given as input are unique -- is it the Suggester or the 
user. That FuzzySuggester happens to use a data structure that resolves the 
duplicates is a side effect IMO. AnalyzingInfixSuggester also takes a context 
with each value to suggest, so the value "foo" isn't the same if input twice 
with different contexts. Therefore it's more involved IMO with AnalyzingInfix 
vs Fuzzy ...

I'm also not sure that the Suggester is the one that should take care of 
uniqueness because the added logic will be executed for all users, whether 
their input values are unique or not. But if for example we could have 
DocumentDictionary resolve duplicates, then we would let the suggester do 
what it should do -- suggest from a given list of values. I like that 
simplicity in responsibility.
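
For the lookup-time option, a plain-Java sketch under assumptions: {{Result}} is a hypothetical stand-in for a lookup result, not the Lucene class, and results are assumed to arrive ordered by descending weight. A result counts as a duplicate only if both its text and its contexts have been seen before, so "foo" under two different contexts is kept twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch in plain Java; Result is not the Lucene LookupResult class.
class LookupTimeDedupSketch {

  /** Stand-in for one lookup result: text, weight and the contexts it was indexed with. */
  static final class Result {
    final String key;
    final long weight;
    final Set<String> contexts;
    Result(String key, long weight, String... contexts) {
      this.key = key;
      this.weight = weight;
      this.contexts = new HashSet<>(Arrays.asList(contexts));
    }
    @Override public String toString() { return key + "/" + weight; }
  }

  /**
   * Keeps the first result for each (text, contexts) pair. With input ordered by
   * descending weight, the survivor is the highest-weighted one.
   */
  static List<Result> dedup(List<Result> rankedResults) {
    Set<List<Object>> seen = new HashSet<>();
    List<Result> unique = new ArrayList<>();
    for (Result r : rankedResults) {
      // Set.add returns false when the identity has already been recorded.
      if (seen.add(Arrays.<Object>asList(r.key, r.contexts))) {
        unique.add(r);
      }
    }
    return unique;
  }

  public static void main(String[] args) {
    List<Result> ranked = Arrays.asList(
        new Result("duplicate", 12), new Result("duplicate", 12),
        new Result("duplicate", 8),
        new Result("foo", 5, "a"), new Result("foo", 4, "b"));
    System.out.println(dedup(ranked)); // [duplicate/12, foo/5, foo/4]
  }
}
```

This runs on every lookup, for every user, which is the cost weighed above against resolving duplicates once in the dictionary.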


[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2015-03-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356523#comment-14356523
 ] 

Jan Høydahl commented on LUCENE-6336:
-

[~mikemccand] what's your view on this one?




[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2015-03-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346900#comment-14346900
 ] 

Jan Høydahl commented on LUCENE-6336:
-

bq. Not a bug: DocumentDictionary etc suggests documents, not terms.

What you say implies that the field you use for the suggested terms must be 
100% unique across the main document index. Suggesters are typically used to 
suggest e.g. authors, languages, categories... Indeed, the example in 
https://cwiki.apache.org/confluence/display/solr/Suggester does just this, 
suggesting categories using price field as weight, and {{DocumentDictionary}}. 
But FuzzySuggester is not index-based so it does not reveal this bug.






[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling

2015-03-04 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346879#comment-14346879
 ] 

Robert Muir commented on LUCENE-6336:
-

Not a bug: DocumentDictionary etc suggests documents, not terms.
