[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690571#comment-16690571 ]

Mike Sokolov commented on LUCENE-6336:
--------------------------------------

I dug into this a bit. It seems that we already provide {{SortedInputIterator}} in Lucene-land, but it is not used by {{DocumentExpressionDictionary}} and its factory. It seems to me that the factory could expose an option for de-duping. We wouldn't want to make it the default, since your dictionary might already be unique, and in that case you wouldn't want to pay the penalty for sorting.

If we agree that this is the solution, I think this issue should be moved over to Solr, and in that case the unit test in the patch isn't really pointing at the problem.

It's certainly possible to subclass {{DocumentExpressionDictionaryFactory}} and {{DocumentExpressionDictionary}}, overriding {{create(...)}} and {{getEntryIterator()}} to wrap the original iterator with {{SortedInputIterator}}, but this does require some Java programming.

> AnalyzingInfixSuggester needs duplicate handling
> ------------------------------------------------
>
>                 Key: LUCENE-6336
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6336
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.10.3, 5.0
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Major
>              Labels: lookup, suggester
>         Attachments: LUCENE-6336.patch
>
> Spinoff from LUCENE-5833 but otherwise unrelated.
> Using {{AnalyzingInfixSuggester}}, which is backed by a Lucene index and stores payload and score together with the suggest text.
> I did some testing with Solr, producing the DocumentDictionary from an index with multiple documents containing the same text, but with random weights between 0-100. Then I got duplicate identical suggestions sorted by weight:
> {code}
> {
>   "suggest":{"languages":{
>     "engl":{
>       "numFound":101,
>       "suggestions":[{
>           "term":"English",
>           "weight":100,
>           "payload":"0"},
>         {
>           "term":"English",
>           "weight":99,
>           "payload":"0"},
>         {
>           "term":"English",
>           "weight":98,
>           "payload":"0"},
> ---etc all the way down to 0---
> {code}
> I also reproduced the same behavior in AnalyzingInfixSuggester directly. So there is a need for some duplicate removal here, either while building the local suggest index or during lookup. Only the highest weight suggestion for a given term should be returned.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
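The opt-in de-duping discussed in the comment above can be pictured with a small sketch. This is plain Java, not Lucene code: {{DictEntry}}, {{SortedDedup}} and the {{dedupe}} flag are hypothetical names introduced only for illustration, and the in-memory sort stands in for whatever sorting the real iterator would do. The point it shows is that sorting makes equal terms adjacent, so duplicates can be dropped in one pass, and that callers whose input is already unique can skip the sort entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for one dictionary entry; not a Lucene class. */
final class DictEntry {
    final String term;
    final long weight;

    DictEntry(String term, long weight) {
        this.term = term;
        this.weight = weight;
    }

    @Override
    public String toString() {
        return term + "/" + weight;
    }
}

final class SortedDedup {
    /**
     * Returns the source list untouched when dedupe is false (no sorting cost).
     * Otherwise sorts by term and keeps only the highest weight for each term.
     */
    static List<DictEntry> entries(List<DictEntry> source, boolean dedupe) {
        if (!dedupe) {
            return source;
        }
        List<DictEntry> sorted = new ArrayList<>(source);
        // Term ascending, then weight descending: the first entry of each run of
        // equal terms carries the maximum weight.
        sorted.sort(Comparator.comparing((DictEntry e) -> e.term)
            .thenComparing(Comparator.comparingLong((DictEntry e) -> e.weight).reversed()));
        List<DictEntry> out = new ArrayList<>();
        for (DictEntry e : sorted) {
            if (out.isEmpty() || !out.get(out.size() - 1).term.equals(e.term)) {
                out.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<DictEntry> in = Arrays.asList(
            new DictEntry("English", 98),
            new DictEntry("French", 7),
            new DictEntry("English", 100));
        System.out.println(entries(in, true)); // [English/100, French/7]
    }
}
```

Keeping the flag off by default matches the concern raised above: a dictionary that is already unique should not pay for a sort it does not need.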
[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676877#comment-16676877 ]

Samuel Solís commented on LUCENE-6336:
--------------------------------------

Hi, I'm a new Solr user and this is my first comment on an issue, so apologies if I don't know enough to report this properly.

I have built a suggest system like the one described in this issue, and the problem is exactly the same. I have configured a BlendedInfixLookupFactory with a multivalued field, and DocumentExpressionDictionaryFactory as the dictionaryImpl. The problem is that the suggestions contain duplicates when the weights differ, which I think is bad behavior.

The idea of removing duplicates using params like "_unique=true and weightCalculus=max|min|avg_" seems nice.

I know that the issue was filed against 5.0, but I'm using 6.6 and the problem is still present and unresolved. How can I help? I'm not a Java developer (I am a developer, but I don't use Java), but I can test something if you want, or write tests. Or, if somebody knows a better solution, we could discuss it.

Thanks!
[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337696#comment-15337696 ]

Michael McCandless commented on LUCENE-6336:
--------------------------------------------

We could explore field collapsing / grouping, but that's maybe somewhat tricky to do with early termination (see LUCENE-7341) and it's somewhat wasteful ... it seems better to dedup once at indexing time?

And if it's a simple wrapper around the dictionary, other suggesters could just use that too.
[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193150#comment-15193150 ]

Alessandro Benedetti commented on LUCENE-6336:
----------------------------------------------

Initially I liked the idea of adding a component responsible for the de-duplication. But I would like to raise a consideration: what about the number of suggestions?

At the moment the requested number of suggestions is the upper bound for the search in the auxiliary Lucene index (see org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.java:591). This means that retrieving a maximum of 5 suggestions could return 5 duplicates (leaving other values in the remaining results). The dedupe wrapper would then dedupe and return only 1 suggestion, and we would lose the other 4 good suggestions that were lower in the ranking. We risk not covering the top N requested in the configuration.

I was thinking we should solve this on the Lucene side, building a better query using field collapsing. In particular, I think we should add a couple of parameters (unique=true and weightCalculus=max|min|avg etc.) and play with something similar to:
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

What do you think, [~janhoy], [~mikemccand]? I think with field collapsing we could be more consistent. I will study this more; please let me know if my reasoning is missing some important assumption :)

> AnalyzingInfixSuggester needs duplicate handling
> ------------------------------------------------
>
>                 Key: LUCENE-6336
>             Fix For: 5.2, master
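The top-N concern raised in the comment above comes down to the order of two steps. The sketch below is plain Java with hypothetical names ({{TopNOrdering}}, {{capThenDedupe}}, {{dedupeThenCap}}) and made-up sample terms; it does not use the Lucene API. It assumes the hits arrive already ranked by descending weight, and it only applies to de-duplication that runs after the lookup has been capped at N.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

final class TopNOrdering {
    /** Cap the ranked hits at n first, then drop repeated terms. */
    static List<String> capThenDedupe(List<String> ranked, int n) {
        List<String> capped = ranked.subList(0, Math.min(n, ranked.size()));
        return new ArrayList<>(new LinkedHashSet<>(capped));
    }

    /** Drop repeated terms first (keeping first-seen order), then cap at n. */
    static List<String> dedupeThenCap(List<String> ranked, int n) {
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(ranked));
        return new ArrayList<>(unique.subList(0, Math.min(n, unique.size())));
    }

    public static void main(String[] args) {
        // Terms in descending weight order; "English" occupies the first five slots.
        List<String> ranked = Arrays.asList(
            "English", "English", "English", "English", "English",
            "Engineering", "Engine", "Engraving", "Engels");
        System.out.println(capThenDedupe(ranked, 5)); // [English]
        System.out.println(dedupeThenCap(ranked, 5)); // [English, Engineering, Engine, Engraving, Engels]
    }
}
```

De-duplicating the dictionary before the suggest index is built sidesteps the problem, because the index then never contains the repeated terms in the first place.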
[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356985#comment-14356985 ]

Michael McCandless commented on LUCENE-6336:
--------------------------------------------

I think whether a given suggester dedups is really up to each impl.

But separately I think it makes sense to enable dedup for AIS somehow.

Or maybe we add a DedupDictionaryWrapper, which does an offline sort to remove dups? This way we can dedup for any suggester that doesn't handle it itself... and we keep the "simplicity in responsibility" for AIS.

> AnalyzingInfixSuggester needs duplicate handling
> ------------------------------------------------
>
>                 Key: LUCENE-6336
>             Fix For: Trunk, 5.1
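As a rough picture of the wrapper idea in the comment above, here is a minimal sketch in plain Java. The class name is borrowed from the comment, but everything else is an assumption: the constructor, the use of {{Map.Entry<String, Long>}} as the entry type, and above all the in-memory merge, which replaces the offline sort the comment actually proposes and would not scale to large dictionaries.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical wrapper: name taken from the comment above, API is an assumption. */
final class DedupDictionaryWrapper implements Iterator<Map.Entry<String, Long>> {
    private final Iterator<Map.Entry<String, Long>> delegate;

    DedupDictionaryWrapper(Iterator<Map.Entry<String, Long>> source) {
        // In-memory merge for brevity; an offline sort would replace this step.
        Map<String, Long> best = new LinkedHashMap<>();
        while (source.hasNext()) {
            Map.Entry<String, Long> e = source.next();
            // Keep the highest weight seen so far for each term.
            best.merge(e.getKey(), e.getValue(), Long::max);
        }
        this.delegate = best.entrySet().iterator();
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public Map.Entry<String, Long> next() {
        return delegate.next();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> in = Arrays.asList(
            new AbstractMap.SimpleEntry<String, Long>("duplicate", 8L),
            new AbstractMap.SimpleEntry<String, Long>("duplicate", 12L),
            new AbstractMap.SimpleEntry<String, Long>("other", 3L));
        Iterator<Map.Entry<String, Long>> it = new DedupDictionaryWrapper(in.iterator());
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            System.out.println(e.getKey() + "/" + e.getValue());
        }
    }
}
```

Because the wrapper exposes the same iterator shape as its source, any consumer of the source could take the wrapped version instead, which is the reuse argument made above.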
[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357689#comment-14357689 ]

Jan Høydahl commented on LUCENE-6336:
-------------------------------------

bq. Or maybe we add a DedupDictionaryWrapper, which does an offline sort to remove dups?...

This makes sense. It would even allow an {{AvgScoreDedupDictionaryWrapper}}, which dedupes entries and returns the average score instead of the max.
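The choice between maximum and average weight mentioned above can be expressed as a merge mode. The following is a sketch under assumptions, not an existing Lucene or Solr feature: {{WeightMerge}}, its {{Mode}} enum and the parallel-array input are hypothetical, and rounding the average to the nearest long is an arbitrary choice made here because suggester weights are integral.

```java
import java.util.LinkedHashMap;
import java.util.LongSummaryStatistics;
import java.util.Map;

final class WeightMerge {
    /** Hypothetical merge modes for the weights of duplicate terms. */
    enum Mode { MAX, MIN, AVG }

    /** Collapses duplicate terms into one weight each, according to mode. */
    static Map<String, Long> merge(String[] terms, long[] weights, Mode mode) {
        Map<String, LongSummaryStatistics> stats = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            stats.computeIfAbsent(terms[i], t -> new LongSummaryStatistics())
                .accept(weights[i]);
        }
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, LongSummaryStatistics> e : stats.entrySet()) {
            LongSummaryStatistics s = e.getValue();
            long w;
            switch (mode) {
                case MAX: w = s.getMax(); break;
                case MIN: w = s.getMin(); break;
                default:  w = Math.round(s.getAverage()); break; // AVG, rounded
            }
            out.put(e.getKey(), w);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"duplicate", "duplicate", "duplicate"};
        long[] weights = {8, 12, 12};
        System.out.println(merge(terms, weights, Mode.MAX)); // {duplicate=12}
        System.out.println(merge(terms, weights, Mode.AVG)); // {duplicate=11}
    }
}
```

With weights 8, 12 and 12 the average is 32/3, which rounds to 11, so the AVG mode ranks the term slightly lower than MAX would.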
[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356754#comment-14356754 ]

Shai Erera commented on LUCENE-6336:
------------------------------------

So I wrote these two simple tests:

+FuzzySuggester+
{code}
public void testDuplicateInput() throws Exception {
  // The same key three times, with two different weights.
  Input keys[] = new Input[] {
    new Input("duplicate", 8),
    new Input("duplicate", 12),
    new Input("duplicate", 12),
  };

  Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, true, MockTokenFilter.ENGLISH_STOPSET);
  FuzzySuggester suggester = new FuzzySuggester(analyzer, analyzer,
      AnalyzingSuggester.EXACT_FIRST | AnalyzingSuggester.PRESERVE_SEP, 256, -1, false,
      FuzzySuggester.DEFAULT_MAX_EDITS, FuzzySuggester.DEFAULT_TRANSPOSITIONS,
      FuzzySuggester.DEFAULT_NON_FUZZY_PREFIX, FuzzySuggester.DEFAULT_MIN_FUZZY_LENGTH,
      FuzzySuggester.DEFAULT_UNICODE_AWARE);
  suggester.build(new InputArrayIterator(keys));

  List<LookupResult> results = suggester.lookup(TestUtil.stringToCharSequence("dup", random()), false, 1);
  System.out.println(results);
  analyzer.close();
}
{code}

This prints:
{code}
[duplicate/12]
{code}

+AnalyzingInfixSuggester+
{code}
public void testDuplicateInput() throws Exception {
  // Same input as the FuzzySuggester test above.
  Input keys[] = new Input[] {
    new Input("duplicate", 8),
    new Input("duplicate", 12),
    new Input("duplicate", 12),
  };

  Analyzer a = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, false);
  AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(newDirectory(), a, a, 3, false);
  suggester.build(new InputArrayIterator(keys));

  List<LookupResult> results = suggester.lookup(TestUtil.stringToCharSequence("dup", random()), 10, true, true);
  System.out.println(results);
  suggester.close();
}
{code}

Prints:
{code}
[duplicate/12, duplicate/12, duplicate/8]
{code}

Both tests use an {{InputArrayIterator}} and the same {{.build()}} API; the only thing that differs is the Suggester type. So if I think about a component in my software that gets a {{Lookup}} and uses the common API to populate values in it and look them up, it shouldn't care about the type of the Lookup instance (right?).

It would be good if we could be consistent IMO, but I know that there is a fundamental difference between the two suggesters: Fuzzy builds an FST, which I think is the component that resolves the duplicates, while AnalyzingInfixSuggester builds an index. Perhaps in its createResult method it can add the results to a Set (instead of, or in addition to, the List) to resolve the duplicates at lookup time. Of course it would be better if it could detect the duplicates at build() or .add() time and avoid their matches in the first place.

Usually suggesters handle unique values, and the question is who should ensure that the values they are given as input are unique: the Suggester or the user. The fact that FuzzySuggester happens to use a data structure that resolves the duplicates is a side effect IMO. AnalyzingInfixSuggester also takes a context with each value to suggest, so the value "foo" isn't the same if it is input twice with different contexts. Therefore it's more involved IMO with AnalyzingInfix than with Fuzzy ...

I'm also not sure that the Suggester is the one that should take care of uniqueness, because the added logic would be executed for all users, whether their input values are unique or not. But if, for example, we could have DocumentDictionary resolve duplicates, then we would leave the suggester to do what it should do: suggest from a given list of values. I like that simplicity in responsibility.
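The lookup-time alternative mentioned in Shai Erera's comment above (collecting results through a Set) might look roughly like the sketch below. It is plain Java rather than a patch to {{AnalyzingInfixSuggester}}: {{LookupDedup}}, {{Result}} and {{firstPerKey}} are hypothetical names, the input is assumed to be ranked by descending weight already, and contexts are ignored even though the comment notes they complicate what counts as a duplicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LookupDedup {
    /** Hypothetical result type holding a suggestion key and its weight. */
    static final class Result {
        final String key;
        final long value;

        Result(String key, long value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + "/" + value;
        }
    }

    /**
     * Keeps the first (highest-ranked) result for each key and stops once
     * num distinct keys have been collected.
     */
    static List<Result> firstPerKey(List<Result> ranked, int num) {
        Set<String> seen = new HashSet<>();
        List<Result> out = new ArrayList<>();
        for (Result r : ranked) {
            if (out.size() >= num) {
                break;
            }
            if (seen.add(r.key)) { // add() returns false for a key already seen
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Result> ranked = Arrays.asList(
            new Result("duplicate", 12),
            new Result("duplicate", 12),
            new Result("duplicate", 8));
        System.out.println(firstPerKey(ranked, 10)); // [duplicate/12]
    }
}
```

This removes duplicates from the response, but the filtering cost is paid on every lookup, which is the trade-off against de-duplicating once when the dictionary is read.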
[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356523#comment-14356523 ]

Jan Høydahl commented on LUCENE-6336:
-------------------------------------

[~mikemccand] what's your view on this one?
[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346900#comment-14346900 ]

Jan Høydahl commented on LUCENE-6336:
-------------------------------------

bq. Not a bug: DocumentDictionary etc suggests documents, not terms.

What you say implies that the field you use for the suggested terms must be 100% unique across the main document index. Suggesters are typically used to suggest e.g. authors, languages, categories... Indeed, the example in https://cwiki.apache.org/confluence/display/solr/Suggester does just this, suggesting categories using the price field as weight, and {{DocumentDictionary}}. But FuzzySuggester is not index-based, so it does not reveal this bug.
[jira] [Commented] (LUCENE-6336) AnalyzingInfixSuggester needs duplicate handling
[ https://issues.apache.org/jira/browse/LUCENE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346879#comment-14346879 ]

Robert Muir commented on LUCENE-6336:
-------------------------------------

Not a bug: DocumentDictionary etc suggests documents, not terms.