[ https://issues.apache.org/jira/browse/LUCENE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373058#comment-16373058 ]
Steve Rowe commented on LUCENE-8137:
------------------------------------

A test showing the problem: MockSynonymFilter has the synonym "cavy" for "guinea pig", and the anonymous analyzer below has "pig" on its stoplist. QueryBuilder produces a query for only "cavy", even though the token stream also contains "guinea":

{code:java|title=TestQueryBuilder.java}
  public void testGraphStop() {
    // Expected: "guinea" and "cavy" as alternatives at the first position.
    Query syn1 = new TermQuery(new Term("field", "guinea"));
    Query syn2 = new TermQuery(new Term("field", "cavy"));

    BooleanQuery synQuery = new BooleanQuery.Builder()
        .add(syn1, BooleanClause.Occur.SHOULD)
        .add(syn2, BooleanClause.Occur.SHOULD)
        .build();
    BooleanQuery expectedGraphQuery = new BooleanQuery.Builder()
        .add(synQuery, BooleanClause.Occur.SHOULD)
        .build();

    // The stop filter runs after the synonym filter, so removing "pig"
    // leaves a gap inside the multi-word synonym "guinea pig".
    QueryBuilder queryBuilder = new QueryBuilder(new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        MockTokenizer tokenizer = new MockTokenizer();
        TokenStream stream = new MockSynonymFilter(tokenizer);
        stream = new StopFilter(stream, CharArraySet.copy(Collections.singleton("pig")));
        return new TokenStreamComponents(tokenizer, stream);
      }
    });
    queryBuilder.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
    assertEquals(expectedGraphQuery,
        queryBuilder.createBooleanQuery("field", "guinea pig", BooleanClause.Occur.SHOULD));
  }
{code}

> GraphTokenStreamFiniteStrings does not handle position inc > 1 in multi-word synonyms
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8137
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8137
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: master (8.0), 7.2.1
>            Reporter: Jim Ferenczi
>            Assignee: Jim Ferenczi
>            Priority: Major
>
> The automaton built for graph queries that contain multiple multi-word
> synonyms does not handle gaps if they appear in the middle of a multi-word
> synonym. In such a case, the token next to the gap is considered part of the
> multi-word synonym.
> Stop words that appear before or after multi-word synonyms are handled
> correctly in the current version but the synonym rule "part of speech, pos"
> for instance does not create the expected query if "of" is removed by a
> filter that is set after the synonym_graph. One solution would be to reuse
> TokenStreamToAutomaton (with minor changes to add the ability to create token
> transitions rather than chars) which preserves gaps (as a transition) in the
> produced automaton.
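
The idea of keeping a gap as its own transition can be illustrated with a small standalone sketch. It uses no Lucene classes: Tok, Edge, buildEdges, finiteStrings and the "_" hole label are hypothetical names made up for illustration, and the position increments and lengths in main are assumptions about what the two token streams look like after the stop filter. It is not the code of GraphTokenStreamFiniteStrings or TokenStreamToAutomaton and may not reproduce the exact failure described above; it only shows that the path through a removed token is lost unless the gap gets a transition.

```java
import java.util.ArrayList;
import java.util.List;

public class GapGraphSketch {
    /** Hypothetical stand-in for a token: term, position increment, position length. */
    record Tok(String term, int posInc, int posLen) {}

    /** One transition of the token graph, from one position to another. */
    record Edge(int from, int to, String label) {}

    /** Placeholder label for a position whose token was removed (assumption). */
    static final String HOLE = "_";

    /** Turns tokens into graph edges, optionally adding a HOLE edge over each gap. */
    static List<Edge> buildEdges(List<Tok> tokens, boolean preserveGaps) {
        List<Edge> edges = new ArrayList<>();
        int pos = -1;
        int end = 0;
        for (Tok t : tokens) {
            pos += t.posInc();                       // honor increments > 1
            edges.add(new Edge(pos, pos + t.posLen(), t.term()));
            end = Math.max(end, pos + t.posLen());
        }
        if (preserveGaps) {
            for (int p = 0; p < end; p++) {
                final int from = p;
                boolean hasOutgoing = edges.stream().anyMatch(e -> e.from() == from);
                if (!hasOutgoing) {
                    edges.add(new Edge(p, p + 1, HOLE)); // bridge the gap
                }
            }
        }
        return edges;
    }

    /** Enumerates every path from position 0 to the last position. */
    static List<List<String>> finiteStrings(List<Edge> edges) {
        int end = edges.stream().mapToInt(Edge::to).max().orElse(0);
        List<List<String>> out = new ArrayList<>();
        walk(edges, 0, end, new ArrayList<>(), out);
        return out;
    }

    private static void walk(List<Edge> edges, int state, int end,
                             List<String> path, List<List<String>> out) {
        if (state == end) {
            out.add(new ArrayList<>(path));
            return;
        }
        for (Edge e : edges) {
            if (e.from() == state) {
                path.add(e.label());
                walk(edges, e.to(), end, path, out);
                path.remove(path.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        // Assumed stream for "guinea pig" once "pig" has been removed.
        List<Tok> guinea = List.of(new Tok("guinea", 1, 1), new Tok("cavy", 0, 2));
        System.out.println(finiteStrings(buildEdges(guinea, false))); // [[cavy]]
        System.out.println(finiteStrings(buildEdges(guinea, true)));  // [[guinea, _], [cavy]]

        // Assumed stream for "part of speech" once "of" has been removed.
        List<Tok> pos = List.of(new Tok("part", 1, 1), new Tok("pos", 0, 3),
                                new Tok("speech", 2, 1));
        System.out.println(finiteStrings(buildEdges(pos, false)));    // [[pos]]
        System.out.println(finiteStrings(buildEdges(pos, true)));     // [[part, _, speech], [pos]]
    }
}
```

In this model, the "guinea" and "part" paths dead-end at the gap when no hole edge is added, which mirrors the symptom in the test above (only "cavy" survives). How a query builder should treat the "_" placeholder is left open here.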