Re: How does auto-generating phrases work?

Mikhail Khludnev Sat, 07 Mar 2026 13:18:10 -0800

Hi Kai,
Please remind me whether you use SynonymGraphFilter or SynonymFilter, and
which Lucene version you are on.
Just a quick answer: if the parser sees just two words, it's reasonable for
it to yield a boolean query.
With useOrig=false these two words are replaced by other ones, so
presumably it's also fine to yield just a boolean query.
I think that happens somewhere around
https://github.com/apache/lucene/blob/c47ccd83da7692d2e7fa207eaca14975a614065f/lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java#L555



However, with useOrig=true these two words should be extended with a pair
of others overlapping them, so it might make sense to use phrases to
prohibit permutations between the pairs. That probably happens here:
https://github.com/apache/lucene/blob/c47ccd83da7692d2e7fa207eaca14975a614065f/lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java#L539
I'm definitely missing something.

Regarding the summed score from several matching phrases, I suppose there's
no option other than implementing a custom parser that produces a
DisjunctionMaxQuery (dismax).
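
To illustrate why a dismax-style query (DisjunctionMaxQuery in Lucene) would
help with the skewed scores, here is a minimal sketch in plain Java with no
Lucene involved. It assumes a boolean query sums the scores of its matching
SHOULD clauses, while dismax takes the best clause score plus a tie-breaker
times the remaining ones. The class name, method names and score values are
made up for illustration only.

```java
import java.util.List;

// Hypothetical sketch: compares how two ways of combining per-phrase scores
// rank a document matching one phrase against a document matching both.
final class PhraseScoreSketch {

    // Boolean query with SHOULD clauses: scores of matching clauses add up.
    static double booleanShouldScore(List<Double> clauseScores) {
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += s;
        }
        return sum;
    }

    // Dismax: best clause score plus tieBreaker times the other clause scores.
    static double disMaxScore(List<Double> clauseScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Made-up clause scores: docA matches only "canonical phrase",
        // docB matches both "canonical phrase" and "alias phrase".
        List<Double> docA = List.of(2.0);
        List<Double> docB = List.of(2.0, 1.5);

        System.out.println("boolean: A=" + booleanShouldScore(docA)
                + " B=" + booleanShouldScore(docB));   // boolean: A=2.0 B=3.5
        System.out.println("dismax:  A=" + disMaxScore(docA, 0.0)
                + " B=" + disMaxScore(docB, 0.0));     // dismax:  A=2.0 B=2.0
    }
}
```

With a tie-breaker of 0 the document containing both phrases no longer
outranks the one containing only the canonical phrase. A small non-zero
tie-breaker would give it a slight edge instead of the full sum.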

On Fri, Mar 6, 2026 at 6:05 PM Kai Grossjohann <[email protected]>
wrote:

> Cycling back on this one...  I'm in a bit of a bind now.
>
> Using a SynonymMap with useOrig=true, the phrase recognition works:
>
> CharsRef canonical = createCharsRef("canonical phrase");
> CharsRef alias = createCharsRef("alias phrase");
> builder.add(canonical, canonical, true);
> builder.add(alias, canonical, true);
>
> However, if I parse the string "alias phrase", then I get as query:
> foo:"canonical phrase" foo:"alias phrase"
>
> This results in skewed scores, as another document that contains both of
> them scores higher.  The score is better with useOrig=false (third
> parameter of builder.add), but then phrase recognition no longer works:
> The string "alias phrase" now results in the query: foo:"canonical"
> foo:"phrase"
>
> It feels to me that this is a bug, and phrase recognition should also
> work with useOrig=false.
>
> What do people think?
>
> Thanks,
> Kai
>
> On 2025-11-26 14:43, Kai Grossjohann wrote:
> >
> > Thank you Mikhail, very interesting.  It has taken me a long time to
> > reply because I got other priorities...
> >
> > With “enable position increments” it works much better.  “Split on
> > whitespace” has to be false (as you say) and “auto-generate phrase
> > queries” also has to be false.  But interestingly enough,
> > “auto-generate multi-term synonyms phrase query” can be true, and
> > setting it to true helps.
> >
> > This is now good enough for my actual application code.  I do still
> > see some oddities.  One of them is hopefully more cosmetic, and the
> > other can be worked around.
> >
> > I will work around the following behavior:
> >
> >   * If a phrase appears as the /output/, but not as the /input/, of a
> >     SynonymMap entry, then it is /not/ automatically recognized.
> >   * A phrase that appears as the input of a SynonymMap entry is
> >     automatically recognized.
> >
> > “My” synonyms are structured in such a way that there is a canonical
> > term and multiple possible alias terms.  My understanding was that I
> > should have one SynonymMap entry per alias term, each of them
> > specifying the alias term as input and the canonical term as output.
> > I will work around the problem by adding another SynonymMap entry,
> > specifying the canonical term as both input and output.
> >
> >   * If I map a phrase to itself (i.e. both input and output) then it's
> >     doubled in the resulting query.
> >
> > The workaround above means that the canonical terms are doubled in the
> > query, but I'm just going to live with that.  I hope it doesn't skew
> > the weights too bad.
> >
> > Kai
> >
> >
> > On 2025-11-03 21:38, Mikhail Khludnev wrote:
> >> Hello Kai
> >>
> >> Pardon the vibe coding, but this sample
> >>
> >> https://github.com/mkhludnev/mutlyword-phrase-query-test/blob/3e3f1cce6b2b6790970e4a042ddb2967e49d0077/src/test/java/org/example/phrases/MultiWordTests.java#L88
> >>
> >> parses the plain two-word input "power grid", without quotes, as a
> >> boolean SHOULD of phrases:
> >>
> >> org.example.phrases.MultiWordTests#testPhraseQueryGeneratedFromPlainMultiWordSynonym
> >>
> >> Parsed Query for 'power grid': ("electrical grid" "power grid")
> >> Does it look closer to what you are looking for?
> >>
> >>
> >> On Mon, Nov 3, 2025 at 1:50 PM Kai Grossjohann
> >> <[email protected]> wrote:
> >>
> >>     Hi Mikhail,
> >>
> >>     I tried to change this to false, and this was the result:
> >>
> >>     java.lang.IllegalArgumentException:
> >>     setAutoGeneratePhraseQueries(true) is disallowed when
> >>     getSplitOnWhitespace() == false
> >>
> >>     I experimented with other combinations of setSplitOnWhitespace,
> >>     setAutoGeneratePhraseQueries, and
> >>     setAutoGenerateMultiTermSynonymsPhraseQuery.  None of them got me
> >>     the phrase queries I'm looking for.  Though some of them searched
> >>     for more synonyms.
> >>
> >>     In particular, false/false/true resulted in “synonym alias” being
> >>     parsed as Synonym(foo:canonical foo:synonym) Synonym(foo:alias
> >>     foo:phrase) which still doesn't produce the foo:"canonical
> >>     phrase"~1 that I was looking for.
> >>
> >>     Kai
> >>
> >>     On 2025-10-30 18:01, Mikhail Khludnev wrote:
> >>>     Hello Kai
> >>>
> >>>     Briefly skimming through the letter
> >>>
> >>>               queryParser.setSplitOnWhitespace(true); // shouldn't false be here?
> >>>               queryParser.setAutoGeneratePhraseQueries(true);
> >>>               queryParser.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
> >>>               queryParser.setPhraseSlop(1);
> >>>
> >>>               Query q = queryParser.parse("canonical phrase");
> >>>               assertEquals("foo:canonical foo:phrase", q.toString(),
> >>>                       "I was expecting a phrase query here: foo:\"canonical phrase\"~1");
> >>>
> >>>
> >>>
> >>>     On Thu, Oct 30, 2025 at 4:49 PM Kai Grossjohann
> >>>     <[email protected]> wrote:
> >>>
> >>>>     I thought if I have a synonym map that says “synonym alias” is
> >>>>     an alias for “canonical phrase”, and I noodle “canonical phrase”
> >>>>     through the query parser, telling it to auto generate multi term
> >>>>     queries, I'd get a multi term query.  But that doesn't seem to
> >>>>     be the case.
> >>>>
> >>>>     The only way to generate multi term queries seems to be when the
> >>>>     synonym says that “shortsyn” is an alias for “another phrase”,
> >>>>     and then noodle “shortsyn” through the query parser.  Then I get
> >>>>     foo:"another phrase"~1 which is what I expected.
> >>>>
> >>>>     My use case is as follows: I have some multi-word strings, and I
> >>>>     need to create queries from them.  And if one of the synonym
> >>>>     phrases appears in the multi-word string, then I would like to
> >>>>     generate a phrase query for that part.  For example, given the
> >>>>     synonyms mentioned above, if the multi-word string is, say, “my
> >>>>     synonym alias is nice”, then I'd like to generate a query that
> >>>>     searches for the word “my”, the phrase “canonical phrase”, and
> >>>>     the words “is” and “nice”.  Maybe I would like to /also/ search
> >>>>     for the words “synonym” and “alias”, or the words “canonical”
> >>>>     and “phrase”, or all four of them, I'm not sure.
> >>>>
> >>>>     This description left out quite a bit of information, I'll paste
> >>>>     some code below to clarify.
> >>>>
> >>>>     Kai
> >>>>
> >>>>     /**
> >>>>      * This tests the behavior of the Lucene query
> >>>>      * builder with synonyms
> >>>>      */
> >>>>     public class SynonymGraphQueryBuilderTest {
> >>>>
> >>>>         private static class MyAnalyzer extends Analyzer {
> >>>>             private final CharArraySet stopwords;
> >>>>             private final SynonymMap synonyms;
> >>>>
> >>>>             public MyAnalyzer(Set<String> stopwords, SynonymMap synonyms) {
> >>>>                 this.stopwords = new CharArraySet(stopwords, true);
> >>>>                 this.synonyms = synonyms;
> >>>>             }
> >>>>
> >>>>             @Override
> >>>>             protected TokenStreamComponents createComponents(String fieldName) {
> >>>>                 final Tokenizer src = new SimplePatternTokenizer("[a-z0-9]+");
> >>>>                 TokenStream tok = new LowerCaseFilter(src);
> >>>>                 tok = new SynonymGraphFilter(tok, synonyms, true);
> >>>>                 tok = new FlattenGraphFilter(tok);
> >>>>                 tok = new StopFilter(tok, stopwords);
> >>>>                 return new TokenStreamComponents(
> >>>>                         src::setReader,
> >>>>                         tok);
> >>>>             }
> >>>>         }
> >>>>
> >>>>         @Test
> >>>>         void testSynonymPhrases() throws Exception {
> >>>>             Builder builder = new Builder();
> >>>>
> >>>>             // canonical phrase <- synonym alias
> >>>>             CharsRef canonical = Builder.join(
> >>>>                     new String[] { "canonical", "phrase" }, new CharsRefBuilder());
> >>>>             CharsRef synonym = Builder.join(
> >>>>                     new String[] { "synonym", "alias" }, new CharsRefBuilder());
> >>>>             builder.add(synonym, canonical, true);
> >>>>
> >>>>             // another phrase <- shortsyn
> >>>>             canonical = Builder.join(
> >>>>                     new String[] { "another", "phrase" }, new CharsRefBuilder());
> >>>>             synonym = Builder.join(
> >>>>                     new String[] { "shortsyn" }, new CharsRefBuilder());
> >>>>             builder.add(synonym, canonical, true);
> >>>>
> >>>>             SynonymMap synonyms = builder.build();
> >>>>
> >>>>             Set<String> stopwords = Set.of("the");
> >>>>
> >>>>             MyAnalyzer analyzer = new MyAnalyzer(stopwords, synonyms);
> >>>>
> >>>>             QueryParser queryParser = new QueryParser("foo", analyzer);
> >>>>             queryParser.setSplitOnWhitespace(true);
> >>>>             queryParser.setAutoGeneratePhraseQueries(true);
> >>>>             queryParser.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
> >>>>             queryParser.setPhraseSlop(1);
> >>>>
> >>>>             Query q = queryParser.parse("canonical phrase");
> >>>>             assertEquals("foo:canonical foo:phrase", q.toString(),
> >>>>                     "I was expecting a phrase query here: foo:\"canonical phrase\"~1");
> >>>>
> >>>>             q = queryParser.parse("synonym alias");
> >>>>             assertEquals("foo:synonym foo:alias", q.toString(),
> >>>>                     "I was expecting a phrase query here: foo:\"canonical phrase\"~1");
> >>>>
> >>>>             q = queryParser.parse("shortsyn");
> >>>>             assertEquals("foo:\"another phrase\"~1 foo:shortsyn", q.toString(),
> >>>>                     "This is what I expected.");
> >>>>
> >>>>             q = queryParser.parse("another phrase");
> >>>>             assertEquals("foo:another foo:phrase", q.toString(),
> >>>>                     "I was expecting a phrase query here: foo:\"another phrase\"~1");
> >>>>         }
> >>>>     }
> >>>>
> >>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev



-- 
Sincerely yours
Mikhail Khludnev
