Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene

Michael Froh Tue, 18 Feb 2020 17:34:05 -0800

In your example, it looks like you wanted the second term to match based on
the first character, or prefix, of the term.


While you could use a WildcardQuery with a term value of "term2FirstChar*",
PrefixQuery seemed like the simpler approach. WildcardQuery can handle more
general cases, like if you want to match on something like "a*b*c".
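
To make the difference concrete, here is a minimal plain-Java sketch of the two matching rules. It is not Lucene code: the class and method names are hypothetical, escape characters are not handled, and it works on Java chars rather than on indexed term bytes. It only illustrates that a prefix match is the special case of a wildcard pattern with a single trailing "*".

```java
class TermPatterns {
    /** Prefix semantics: the term must start with the given prefix. */
    static boolean matchesPrefix(String prefix, String term) {
        return term.startsWith(prefix);
    }

    /**
     * Wildcard semantics: '*' matches any run of characters (including none),
     * '?' matches exactly one character. Uses greedy matching with backtracking
     * to the most recent '*'.
     */
    static boolean matchesWildcard(String pattern, String term) {
        int p = 0, t = 0;
        int starP = -1, starT = -1; // position of the last '*' and the term index it was tried at
        while (t < term.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                starP = p++;
                starT = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == term.charAt(t))) {
                p++;
                t++;
            } else if (starP >= 0) {
                // Let the last '*' absorb one more character and retry.
                p = starP + 1;
                t = ++starT;
            } else {
                return false;
            }
        }
        // Any trailing '*' can match the empty string.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matchesPrefix("a", "apple"));
        System.out.println(matchesWildcard("a*", "apple"));
        System.out.println(matchesWildcard("a*b*c", "axxbyyc"));
    }
}
```

A pattern such as "a*b*c" has no prefix-only equivalent, which is the case where WildcardQuery is the better fit.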

Technically, the PrefixQuery compiles down to a slightly simpler automaton,
but I only figured that out by writing a simple unit test:

    public void testAutomata() {
        Automaton prefixAutomaton = PrefixQuery.toAutomaton(new BytesRef("a"));
        Automaton wildcardAutomaton = WildcardQuery.toAutomaton(new Term("foo", "a*"));

        System.out.println("PrefixQuery(\"a\")");
        System.out.println(prefixAutomaton.toDot());
        System.out.println("WildcardQuery(\"a*\")");
        System.out.println(wildcardAutomaton.toDot());
    }

That produces the following output:

PrefixQuery("a")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 1 [label="\\U00000000-\\U000000ff"]
}
WildcardQuery("a*")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 2 [label="\\U00000000-\\U0010ffff"]
  2 [shape=doublecircle,label="2"]
  2 -> 2 [label="\\U00000000-\\U0010ffff"]
}
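
The two graphs above accept the same terms; the wildcard one just carries an extra accepting state. A rough plain-Java simulation of the printed transitions shows this. It assumes the 00-ff labels are UTF-8 bytes and the 0-10ffff labels are Unicode code points, which is an interpretation of the output rather than something stated in it, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class AutomatonSketch {
    // Each transition is {fromState, toState, minLabel, maxLabel}, copied from the dot output.
    static final int[][] PREFIX_TRANSITIONS = {
        {0, 1, 'a', 'a'},
        {1, 1, 0x00, 0xff},
    };
    static final boolean[] PREFIX_ACCEPT = {false, true};

    static final int[][] WILDCARD_TRANSITIONS = {
        {0, 1, 'a', 'a'},
        {1, 2, 0x00, 0x10ffff},
        {2, 2, 0x00, 0x10ffff},
    };
    static final boolean[] WILDCARD_ACCEPT = {false, true, true};

    /** Runs a deterministic automaton from state 0 over the given labels. */
    static boolean run(int[][] transitions, boolean[] accept, int[] labels) {
        int state = 0;
        for (int label : labels) {
            int next = -1;
            for (int[] t : transitions) {
                if (t[0] == state && label >= t[2] && label <= t[3]) {
                    next = t[1];
                    break;
                }
            }
            if (next < 0) {
                return false; // no matching transition: reject
            }
            state = next;
        }
        return accept[state];
    }

    /** Feeds the term to the prefix automaton as unsigned UTF-8 bytes (assumption). */
    static boolean prefixAccepts(String term) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[utf8.length];
        for (int i = 0; i < utf8.length; i++) {
            labels[i] = utf8[i] & 0xff;
        }
        return run(PREFIX_TRANSITIONS, PREFIX_ACCEPT, labels);
    }

    /** Feeds the term to the wildcard automaton as Unicode code points (assumption). */
    static boolean wildcardAccepts(String term) {
        return run(WILDCARD_TRANSITIONS, WILDCARD_ACCEPT, term.codePoints().toArray());
    }

    public static void main(String[] args) {
        for (String term : new String[] {"a", "apple", "banana", ""}) {
            System.out.println(term + " " + prefixAccepts(term) + " " + wildcardAccepts(term));
        }
    }
}
```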



On Tue, 18 Feb 2020 at 13:52, <baris.ka...@oracle.com> wrote:

> Michael and Forum,-
> Thanks for the great explanations.
>
> one question please:
>
> why is PrefixQuery used instead of WildCardQuery in the below snippet?
>
> Best regards
>
> > On Feb 17, 2020, at 3:01 PM, Michael Froh <msf...@gmail.com> wrote:
> >
> > Hi Baris,
> >
> > The idea with PhraseWildcardQuery is that you can mix literal "exact" terms
> > with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using addTerm is
> > for exact terms, while addMultiTerm is for things that may match a number
> > of possible terms in the given position.
> >
> > If you want to search for term1 followed by any term that starts with a
> > given character, I would suggest using:
> >
> > int maxMultiTermExpansions = ...; // Discussed below
> > PhraseWildcardQuery.Builder builder =
> >     new PhraseWildcardQuery.Builder("field", maxMultiTermExpansions);
> > builder.addTerm(new BytesRef("term1")); // Add fixed term in position 0
> > // Add multiterm in position 1
> > builder.addMultiTerm(new PrefixQuery(new Term("field", "term2FirstChar")));
> > Query q = builder.build();
> >
> > The PrefixQuery effectively gets expanded into a bunch of possible terms,
> > based on the term dictionary on each index segment. To avoid expanding to
> > cover too many terms (say, if you added a bunch of WildcardQuery),
> > maxMultiTermExpansions serves as a guard rail, to put a rough bound on
> > memory consumption and query execution time. If you're interested in
> > details of how the maxMultiTermExpansions budget is distributed across
> > MultiTerms, check out PhraseWildcardQuery.createWeight. If you're just
> > running an experiment in your IDE, you could probably set
> > maxMultiTermExpansions to Integer.MAX_VALUE. (If you're running in a
> > production environment, it's likely a good idea to tune it down based on
> > your memory/latency constraints.)
> >
> > Incidentally, for tracking down the source code for anything in Lucene,
> > it's probably better to go to GitHub for the most up-to-date source:
> >
> > https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
> >
> > Hope that helps,
> > Michael
> >
> >> On Thu, 13 Feb 2020 at 12:29, <baris.ka...@oracle.com> wrote:
> >>
> >> Hi,-
> >>
> >> i hope everyone is doing great.
> >>
> >>  if i want to do the following search with PhraseWildCardQuery and
> >> thanks to this forum for letting me know about this class (Especially to
> >> David and Bruno)
> >>
> >> term1 term2FirstChar*
> >>
> >> i need to do two ways: (i found the source code at
> >>
> >>
> >> https://fossies.org/linux/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
> >> )
> >>
> >> /*
> >>
> >> maxMultiTermExpansions - The maximum number of expansions across all
> >> multi-terms and across all segments. It counts expansions for each
> >> segments individually, that allows optimizations per segment and unused
> >> expansions are credited to next segments. This is different from
> >> MultiPhraseQuery and SpanMultiTermQueryWrapper which have an expansion
> >> limit per multi-term.
> >>
> >> segmentOptimizationEnabled - Whether to enable the segment optimization
> >> which consists in ignoring a segment for further analysis as soon as a
> >> term is not present inside it. This optimizes the query execution
> >> performance but changes the scoring. The result ranking is preserved.
> >>
> >> */
> >>
> >>
> >> 1st way:
> >>
> >> PhraseWildcardQuery.Builder pwcqBuilder = new PhraseWildcardQuery.Builder(field,
> >>     2 /* i don't know what number to use here for maxMultiTermExpansions */,
> >>     true /* boolean segmentOptimizationEnabled */);
> >>
> >> pwcqBuilder.addTerm(field, new Term(field, "term1"));
> >>
> >> pwcqBuilder.addTerm(field,new Term(field, "term2FirstChar"));
> >>
> >> PhraseWildCardQuery pwcq = pwcqBuilder.build();
> >>
> >> or
> >>
> >> 2nd way:
> >>
> >> pwcqBuilder.addMultiTerm(MultiTermQuery object here containing {field,
> >> "term1"} and {field ,"term2FirstChar"});
> >>
> >> PhraseWildCardQuery pwcq = pwcqBuilder.build();
> >>
> >>
> >> Then this pwcq object will be fed into IndexSearcher's as the query
> >> parameter.
> >>
> >>
> >> Now, it looks like the first way will not consider expansions or in
> >> other words wildcard? Am i right?
> >>
> >> i also need to understand this maxMultiTermExpansions parameter better.
> >> For instance if first way is used, will maxMultiTermExpansions be
> >> meaningful?
> >>
> >>
> >> Thanks
> >>
> >>
>
>
>
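
As a footnote to the maxMultiTermExpansions discussion quoted above: the guard-rail idea can be sketched in plain Java as walking a sorted term dictionary and stopping once a budget is used up. This is only an illustration with hypothetical names. It models a single segment and a single multi-term, and does not reproduce how PhraseWildcardQuery.createWeight actually shares the budget across multi-terms and segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class ExpansionBudgetSketch {
    /**
     * Collects the terms that start with the prefix from a sorted dictionary,
     * stopping as soon as maxExpansions terms have been gathered.
     */
    static List<String> expandPrefix(SortedSet<String> termDictionary,
                                     String prefix,
                                     int maxExpansions) {
        List<String> expansions = new ArrayList<>();
        // In sorted order, all terms sharing the prefix sit in one contiguous
        // run starting at the first term >= prefix.
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix) || expansions.size() >= maxExpansions) {
                break;
            }
            expansions.add(term);
        }
        return expansions;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary = new TreeSet<>(
            Arrays.asList("apple", "term1", "term2a", "term2b", "term2c", "term3"));
        System.out.println(expandPrefix(dictionary, "term2", 2));
        System.out.println(expandPrefix(dictionary, "term2", Integer.MAX_VALUE));
    }
}
```

With a small budget the expansion is cut short, which bounds memory and query time at the cost of possibly missing matches; Integer.MAX_VALUE removes the bound.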
