Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene

baris . kazar Tue, 18 Feb 2020 17:50:01 -0800

Michael and Forum,-
This is amazing, thanks.

I will try both cases.

I can also have "term1 term2Char1term2Char2*",
and so on with the next characters of term2.

I hope the latest version of this class on GitHub
works with Lucene version 7.7.2.

Best regards

> On Feb 18, 2020, at 8:33 PM, Michael Froh <[email protected]> wrote:
> 
> 
> In your example, it looks like you wanted the second term to match based on 
> the first character, or prefix, of the term. 
> 
> While you could use a WildcardQuery with a term value of "term2FirstChar*", 
> PrefixQuery seemed like the simpler approach. WildcardQuery can handle more 
> general cases, like if you want to match on something like "a*b*c".
> 
> Technically, the PrefixQuery compiles down to a slightly simpler automaton, 
> but I only figured that out by writing a simple unit test:
> 
>     public void testAutomata() {
>         Automaton prefixAutomaton = PrefixQuery.toAutomaton(new BytesRef("a"));
>         Automaton wildcardAutomaton = WildcardQuery.toAutomaton(new Term("foo", "a*"));
> 
>         System.out.println("PrefixQuery(\"a\")");
>         System.out.println(prefixAutomaton.toDot());
>         System.out.println("WildcardQuery(\"a*\")");
>         System.out.println(wildcardAutomaton.toDot());
>     }
> 
> That produces the following output:
> 
> PrefixQuery("a")
> digraph Automaton {
>   rankdir = LR
>   node [width=0.2, height=0.2, fontsize=8]
>   initial [shape=plaintext,label=""]
>   initial -> 0
>   0 [shape=circle,label="0"]
>   0 -> 1 [label="a"]
>   1 [shape=doublecircle,label="1"]
>   1 -> 1 [label="\\U00000000-\\U000000ff"]
> }
> WildcardQuery("a*")
> digraph Automaton {
>   rankdir = LR
>   node [width=0.2, height=0.2, fontsize=8]
>   initial [shape=plaintext,label=""]
>   initial -> 0
>   0 [shape=circle,label="0"]
>   0 -> 1 [label="a"]
>   1 [shape=doublecircle,label="1"]
>   1 -> 2 [label="\\U00000000-\\U0010ffff"]
>   2 [shape=doublecircle,label="2"]
>   2 -> 2 [label="\\U00000000-\\U0010ffff"]
> }
> 
> 
> 
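
The relationship between the two query types described above can be illustrated without Lucene at all. The following is a minimal sketch that uses plain java.util.regex rather than Lucene's automaton classes; the class and method names (WildcardSketch, wildcardMatches, prefixMatches) are hypothetical and exist only for this illustration. It suggests that the wildcard pattern "a*" accepts the same terms as the prefix "a", while a pattern such as "a*b*c" cannot be expressed as a prefix.

```java
import java.util.regex.Pattern;

// Conceptual sketch only: java.util.regex stands in for Lucene's automata.
final class WildcardSketch {

    // Translate a wildcard pattern into a regex:
    // '*' matches any run of characters (including none), '?' matches exactly one.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // True if the whole term matches the wildcard pattern.
    static boolean wildcardMatches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    // True if the term starts with the given prefix.
    static boolean prefixMatches(String prefix, String term) {
        return term.startsWith(prefix);
    }
}
```

For any term, wildcardMatches("a*", term) and prefixMatches("a", term) should agree, which is why the simpler prefix form is enough when only the start of the term is fixed.
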
>> On Tue, 18 Feb 2020 at 13:52, <[email protected]> wrote:
>> Michael and Forum,-
>> Thanks for the great explanations.
>> 
>> One question, please:
>> 
>> Why is PrefixQuery used instead of WildcardQuery in the snippet below?
>> 
>> Best regards
>> 
>> > On Feb 17, 2020, at 3:01 PM, Michael Froh <[email protected]> wrote:
>> > 
>> > Hi Baris,
>> > 
>> > The idea with PhraseWildcardQuery is that you can mix literal "exact" terms
>> > with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using addTerm is
>> > for exact terms, while addMultiTerm is for things that may match a number
>> > of possible terms in the given position.
>> > 
>> > If you want to search for term1 followed by any term that starts with a
>> > given character, I would suggest using:
>> > 
>> > int maxMultiTermExpansions = ...; // Discussed below
>> > PhraseWildcardQuery.Builder builder =
>> >     new PhraseWildcardQuery.Builder("field", maxMultiTermExpansions);
>> > builder.addTerm(new BytesRef("term1")); // Add fixed term in position 0
>> > // Add multi-term in position 1
>> > builder.addMultiTerm(new PrefixQuery(new Term("field", "term2FirstChar")));
>> > Query q = builder.build();
>> > 
>> > The PrefixQuery effectively gets expanded into a bunch of possible terms,
>> > based on the term dictionary on each index segment. To avoid expanding to
>> > cover too many terms (say, if you added a bunch of WildcardQuery),
>> > maxMultiTermExpansions serves as a guard rail, to put a rough bound on
>> > memory consumption and query execution time. If you're interested in
>> > details of how the maxMultiTermExpansions budget is distributed across
>> > MultiTerms, check out PhraseWildcardQuery.createWeight. If you're just
>> > running an experiment in your IDE, you could probably set
>> > maxMultiTermExpansions to Integer.MAX_VALUE. (If you're running in a
>> > production environment, it's likely a good idea to tune it down based on
>> > your memory/latency constraints.)
>> > 
>> > Incidentally, for tracking down the source code for anything in Lucene,
>> > it's probably better to go to GitHub for the most up-to-date source:
>> > https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
>> > 
>> > Hope that helps,
>> > Michael
>> > 
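
To make the expansion idea above concrete, here is a minimal sketch that uses only the Java standard library. It is a simplification under stated assumptions, not Lucene's implementation: each segment's term dictionary is modelled as a SortedSet of strings, the budget is a single counter shared across segments, and expansion simply stops when the budget is used up. The class and method names (PrefixExpansionSketch, expand) are hypothetical. How PhraseWildcardQuery.createWeight really distributes the budget may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Conceptual sketch only: models prefix expansion with a shared expansion budget.
final class PrefixExpansionSketch {

    // Collect the terms that start with the prefix, segment by segment,
    // stopping once maxExpansions terms have been collected in total.
    static List<String> expand(List<SortedSet<String>> segments, String prefix, int maxExpansions) {
        List<String> expanded = new ArrayList<>();
        int remaining = maxExpansions;
        for (SortedSet<String> termDictionary : segments) {
            // tailSet(prefix) starts at the first term >= prefix; in sorted order
            // all terms sharing the prefix are contiguous from that point.
            for (String term : termDictionary.tailSet(prefix)) {
                if (!term.startsWith(prefix)) {
                    break; // Past the last term with this prefix in this segment.
                }
                if (remaining == 0) {
                    return expanded; // Budget exhausted (assumed behaviour: stop).
                }
                expanded.add(term);
                remaining--;
            }
        }
        return expanded;
    }
}
```

With a small budget the result is cut short, which is the guard-rail effect: the bound limits how many candidate terms are kept in memory and evaluated, at the cost of possibly missing matches.
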
>> >> On Thu, 13 Feb 2020 at 12:29, <[email protected]> wrote:
>> >> 
>> >> Hi,-
>> >> 
>> >> I hope everyone is doing great.
>> >> 
>> >> I want to do the following search with PhraseWildcardQuery (thanks to
>> >> this forum for letting me know about this class, especially to David
>> >> and Bruno):
>> >> 
>> >> term1 term2FirstChar*
>> >> 
>> >> I see two ways to do it. (I found the source code at
>> >> https://fossies.org/linux/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java)
>> >> 
>> >> /*
>> >> 
>> >> maxMultiTermExpansions - The maximum number of expansions across all
>> >> multi-terms and across all segments. It counts expansions for each
>> >> segments individually, that allows optimizations per segment and unused
>> >> expansions are credited to next segments. This is different from
>> >> MultiPhraseQuery and SpanMultiTermQueryWrapper which have an expansion
>> >> limit per multi-term.
>> >> 
>> >> segmentOptimizationEnabled - Whether to enable the segment optimization
>> >> which consists in ignoring a segment for further analysis as soon as a
>> >> term is not present inside it. This optimizes the query execution
>> >> performance but changes the scoring. The result ranking is preserved.
>> >> 
>> >> */
>> >> 
>> >> 
>> >> 1st way:
>> >> 
>> >> PhraseWildcardQuery.Builder pwcqBuilder = new PhraseWildcardQuery.Builder(field,
>> >> 2 /* I don't know what number to use here for maxMultiTermExpansions */,
>> >> true /* boolean segmentOptimizationEnabled */);
>> >> 
>> >> pwcqBuilder.addTerm(new Term(field, "term1"));
>> >> 
>> >> pwcqBuilder.addTerm(new Term(field, "term2FirstChar"));
>> >> 
>> >> PhraseWildcardQuery pwcq = pwcqBuilder.build();
>> >> 
>> >> or
>> >> 
>> >> 2nd way:
>> >> 
>> >> pwcqBuilder.addMultiTerm(MultiTermQuery object here containing {field,
>> >> "term1"} and {field, "term2FirstChar"});
>> >> 
>> >> PhraseWildcardQuery pwcq = pwcqBuilder.build();
>> >> 
>> >> 
>> >> This pwcq object will then be passed to IndexSearcher as the query
>> >> parameter.
>> >> 
>> >> 
>> >> It looks like the first way will not consider expansions, in other
>> >> words the wildcard. Am I right?
>> >> 
>> >> I also need to understand the maxMultiTermExpansions parameter better.
>> >> For instance, if the first way is used, is maxMultiTermExpansions
>> >> meaningful at all?
>> >> 
>> >> 
>> >> Thanks
>> >> 
>> >> 
>> 
>> 
