Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene

baris . kazar Wed, 19 Feb 2020 14:41:30 -0800

Hi,-

Is there a JAR file for the classes in https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search and in the index and analysis directories?

https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search does not have the PhraseWildcardQuery class, though.


As Michael mentioned, I pulled it from

https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java

It seems that many classes in these directories are incompatible with Lucene version 7.7.2. They are probably from the Lucene 8.x series.

It would be very nice to have a JAR file so that all these classes can be used together with Lucene 7.x versions.



Best regards


On 2/19/20 3:42 PM, [email protected] wrote:

Hi,-

Thanks again to Michael, David and Bruno, and to the forum, for letting me know about this repository.

The version of PhraseWildcardQuery at https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java uses some classes that are not available in Lucene version 7.7.2.

A number of new and modified classes are used by the PhraseWildcardQuery class, such as QueryVisitor, ScoreMode, etc.

I will try to add these classes from https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene

and I hope it will work with Lucene 7.7.2.

Best regards



On 2/18/20 8:49 PM, [email protected] wrote:

Michael and Forum,-
This is amazing, thanks.

I will try both cases.

I can also have "term1 term2Char1term2Char2*"
and so on with term2's next characters.

I hope the latest version of this class on GitHub
works with Lucene version 7.7.2.

Best regards

On Feb 18, 2020, at 8:33 PM, Michael Froh <[email protected]> wrote:

In your example, it looks like you wanted the second term to match based on the first character, or prefix, of the term.

While you could use a WildcardQuery with a term value of "term2FirstChar*", PrefixQuery seemed like the simpler approach. WildcardQuery can handle more general cases, like if you want to match on something like "a*b*c".

Technically, the PrefixQuery compiles down to a slightly simpler automaton, but I only figured that out by writing a simple unit test:


    public void testAutomata() {
        Automaton prefixAutomaton = PrefixQuery.toAutomaton(new BytesRef("a"));
        Automaton wildcardAutomaton = WildcardQuery.toAutomaton(new Term("foo", "a*"));

        System.out.println("PrefixQuery(\"a\")");
        System.out.println(prefixAutomaton.toDot());
        System.out.println("WildcardQuery(\"a*\")");
        System.out.println(wildcardAutomaton.toDot());
    }

That produces the following output:

PrefixQuery("a")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 1 [label="\\U00000000-\\U000000ff"]
}
WildcardQuery("a*")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 2 [label="\\U00000000-\\U0010ffff"]
  2 [shape=doublecircle,label="2"]
  2 -> 2 [label="\\U00000000-\\U0010ffff"]
}
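
To make the comparison between a prefix match and a wildcard match concrete without depending on Lucene, here is a minimal plain-Java sketch. The class `GlobSketch` and its methods are hypothetical names used only for illustration; they are not Lucene APIs, and the matcher is a naive recursive one rather than an automaton. It assumes that `*` matches any run of characters (possibly empty) and `?` matches exactly one character.

```java
public class GlobSketch {
    // Prefix check: conceptually what a prefix query on "a" accepts.
    public static boolean matchesPrefix(String term, String prefix) {
        return term.startsWith(prefix);
    }

    // Naive wildcard match: '*' = any run of characters (possibly empty),
    // '?' = exactly one character. Illustration only, not an automaton.
    public static boolean matchesWildcard(String term, String pattern) {
        return match(term, 0, pattern, 0);
    }

    private static boolean match(String t, int ti, String p, int pi) {
        if (pi == p.length()) {
            return ti == t.length();
        }
        char c = p.charAt(pi);
        if (c == '*') {
            // Let '*' absorb 0..n of the remaining characters.
            for (int k = ti; k <= t.length(); k++) {
                if (match(t, k, p, pi + 1)) {
                    return true;
                }
            }
            return false;
        }
        if (ti == t.length()) {
            return false;
        }
        if (c == '?' || c == t.charAt(ti)) {
            return match(t, ti + 1, p, pi + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "apple", "abc", "axbyc", "banana"};
        for (String term : terms) {
            System.out.println(term
                + " prefix(a)=" + matchesPrefix(term, "a")
                + " wildcard(a*)=" + matchesWildcard(term, "a*")
                + " wildcard(a*b*c)=" + matchesWildcard(term, "a*b*c"));
        }
    }
}
```

For every term, the prefix check on "a" and the wildcard pattern "a*" agree, which is why the prefix form is enough for the "term2FirstChar*" case. Only a pattern such as "a*b*c" needs the more general wildcard matcher.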

On Tue, 18 Feb 2020 at 13:52, <[email protected]<mailto:[email protected]>> wrote:


    Michael and Forum,-
    Thanks for the great explanations.

    One question, please:

    Why is PrefixQuery used instead of WildcardQuery in the snippet
    below?

    Best regards

    > On Feb 17, 2020, at 3:01 PM, Michael Froh <[email protected]
    <mailto:[email protected]>> wrote:
    >
    > Hi Baris,
    >
    > The idea with PhraseWildcardQuery is that you can mix literal "exact" terms
    > with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using addTerm is
    > for exact terms, while addMultiTerm is for things that may match a number
    > of possible terms in the given position.
    >
    > If you want to search for term1 followed by any term that starts with a
    > given character, I would suggest using:
    >
    > int maxMultiTermExpansions = ...; // Discussed below
    > PhraseWildcardQuery.Builder builder =
    >     new PhraseWildcardQuery.Builder("field", maxMultiTermExpansions);
    > // Add fixed term in position 0
    > builder.addTerm(new BytesRef("term1"));
    > // Add multiterm in position 1
    > builder.addMultiTerm(new PrefixQuery(new Term("field", "term2FirstChar")));
    > Query q = builder.build();
    >
    > The PrefixQuery effectively gets expanded into a bunch of possible terms,
    > based on the term dictionary on each index segment. To avoid expanding to
    > cover too many terms (say, if you added a bunch of WildcardQuery),
    > maxMultiTermExpansions serves as a guard rail, to put a rough bound on
    > memory consumption and query execution time. If you're interested in
    > details of how the maxMultiTermExpansions budget is distributed across
    > MultiTerms, check out PhraseWildcardQuery.createWeight. If you're just
    > running an experiment in your IDE, you could probably set
    > maxMultiTermExpansions to Integer.MAX_VALUE. (If you're running in a
    > production environment, it's likely a good idea to tune it down based on
    > your memory/latency constraints.)
    >
    > Incidentally, for tracking down the source code for anything in Lucene,
    > it's probably better to go to GitHub for the most up-to-date source:
    > https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
    >
    > Hope that helps,
    > Michael
    >
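
The guard-rail role of maxMultiTermExpansions described above can be illustrated with a small plain-Java sketch. This is a simplified illustration under stated assumptions, not the budget distribution implemented in PhraseWildcardQuery.createWeight: the class `ExpansionBudgetSketch`, the method `expand`, and the per-segment term lists are hypothetical, and the sketch simply shares one counter across all segments so that whatever one segment leaves unused remains available to the next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpansionBudgetSketch {
    // Collects the terms that start with `prefix` from each segment's term list,
    // stopping once `maxExpansions` terms have been collected in total.
    // A term that occurs in two segments is counted once per segment.
    public static List<String> expand(List<List<String>> segments,
                                      String prefix,
                                      int maxExpansions) {
        List<String> expanded = new ArrayList<>();
        int remaining = maxExpansions; // one budget shared by all segments
        for (List<String> segmentTerms : segments) {
            for (String term : segmentTerms) {
                if (remaining <= 0) {
                    return expanded; // budget exhausted: stop expanding
                }
                if (term.startsWith(prefix)) {
                    expanded.add(term);
                    remaining--;
                }
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        // Two hypothetical index segments, each with its own term dictionary.
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("term1", "term2a", "term2b"),
            Arrays.asList("term2c", "term2d", "term3"));

        // Tight budget: only three expansions are collected.
        System.out.println(expand(segments, "term2", 3));
        // Effectively unbounded budget, as suggested for an IDE experiment.
        System.out.println(expand(segments, "term2", Integer.MAX_VALUE));
    }
}
```

With a budget of 3 the sketch collects term2a, term2b and term2c and never reaches term2d. With Integer.MAX_VALUE it collects all four matching terms, at the cost of unbounded work on a large term dictionary.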
    >> On Thu, 13 Feb 2020 at 12:29, <[email protected]
    <mailto:[email protected]>> wrote:
    >>
    >> Hi,-
    >>
    >> I hope everyone is doing great.
    >>
    >> I want to do the following search with PhraseWildcardQuery (thanks to
    >> this forum for letting me know about this class, especially to David
    >> and Bruno):
    >>
    >> term1 term2FirstChar*
    >>
    >> I see two ways to do it. (I found the source code at
    >> https://fossies.org/linux/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
    >> )
    >>
    >> /*
    >>
    >> maxMultiTermExpansions - The maximum number of expansions across all
    >> multi-terms and across all segments. It counts expansions for each
    >> segments individually, that allows optimizations per segment and unused
    >> expansions are credited to next segments. This is different from
    >> MultiPhraseQuery and SpanMultiTermQueryWrapper which have an expansion
    >> limit per multi-term.
    >>
    >> segmentOptimizationEnabled - Whether to enable the segment optimization
    >> which consists in ignoring a segment for further analysis as soon as a
    >> term is not present inside it. This optimizes the query execution
    >> performance but changes the scoring. The result ranking is preserved.
    >>
    >> */
    >>
    >>
    >> 1st way:
    >>
    >> PhraseWildcardQuery.Builder pwcqBuilder = new PhraseWildcardQuery.Builder(field,
    >>     2 /* I don't know what number to use here for maxMultiTermExpansions */,
    >>     true /* boolean segmentOptimizationEnabled */);
    >>
    >> pwcqBuilder.addTerm(field, new Term(field, "term1"));
    >>
    >> pwcqBuilder.addTerm(field, new Term(field, "term2FirstChar"));
    >>
    >> PhraseWildcardQuery pwcq = pwcqBuilder.build();
    >>
    >> or
    >>
    >> 2nd way:
    >>
    >> pwcqBuilder.addMultiTerm(MultiTermQuery object here containing {field,
    >> "term1"} and {field, "term2FirstChar"});
    >>
    >> PhraseWildcardQuery pwcq = pwcqBuilder.build();
    >>
    >>
    >> Then this pwcq object will be passed to IndexSearcher as the query
    >> parameter.
    >>
    >>
    >> Now, it looks like the first way will not consider expansions, or in
    >> other words wildcards. Am I right?
    >>
    >> I also need to understand this maxMultiTermExpansions parameter better.
    >> For instance, if the first way is used, will maxMultiTermExpansions be
    >> meaningful?
    >>
    >>
    >> Thanks
    >>
    >>


---------------------------------------------------------------------
    To unsubscribe, e-mail: [email protected]
    <mailto:[email protected]>
    For additional commands, e-mail: [email protected]
    <mailto:[email protected]>

