Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene

baris . kazar Wed, 26 Feb 2020 15:12:17 -0800

Follow-up on this thread:

I ended up using WildcardQuery with "*" at the end of the last token for the PhraseWildcardQuery class from the sandbox.

I tested this class rigorously, and I think it is ready to move from the sandbox jar to the appropriate release jar.


Is there a plan for that?

PhraseWildcardQuery is on average 3-4 times faster than the ComplexPhraseQueryParser class and gives the same results.

I made some further naive enhancements on top of the ComplexPhraseQueryParser results, and I plan to apply them to this new class as well, which will bring the execution time down by another factor of 2 to 3.



Best regards


On 2/21/20 12:34 PM, baris.ka...@oracle.com wrote:

Hi,-
It looks like the only way to use and test the new PhraseWildcardQuery class in the Lucene 8.4.0 sandbox is to switch from Lucene 7.7.2 to Lucene 8.4.0.
I thought I could adapt it to Lucene 7.7.2, but so far I have found that I would need to heavily change 20+ classes, and probably many more than that.
So, if anybody wants to use this amazing new class, you need to be on Lucene 8.4.0.
http://lucene.apache.org/core/8_4_0/sandbox/index.html

Best regards


On 2/19/20 5:41 PM, baris.ka...@oracle.com wrote:
Hi,-
Is there a JAR file for the classes in the https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search directory and the index and analysis directories?
https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search does not have the PhraseWildcardQuery class, though.
As Michael mentioned, I pulled it from
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
It seems that many classes in these directories are incompatible with Lucene version 7.7.2. Probably these are from the Lucene 8.x series.
It would be very nice to have a JAR file to be able to use all these classes together with Lucene 7.x versions.
Best regards


On 2/19/20 3:42 PM, baris.ka...@oracle.com wrote:
Hi,-
Thanks again to Michael, David, Bruno, and the forum for letting me know about this repository.
The version of PhraseWildcardQuery on https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java uses some classes that are not available in Lucene version 7.7.2.
There are a number of new and modified classes used by the PhraseWildcardQuery class, such as QueryVisitor, ScoreMode, etc.
I will try to add these classes from https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene
and I hope it will work with Lucene 7.7.2.

Best regards



On 2/18/20 8:49 PM, baris.ka...@oracle.com wrote:
Michael and Forum,-
This is amazing, thanks.

I will try both cases.

I can also have "term1 term2Char1term2Char2*"
and so on with term2's next chars.

I hope the latest version on github for this
class works with Lucene Version 7.7.2.

Best regards
On Feb 18, 2020, at 8:33 PM, Michael Froh <msf...@gmail.com> wrote:
In your example, it looks like you wanted the second term to match based on the first character, or prefix, of the term.
While you could use a WildcardQuery with a term value of "term2FirstChar*", PrefixQuery seemed like the simpler approach. WildcardQuery can handle more general cases, like if you want to match on something like "a*b*c".
Technically, the PrefixQuery compiles down to a slightly simpler automaton, but I only figured that out by writing a simple unit test:
    public void testAutomata() {
        Automaton prefixAutomaton = PrefixQuery.toAutomaton(new BytesRef("a"));
        Automaton wildcardAutomaton = WildcardQuery.toAutomaton(new Term("foo", "a*"));
        System.out.println("PrefixQuery(\"a\")");
        System.out.println(prefixAutomaton.toDot());
        System.out.println("WildcardQuery(\"a*\")");
        System.out.println(wildcardAutomaton.toDot());
    }

That produces the following output:

PrefixQuery("a")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 1 [label="\\U00000000-\\U000000ff"]
}
WildcardQuery("a*")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 2 [label="\\U00000000-\\U0010ffff"]
  2 [shape=doublecircle,label="2"]
  2 -> 2 [label="\\U00000000-\\U0010ffff"]
}
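
For a pattern that ends in a single trailing "*", wildcard matching and prefix matching accept the same terms, even though the two automata above differ in shape. The following is a minimal plain-Java sketch of that point, not Lucene code: the class PrefixVsWildcard and the helper wildcardMatches are hypothetical names, and the matcher only models "*" (any run of characters) and "?" (exactly one character).

```java
import java.util.List;

class PrefixVsWildcard {

    // Hypothetical helper (not a Lucene API): greedy glob matcher with backtracking.
    // '*' matches any run of characters (including none), '?' matches exactly one.
    static boolean wildcardMatches(String pattern, String text) {
        int p = 0;      // position in pattern
        int t = 0;      // position in text
        int star = -1;  // index of the most recent '*' in pattern, or -1
        int mark = 0;   // text position where that '*' started matching
        while (t < text.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;
                mark = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == text.charAt(t))) {
                p++;
                t++;
            } else if (star >= 0) {
                // Mismatch: let the last '*' absorb one more character and retry.
                p = star + 1;
                t = ++mark;
            } else {
                return false;
            }
        }
        // Only trailing '*' may remain unmatched in the pattern.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        List<String> terms = List.of("a", "apple", "banana", "", "abc", "aXbYc");
        for (String term : terms) {
            boolean byPrefix = term.startsWith("a");
            boolean byWildcard = wildcardMatches("a*", term);
            System.out.println("\"" + term + "\" prefix=" + byPrefix + " wildcard=" + byWildcard);
        }
        // A more general pattern that a plain prefix check cannot express:
        System.out.println(wildcardMatches("a*b*c", "aXbYc"));
    }
}
```

The two columns printed by main agree for every term, which is why PrefixQuery is enough when the only wildcard is a trailing "*"; something like "a*b*c" needs the general matcher.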
On Tue, 18 Feb 2020 at 13:52, <baris.ka...@oracle.com> wrote:
    Michael and Forum,-
    Thanks for the great explanations.

    One question, please:

    Why is PrefixQuery used instead of WildcardQuery in the snippet
    below?

    Best regards

    > On Feb 17, 2020, at 3:01 PM, Michael Froh <msf...@gmail.com> wrote:
    >
    > Hi Baris,
    >
    > The idea with PhraseWildcardQuery is that you can mix literal "exact" terms
    > with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using addTerm is
    > for exact terms, while addMultiTerm is for things that may match a number
    > of possible terms in the given position.
    >
    > If you want to search for term1 followed by any term that starts with a
    > given character, I would suggest using:
    >
    > int maxMultiTermExpansions = ...; // Discussed below
    > PhraseWildcardQuery.Builder builder =
    >     new PhraseWildcardQuery.Builder("field", maxMultiTermExpansions);
    > builder.addTerm(new BytesRef("term1")); // Add fixed term in position 0
    > builder.addMultiTerm(new PrefixQuery(new Term("field", "term2FirstChar"))); // Add multiterm in position 1
    > Query q = builder.build();
    >
    > The PrefixQuery effectively gets expanded into a bunch of possible terms,
    > based on the term dictionary on each index segment. To avoid expanding to
    > cover too many terms (say, if you added a bunch of WildcardQuery),
    > maxMultiTermExpansions serves as a guard rail, to put a rough bound on
    > memory consumption and query execution time. If you're interested in
    > details of how the maxMultiTermExpansions budget is distributed across
    > MultiTerms, check out PhraseWildcardQuery.createWeight. If you're just
    > running an experiment in your IDE, you could probably set
    > maxMultiTermExpansions to Integer.MAX_VALUE. (If you're running in a
    > production environment, it's likely a good idea to tune it down based on
    > your memory/latency constraints.)
    >
    > Incidentally, for tracking down the source code for anything in Lucene,
    > it's probably better to go to GitHub for the most up-to-date source:
    > https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
    >
    > Hope that helps,
    > Michael
    >
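
As a rough illustration of the guard-rail idea described in the quoted message, here is a toy model in plain Java. It is not how PhraseWildcardQuery is implemented (PhraseWildcardQuery.createWeight is the place to look for that). The assumptions are: each segment is modeled as a sorted set of strings, matches are counted per segment individually, and one shared budget is spent across segments. The names ExpansionBudgetSketch and expandPrefix are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class ExpansionBudgetSketch {

    // Toy model (not Lucene's implementation): expand a prefix against the
    // sorted terms of each "segment", spending one unit of a shared budget
    // per expanded term and stopping once the budget is used up.
    static List<String> expandPrefix(List<NavigableSet<String>> segments,
                                     String prefix,
                                     int maxExpansions) {
        List<String> expanded = new ArrayList<>();
        int remaining = maxExpansions;
        for (NavigableSet<String> segment : segments) {
            // Seek to the first term >= prefix; matching terms are contiguous from there.
            for (String term : segment.tailSet(prefix, true)) {
                if (!term.startsWith(prefix)) {
                    break; // walked past the last term with this prefix
                }
                if (remaining <= 0) {
                    return expanded; // budget exhausted: stop expanding
                }
                expanded.add(term);
                remaining--;
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        NavigableSet<String> segment0 =
                new TreeSet<>(List.of("other", "term1", "term2a", "term2b"));
        NavigableSet<String> segment1 =
                new TreeSet<>(List.of("term2a", "term2c"));
        List<NavigableSet<String>> segments = List.of(segment0, segment1);

        // A small budget cuts the expansion short; a large one lets it run to completion.
        System.out.println(expandPrefix(segments, "term2", 3));
        System.out.println(expandPrefix(segments, "term2", Integer.MAX_VALUE));
    }
}
```

In this model a term that appears in two segments is counted twice, and a budget that is too small silently drops later terms, which is the trade-off between completeness and memory/latency that the quoted message describes.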
    >> On Thu, 13 Feb 2020 at 12:29, <baris.ka...@oracle.com> wrote:
    >>
    >> Hi,-
    >>
    >> I hope everyone is doing great.
    >>
    >> I want to do the following search with PhraseWildcardQuery (thanks to
    >> this forum, especially David and Bruno, for letting me know about this
    >> class):
    >>
    >> term1 term2FirstChar*
    >>
    >> I see two ways to do it. (I found the source code at
    >> https://fossies.org/linux/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java)
    >>
    >> /*
    >>
    >> maxMultiTermExpansions - The maximum number of expansions across all
    >> multi-terms and across all segments. It counts expansions for each
    >> segments individually, that allows optimizations per segment and unused
    >> expansions are credited to next segments. This is different from
    >> MultiPhraseQuery and SpanMultiTermQueryWrapper which have an expansion
    >> limit per multi-term.
    >>
    >> segmentOptimizationEnabled - Whether to enable the segment optimization
    >> which consists in ignoring a segment for further analysis as soon as a
    >> term is not present inside it. This optimizes the query execution
    >> performance but changes the scoring. The result ranking is preserved.
    >>
    >> */
    >>
    >>
    >> 1st way:
    >>
    >> PhraseWildcardQuery.Builder pwcqBuilder = new PhraseWildcardQuery.Builder(field,
    >>     2 /* I don't know what number to use here for maxMultiTermExpansions */,
    >>     true /* boolean segmentOptimizationEnabled */);
    >>
    >> pwcqBuilder.addTerm(new Term(field, "term1"));
    >>
    >> pwcqBuilder.addTerm(new Term(field, "term2FirstChar"));
    >>
    >> PhraseWildcardQuery pwcq = pwcqBuilder.build();
    >>
    >> or
    >>
    >> 2nd way:
    >>
    >> pwcqBuilder.addMultiTerm(MultiTermQuery object here containing
    >> {field, "term1"} and {field, "term2FirstChar"});
    >>
    >> PhraseWildcardQuery pwcq = pwcqBuilder.build();
    >>
    >>
    >> Then this pwcq object will be fed into IndexSearcher as the query
    >> parameter.
    >>
    >>
    >> Now, it looks like the first way will not consider expansions, or in
    >> other words wildcards? Am I right?
    >>
    >> I also need to understand this maxMultiTermExpansions parameter better.
    >> For instance, if the first way is used, will maxMultiTermExpansions be
    >> meaningful?
    >>
    >>
    >> Thanks
    >>
    >>


---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
    <mailto:java-user-unsubscr...@lucene.apache.org>
    For additional commands, e-mail: java-user-h...@lucene.apache.org
    <mailto:java-user-h...@lucene.apache.org>

