Followup on this thread:

I ended up using a WildcardQuery with "*" at the end of the last token, together with the PhraseWildcardQuery class from the sandbox.
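
For anyone following along, here is a minimal sketch of the token handling described above, in plain Java with no Lucene dependency. The class and method names (LastTokenWildcard, Position, toPositions) are hypothetical and only for illustration. It splits the phrase on whitespace and marks only the last token as a wildcard pattern.

```java
import java.util.ArrayList;
import java.util.List;

public class LastTokenWildcard {

    /** One phrase position: the token text and whether it is a wildcard pattern. */
    static final class Position {
        final String text;
        final boolean wildcard;

        Position(String text, boolean wildcard) {
            this.text = text;
            this.wildcard = wildcard;
        }
    }

    /** Splits on whitespace and appends "*" to the last token only. */
    static List<Position> toPositions(String phrase) {
        List<Position> positions = new ArrayList<>();
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            return positions; // nothing to search for
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            boolean last = (i == tokens.length - 1);
            positions.add(new Position(last ? tokens[i] + "*" : tokens[i], last));
        }
        return positions;
    }

    public static void main(String[] args) {
        // prints "exact term: term1" then "multi-term: term2FirstChar*"
        for (Position p : toPositions("term1 term2FirstChar")) {
            System.out.println((p.wildcard ? "multi-term: " : "exact term: ") + p.text);
        }
    }
}
```

In real code, each exact position would presumably go to the builder's addTerm and the wildcard position to addMultiTerm wrapped in a WildcardQuery, as in Michael's snippet quoted further down.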


I tested this class rigorously, and I think it is ready to be moved from the sandbox JAR to the appropriate release JAR.

Is there a plan for that?


PhraseWildcardQuery is on average 3-4 times faster than the ComplexPhraseQueryParser class and gives the same results.


I made some further naive enhancements on top of the ComplexPhraseQueryParser results, and I plan to apply them to this new class as well, which should bring the execution time down by another factor of 2 to 3.


Best regards


On 2/21/20 12:34 PM, baris.ka...@oracle.com wrote:
Hi,-

 It looks like the only way to use and test the new PhraseWildcardQuery class from the Lucene 8.4.0 sandbox is to switch from Lucene 7.7.2 to Lucene 8.4.0.

I thought I could adapt it to Lucene 7.7.2, but so far I have found that I would need to heavily modify 20+ classes, and probably many more than that.

So, if anybody wants to use this amazing new class, you need to be on Lucene 8.4.0.

http://lucene.apache.org/core/8_4_0/sandbox/index.html

Best regards


On 2/19/20 5:41 PM, baris.ka...@oracle.com wrote:
Hi,-

 Is there a JAR file for the classes in the https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search, index and analysis directories?

https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search does not have PhraseWildcardQuery class, though.

As Michael mentioned, i pulled it from

https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java


It seems that many classes in these directories are incompatible with Lucene version 7.7.2. They are probably from the Lucene 8.x series.

It would be very nice to have a JAR file that makes it possible to use all these classes together with Lucene 7.x versions.


Best regards


On 2/19/20 3:42 PM, baris.ka...@oracle.com wrote:
Hi,-

Thanks again Michael, David and Bruno and the Forum for letting me know this repository.

The version of PhraseWildcardQuery at https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java uses some classes that are not available in Lucene version 7.7.2.

There are a number of new and modified classes used by the PhraseWildcardQuery class, such as QueryVisitor, ScoreMode, etc.

I will try to add these classes from https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene
and I hope it will work with Lucene 7.7.2.

Best regards



On 2/18/20 8:49 PM, baris.ka...@oracle.com wrote:
Michael and Forum,-
This is amazing, thanks.

I will try both cases.

I can also have "term1 term2Char1term2Char2*"
and so on with term2's next chars.

I hope the latest version of this class on GitHub
works with Lucene version 7.7.2.

Best regards

On Feb 18, 2020, at 8:33 PM, Michael Froh <msf...@gmail.com> wrote:


In your example, it looks like you wanted the second term to match based on the first character, or prefix, of the term.

While you could use a WildcardQuery with a term value of "term2FirstChar*", PrefixQuery seemed like the simpler approach. WildcardQuery can handle more general cases, like if you want to match on something like "a*b*c".

Technically, the PrefixQuery compiles down to a slightly simpler automaton, but I only figured that out by writing a simple unit test:

    public void testAutomata() {
        Automaton prefixAutomaton = PrefixQuery.toAutomaton(new BytesRef("a"));
        Automaton wildcardAutomaton = WildcardQuery.toAutomaton(new Term("foo", "a*"));

        System.out.println("PrefixQuery(\"a\")");
        System.out.println(prefixAutomaton.toDot());
        System.out.println("WildcardQuery(\"a*\")");
        System.out.println(wildcardAutomaton.toDot());
    }

That produces the following output:

PrefixQuery("a")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 1 [label="\\U00000000-\\U000000ff"]
}
WildcardQuery("a*")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 2 [label="\\U00000000-\\U0010ffff"]
  2 [shape=doublecircle,label="2"]
  2 -> 2 [label="\\U00000000-\\U0010ffff"]
}
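
As a rough cross-check of the point above, here is a minimal sketch in plain Java. It does not use Lucene's automaton classes; java.util.regex stands in for them, and the names WildcardVsPrefix, wildcardToRegex and matchesWildcard are made up for illustration. It assumes only the '*' (any run of characters) and '?' (exactly one character) wildcards and ignores escaping. It shows that a prefix match and a trailing-"*" wildcard accept the same terms, while a pattern like "a*b*c" needs the more general form.

```java
import java.util.regex.Pattern;

public class WildcardVsPrefix {

    /** Translates a wildcard pattern ('*' = any run, '?' = one character) into a regex. */
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                // Flush the pending literal text, quoted so regex metacharacters stay literal.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the whole term matches the wildcard pattern. */
    static boolean matchesWildcard(String wildcard, String term) {
        return wildcardToRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        String[] terms = {"a", "apple", "abc", "axbyc", "banana"};
        for (String term : terms) {
            boolean prefix = term.startsWith("a");             // PrefixQuery-style match
            boolean trailing = matchesWildcard("a*", term);    // same terms as the prefix match
            boolean general = matchesWildcard("a*b*c", term);  // only expressible as a wildcard
            System.out.println(term + " prefix=" + prefix + " a*=" + trailing + " a*b*c=" + general);
        }
    }
}
```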



On Tue, 18 Feb 2020 at 13:52, <baris.ka...@oracle.com <mailto:baris.ka...@oracle.com>> wrote:

    Michael and Forum,-
    Thanks for the great explanations.

    One question, please:

    Why is PrefixQuery used instead of WildcardQuery in the snippet
    below?

    Best regards

    > On Feb 17, 2020, at 3:01 PM, Michael Froh <msf...@gmail.com
    <mailto:msf...@gmail.com>> wrote:
    >
    > Hi Baris,
    >
    > The idea with PhraseWildcardQuery is that you can mix literal "exact" terms
    > with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using addTerm is
    > for exact terms, while addMultiTerm is for things that may match a number
    > of possible terms in the given position.
    >
    > If you want to search for term1 followed by any term that starts with a
    > given character, I would suggest using:
    >
    > int maxMultiTermExpansions = ...; // Discussed below
    > PhraseWildcardQuery.Builder builder =
    >     new PhraseWildcardQuery.Builder("field", maxMultiTermExpansions);
    > // Add fixed term in position 0
    > builder.addTerm(new BytesRef("term1"));
    > // Add multiterm in position 1
    > builder.addMultiTerm(new PrefixQuery(new Term("field", "term2FirstChar")));
    > Query q = builder.build();
    >
    > The PrefixQuery effectively gets expanded into a bunch of possible terms,
    > based on the term dictionary on each index segment. To avoid expanding to
    > cover too many terms (say, if you added a bunch of WildcardQuery),
    > maxMultiTermExpansions serves as a guard rail, to put a rough bound on
    > memory consumption and query execution time. If you're interested in
    > details of how the maxMultiTermExpansions budget is distributed across
    > MultiTerms, check out PhraseWildcardQuery.createWeight. If you're just
    > running an experiment in your IDE, you could probably set
    > maxMultiTermExpansions to Integer.MAX_VALUE. (If you're running in a
    > production environment, it's likely a good idea to tune it down based on
    > your memory/latency constraints.)
    >
    > Incidentally, for tracking down the source code for anything in Lucene,
    > it's probably better to go to GitHub for the most up-to-date source:
    >
    > https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
    >
    > Hope that helps,
    > Michael
    >
    >> On Thu, 13 Feb 2020 at 12:29, <baris.ka...@oracle.com
    <mailto:baris.ka...@oracle.com>> wrote:
    >>
    >> Hi,-
    >>
    >> i hope everyone is doing great.
    >>
    >> I want to do the following search with PhraseWildcardQuery (thanks
    >> to this forum, especially David and Bruno, for letting me know about
    >> this class):
    >>
    >> term1 term2FirstChar*
    >>
    >> It seems there are two ways to do it. (I found the source code at
    >> https://fossies.org/linux/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
    >> )
    >>
    >> /*
    >>
    >> maxMultiTermExpansions - The maximum number of expansions across all
    >> multi-terms and across all segments. It counts expansions for each
    >> segments individually, that allows optimizations per segment and unused
    >> expansions are credited to next segments. This is different from
    >> MultiPhraseQuery and SpanMultiTermQueryWrapper which have an expansion
    >> limit per multi-term.
    >>
    >> segmentOptimizationEnabled - Whether to enable the segment optimization
    >> which consists in ignoring a segment for further analysis as soon as a
    >> term is not present inside it. This optimizes the query execution
    >> performance but changes the scoring. The result ranking is preserved.
    >>
    >> */
    >>
    >>
    >> 1st way:
    >>
    >> PhraseWildcardQuery.Builder pwcqBuilder = new PhraseWildcardQuery.Builder(field,
    >>     2 /* <<< I don't know what number to use here for maxMultiTermExpansions >>> */,
    >>     true /* boolean segmentOptimizationEnabled */);
    >>
    >> pwcqBuilder.addTerm(new Term(field, "term1"));
    >>
    >> pwcqBuilder.addTerm(new Term(field, "term2FirstChar"));
    >>
    >> PhraseWildcardQuery pwcq = pwcqBuilder.build();
    >>
    >> or
    >>
    >> 2nd way:
    >>
    >> pwcqBuilder.addMultiTerm(MultiTermQuery object here containing {field,
    >> "term1"} and {field, "term2FirstChar"});
    >>
    >> PhraseWildcardQuery pwcq = pwcqBuilder.build();
    >>
    >>
    >> Then this pwcq object will be passed to IndexSearcher as the query
    >> parameter.
    >>
    >>
    >> Now, it looks like the first way will not consider expansions, or in
    >> other words wildcards. Am I right?
    >>
    >> I also need to understand the maxMultiTermExpansions parameter better.
    >> For instance, if the first way is used, will maxMultiTermExpansions be
    >> meaningful?
    >>
    >>
    >> Thanks
    >>
    >>
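
To make the maxMultiTermExpansions "guard rail" discussed above a bit more concrete, here is a deliberately simplified model in plain Java. It is not how PhraseWildcardQuery.createWeight distributes the budget; it assumes a single shared pool that each segment draws from in turn, with whatever is left over available to the next segment. Each "segment" is just a sorted set of strings, and all names here (ExpansionBudgetSketch, segment, expandPrefix, expandAcrossSegments) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class ExpansionBudgetSketch {

    /** Builds a toy "segment": a sorted term dictionary. */
    static TreeSet<String> segment(String... terms) {
        return new TreeSet<>(Arrays.asList(terms));
    }

    /** Collects terms starting with the prefix from one segment, stopping at the limit. */
    static List<String> expandPrefix(TreeSet<String> segmentTerms, String prefix, int limit) {
        List<String> expanded = new ArrayList<>();
        // In sorted order, all terms sharing the prefix are contiguous, starting at the prefix itself.
        for (String term : segmentTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || expanded.size() >= limit) {
                break;
            }
            expanded.add(term);
        }
        return expanded;
    }

    /** Expands the prefix segment by segment, drawing on one shared budget. */
    static List<List<String>> expandAcrossSegments(
            List<TreeSet<String>> segments, String prefix, int maxExpansions) {
        List<List<String>> perSegment = new ArrayList<>();
        int remaining = maxExpansions;
        for (TreeSet<String> seg : segments) {
            List<String> expanded = expandPrefix(seg, prefix, remaining);
            remaining -= expanded.size(); // unused budget carries over to the next segment
            perSegment.add(expanded);
        }
        return perSegment;
    }

    public static void main(String[] args) {
        List<TreeSet<String>> segments = Arrays.asList(
                segment("tea", "team", "term", "test"),
                segment("ten", "tent", "top"));
        // prints [[tea, team, term, test], [ten]]
        System.out.println(expandAcrossSegments(segments, "te", 5));
    }
}
```

With a budget of 5, the first segment uses 4 expansions and the second gets only 1, so "tent" is never considered. This is the kind of truncation a budget that is too small would cause, and a reason to tune the value against real data.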


---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
    <mailto:java-user-unsubscr...@lucene.apache.org>
    For additional commands, e-mail: java-user-h...@lucene.apache.org
    <mailto:java-user-h...@lucene.apache.org>


