Thanks, David. Can I look at the source code? I think ComplexPhraseQueryParser uses something similar. I will check the differences myself, but do you know them offhand, for quick reference? Thanks
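For quick reference while comparing, here is a minimal sketch of building such a query directly with the sandbox class's Builder. The field name "content", the cap of 128 expansions, and the terms are illustrative only, the API may differ in your Lucene version, and this is untested:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseWildcardQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;

    public class PhraseWildcardDemo {
        public static Query buildQuery() {
            // Phrase "term1 te*" on field "content"; 128 caps how many terms
            // each wildcard (multi-term) position is allowed to expand to.
            return new PhraseWildcardQuery.Builder("content", 128)
                    .addTerm(new Term("content", "term1"))                    // exact position
                    .addMultiTerm(new PrefixQuery(new Term("content", "te"))) // wildcard position
                    .build();
        }
    }

As far as I can tell from the sources, the main difference is that ComplexPhraseQueryParser rewrites the whole phrase into span queries, expanding every wildcard up front, while PhraseWildcardQuery expands wildcard positions segment by segment and tries to prune expansions using the other terms of the phrase, which is what the cap above bounds.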
> On Feb 12, 2020, at 6:41 PM, David Smiley <david.w.smi...@gmail.com> wrote:
>
> Hi,
>
> See org.apache.lucene.search.PhraseWildcardQuery in Lucene's sandbox module.
> It was recently added by my amazing colleague Bruno. At this time there is
> no query parser that uses it in Lucene, unfortunately, but you can rectify
> this for your own purposes. I hope this query "graduates" to Lucene core some
> day. Its placement in the sandbox is why it can't be added to any of Lucene's
> query parsers, such as the complex phrase parser.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>> On Wed, Feb 12, 2020 at 11:07 AM <baris.ka...@oracle.com> wrote:
>> Hi,
>>
>> Regarding the mechanisms I mentioned below: does this class offer any
>> shingling capability embedded in it? I could not find any API within
>> ComplexPhraseQueryParser for that purpose.
>>
>> For instance, does this class offer a most-commonly-used-words API? I
>> could then take the second and third chars from one of those words and
>> search like
>>
>> term1 term2FirstCharTerm2SecondChar*
>>
>> (where I would look up term2FirstChar in my dictionary hashmap for the
>> most common word value and bring its second char into the search query).
>> Having the second char in the search query reduces search time by a
>> factor of 20.
>>
>> Otherwise, do I have to use the following at index time? I already have
>> a TextField index with my custom analyzer. How should I embed the
>> shingle filter into my current custom analyzer? I don't want to disturb
>> my current indexing. All I want to do is find the most common word in my
>> data for each letter of the alphabet. Should I do this at search time?
>> That would be costly, right?
>>
>> http://www.philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/
>>
>> If you need to parse the token n-grams of a string, you may use the
>> facilities offered by Lucene analyzers. What you simply have to do is
>> build your own analyzer using a ShingleMatrixFilter with the parameters
>> that suit your needs. For instance, here are the few lines of code to
>> build a token bi-grams analyzer:
>>
>> public class NGramAnalyzer extends Analyzer {
>>     @Override
>>     public TokenStream tokenStream(String fieldName, Reader reader) {
>>         return new StopFilter(
>>             new LowerCaseFilter(
>>                 new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
>>             StopAnalyzer.ENGLISH_STOP_WORDS);
>>     }
>> }
>>
>> The parameters of the ShingleMatrixFilter simply state the minimum and
>> maximum shingle size. “Shingle” is just another name for token n-grams,
>> and shingles are popular as the basic units for solving problems in
>> spell checking, near-duplicate detection, and others. Note also the use
>> of a StandardTokenizer to deal with basic special characters like
>> hyphens or other “disturbers”.
>>
>> To use the analyzer, you can for instance do:
>>
>> public static void main(String[] args) {
>>     try {
>>         String str = "An easy way to write an analyzer for tokens bi-gram"
>>             + " (or even tokens n-grams) with lucene";
>>         Analyzer analyzer = new NGramAnalyzer();
>>
>>         TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
>>         Token token = new Token();
>>         while ((token = stream.next(token)) != null) {
>>             System.out.println(token.term());
>>         }
>>     } catch (IOException ie) {
>>         System.out.println("IO Error " + ie.getMessage());
>>     }
>> }
>>
>> The output will print:
>>
>> an easy
>> easy way
>> way to
>> to write
>> write an
>> an analyzer
>> analyzer for
>> for tokens
>> tokens bi
>> bi gram
>> gram or
>> or even
>> even tokens
>> tokens n
>> n grams
>> grams with
>> with lucene
>>
>> Note that the text “bi-gram” was treated as two different tokens, a
>> desired consequence of using a StandardTokenizer in the
>> ShingleMatrixFilter initialization.
>>
>> Best regards
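A note on that 2009 post: ShingleMatrixFilter was removed back in Lucene 4.0; the analyzers-common module now provides ShingleFilter instead. Below is a rough sketch (class name made up, shingle sizes illustrative, untested) of a bi-gram analyzer on the current Analyzer API. Applying it to a separate shingle field, for example via PerFieldAnalyzerWrapper, is one way to leave the existing indexed field undisturbed:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class BigramShingleAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream sink = new LowerCaseFilter(source);
            // 2,2 = bi-grams only; "term1 term2" is emitted as one token.
            ShingleFilter shingles = new ShingleFilter(sink, 2, 2);
            shingles.setOutputUnigrams(false); // keep only the pairs, not single words
            return new TokenStreamComponents(source, shingles);
        }
    }

The usual trick is then to query the shingle field with a single PrefixQuery, so a phrase-prefix search like "term1 te*" becomes one term lookup ("term1 te*" against the pair tokens) instead of a positions-based phrase.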
>> On 2/4/20 11:14 AM, baris.ka...@oracle.com wrote:
>> >
>> > Thanks, but I thought this class would have a mechanism to fix this issue.
>> > Thanks
>> >
>> >> On Feb 4, 2020, at 4:14 AM, Mikhail Khludnev <m...@apache.org> wrote:
>> >>
>> >> It's slow per se, since it loads term positions. The usual advice is
>> >> shingling or edge ngrams. Note, if this is not text but a string or an
>> >> enum, that probably lets you apply other tricks. Another idea: perhaps
>> >> IntervalQueries can be smarter and faster in certain cases, although
>> >> they are backed by the same slow positions.
>> >>
>> >>> On Tue, Feb 4, 2020 at 7:25 AM <baris.ka...@oracle.com> wrote:
>> >>>
>> >>> How can this slowdown be resolved? Is this another limitation of
>> >>> this class?
>> >>> Thanks
>> >>>
>> >>>> On Feb 3, 2020, at 4:14 PM, baris.ka...@oracle.com wrote:
>> >>>> Please ignore the first comparison there. I was comparing {term1
>> >>>> with 2 chars} vs. {term1 with >= 5 chars + term2 with 1 char}.
>> >>>>
>> >>>> The slowdown is:
>> >>>>
>> >>>> The query "term1 term2*" slows down 400 times (~1500 millisecs)
>> >>>> compared to "term1*" when term1 has >5 chars and term2 is still
>> >>>> 1 char.
>> >>>>
>> >>>> Best regards
>> >>>>
>> >>>>> On 2/3/20 4:13 PM, baris.ka...@oracle.com wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>> I hope everyone is doing great.
>> >>>>>
>> >>>>> I saw this issue with this class: if you search for "term1*" it
>> >>>>> is fine (i.e., ~4 millisecs when it has >= 5 chars, and ~250
>> >>>>> millisecs when it has 2 chars), but when you search for "term1
>> >>>>> term2*" where term2 is a single char, the performance degrades
>> >>>>> badly.
>> >>>>> The query "term1 term2*" slows down 50 times (~200 millisecs)
>> >>>>> compared to the "term1*" case when term1 has >5 chars and term2
>> >>>>> is still 1 char.
>> >>>>> The query "term1 term2*" slows down 400 times (~1500 millisecs)
>> >>>>> compared to "term1*" when term1 has >5 chars and term2 is still
>> >>>>> 1 char.
>> >>>>> Is there any suggestion to speed it up?
>> >>>>>
>> >>>>> Best regards
>> >>
>> >> --
>> >> Sincerely yours
>> >> Mikhail Khludnev
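On the edge-ngrams part of Mikhail's suggestion: the sketch below, again hypothetical and untested, indexes the leading 1 to 3 characters of every token as extra terms, so a one- or two-char prefix like the "term2*" above can be answered with an exact term lookup instead of an expensive wildcard expansion. Constructor signatures for EdgeNGramTokenFilter differ across Lucene versions; the four-argument form is from 7.4+:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class EdgeNGramAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream sink = new LowerCaseFilter(source);
            // Index prefixes of length 1..3 of every token; 'true' preserves
            // the original full token as well.
            sink = new EdgeNGramTokenFilter(sink, 1, 3, true);
            return new TokenStreamComponents(source, sink);
        }
    }

At search time the user's partial last word is then looked up literally in this field, with no trailing wildcard, trading a larger index for cheaper queries.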