Thanks, David. Can I look at the source code? I think ComplexPhraseQueryParser uses something similar. I will check the differences myself, but do you know them offhand, for quick reference? Thanks
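For quick reference while comparing, here is a minimal sketch of building such a query directly with the sandbox class's Builder. The field name "content", the cap of 128 expansions, and the terms are illustrative only, the API may differ in your Lucene version, and this is untested:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseWildcardQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;

    public class PhraseWildcardDemo {
        public static Query buildQuery() {
            // Phrase "term1 te*" on field "content"; 128 caps how many terms
            // each wildcard (multi-term) position is allowed to expand to.
            return new PhraseWildcardQuery.Builder("content", 128)
                    .addTerm(new Term("content", "term1"))                    // exact position
                    .addMultiTerm(new PrefixQuery(new Term("content", "te"))) // wildcard position
                    .build();
        }
    }

As far as I can tell from the sources, the main difference is that ComplexPhraseQueryParser rewrites the whole phrase into span queries, expanding every wildcard up front, while PhraseWildcardQuery expands wildcard positions segment by segment and tries to prune expansions using the other terms of the phrase, which is what the cap above bounds.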
> On Feb 12, 2020, at 6:41 PM, David Smiley <david.w.smi...@gmail.com> wrote:
>
> Hi,
>
> See org.apache.lucene.search.PhraseWildcardQuery in Lucene's sandbox module.
> It was recently added by my amazing colleague Bruno. At this time there is
> no query parser that uses it in Lucene, unfortunately, but you can rectify
> this for your own purposes. I hope this query "graduates" to Lucene core some
> day. Its placement in the sandbox is why it can't be added to any of Lucene's
> query parsers, such as the complex phrase parser.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>> On Wed, Feb 12, 2020 at 11:07 AM <baris.ka...@oracle.com> wrote:
>> Hi,
>>
>> Regarding the mechanisms I mentioned below: does this class offer any
>> shingling capability embedded in it? I could not find any API within
>> ComplexPhraseQueryParser for that purpose.
>>
>> For instance, does this class offer a most-commonly-used-words API? I
>> could then take the second and third chars from one of those words and
>> search like
>>
>> term1 term2FirstCharTerm2SecondChar*
>>
>> (where I would look up term2FirstChar in my dictionary hashmap for the
>> most common word value and bring its second char into the search query).
>> Having the second char in the search query reduces search time by a
>> factor of 20.
>>
>> Otherwise, do I have to use the following at index time? I already have
>> a TextField index with my custom analyzer. How should I embed the
>> shingle filter into my current custom analyzer? I don't want to disturb
>> my current indexing. All I want to do is find the most common word in my
>> data for each letter of the alphabet. Should I do this at search time?
>> That would be costly, right?
>>
>> http://www.philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/
>>
>> If you need to parse the token n-grams of a string, you may use the
>> facilities offered by Lucene analyzers. What you simply have to do is
>> build your own analyzer using a ShingleMatrixFilter with the parameters
>> that suit your needs. For instance, here are the few lines of code to
>> build a token bi-grams analyzer:
>>
>> public class NGramAnalyzer extends Analyzer {
>>     @Override
>>     public TokenStream tokenStream(String fieldName, Reader reader) {
>>         return new StopFilter(
>>             new LowerCaseFilter(
>>                 new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
>>             StopAnalyzer.ENGLISH_STOP_WORDS);
>>     }
>> }
>>
>> The parameters of the ShingleMatrixFilter simply state the minimum and
>> maximum shingle size. “Shingle” is just another name for token n-grams,
>> and shingles are popular as the basic units for solving problems in
>> spell checking, near-duplicate detection, and others. Note also the use
>> of a StandardTokenizer to deal with basic special characters like
>> hyphens or other “disturbers”.
>>
>> To use the analyzer, you can for instance do:
>>
>> public static void main(String[] args) {
>>     try {
>>         String str = "An easy way to write an analyzer for tokens bi-gram"
>>             + " (or even tokens n-grams) with lucene";
>>         Analyzer analyzer = new NGramAnalyzer();
>>
>>         TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
>>         Token token = new Token();
>>         while ((token = stream.next(token)) != null) {
>>             System.out.println(token.term());
>>         }
>>     } catch (IOException ie) {
>>         System.out.println("IO Error " + ie.getMessage());
>>     }
>> }
>>
>> The output will print:
>>
>> an easy
>> easy way
>> way to
>> to write
>> write an
>> an analyzer
>> analyzer for
>> for tokens
>> tokens bi
>> bi gram
>> gram or
>> or even
>> even tokens
>> tokens n
>> n grams
>> grams with
>> with lucene
>>
>> Note that the text “bi-gram” was treated as two different tokens, a
>> desired consequence of using a StandardTokenizer in the
>> ShingleMatrixFilter initialization.
>>
>> Best regards
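A note on that 2009 post: ShingleMatrixFilter was removed back in Lucene 4.0; the analyzers-common module now provides ShingleFilter instead. Below is a rough sketch (class name made up, shingle sizes illustrative, untested) of a bi-gram analyzer on the current Analyzer API. Applying it to a separate shingle field, for example via PerFieldAnalyzerWrapper, is one way to leave the existing indexed field undisturbed:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class BigramShingleAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream sink = new LowerCaseFilter(source);
            // 2,2 = bi-grams only; "term1 term2" is emitted as one token.
            ShingleFilter shingles = new ShingleFilter(sink, 2, 2);
            shingles.setOutputUnigrams(false); // keep only the pairs, not single words
            return new TokenStreamComponents(source, shingles);
        }
    }

The usual trick is then to query the shingle field with a single PrefixQuery, so a phrase-prefix search like "term1 te*" becomes one term lookup ("term1 te*" against the pair tokens) instead of a positions-based phrase.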
>> On 2/4/20 11:14 AM, baris.ka...@oracle.com wrote:
>> >
>> > Thanks, but I thought this class would have a mechanism to fix this issue.
>> > Thanks
>> >
>> >> On Feb 4, 2020, at 4:14 AM, Mikhail Khludnev <m...@apache.org> wrote:
>> >>
>> >> It's slow per se, since it loads term positions. The usual advice is
>> >> shingling or edge ngrams. Note, if this is not text but a string or an
>> >> enum, that probably lets you apply other tricks. Another idea: perhaps
>> >> IntervalQueries can be smarter and faster in certain cases, although
>> >> they are backed by the same slow positions.
>> >>
>> >>> On Tue, Feb 4, 2020 at 7:25 AM <baris.ka...@oracle.com> wrote:
>> >>>
>> >>> How can this slowdown be resolved? Is this another limitation of
>> >>> this class?
>> >>> Thanks
>> >>>
>> >>>> On Feb 3, 2020, at 4:14 PM, baris.ka...@oracle.com wrote:
>> >>>> Please ignore the first comparison there. I was comparing {term1
>> >>>> with 2 chars} vs. {term1 with >= 5 chars + term2 with 1 char}.
>> >>>>
>> >>>> The slowdown is:
>> >>>>
>> >>>> The query "term1 term2*" slows down 400 times (~1500 millisecs)
>> >>>> compared to "term1*" when term1 has >5 chars and term2 is still
>> >>>> 1 char.
>> >>>>
>> >>>> Best regards
>> >>>>
>> >>>>> On 2/3/20 4:13 PM, baris.ka...@oracle.com wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>> I hope everyone is doing great.
>> >>>>>
>> >>>>> I saw this issue with this class: if you search for "term1*" it
>> >>>>> is fine (i.e., ~4 millisecs when it has >= 5 chars, and ~250
>> >>>>> millisecs when it has 2 chars), but when you search for "term1
>> >>>>> term2*" where term2 is a single char, the performance degrades
>> >>>>> badly.
>> >>>>> The query "term1 term2*" slows down 50 times (~200 millisecs)
>> >>>>> compared to the "term1*" case when term1 has >5 chars and term2
>> >>>>> is still 1 char.
>> >>>>> The query "term1 term2*" slows down 400 times (~1500 millisecs)
>> >>>>> compared to "term1*" when term1 has >5 chars and term2 is still
>> >>>>> 1 char.
>> >>>>> Is there any suggestion to speed it up?
>> >>>>>
>> >>>>> Best regards
>> >>
>> >> --
>> >> Sincerely yours
>> >> Mikhail Khludnev
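On the edge-ngrams part of Mikhail's suggestion: the sketch below, again hypothetical and untested, indexes the leading 1 to 3 characters of every token as extra terms, so a one- or two-char prefix like the "term2*" above can be answered with an exact term lookup instead of an expensive wildcard expansion. Constructor signatures for EdgeNGramTokenFilter differ across Lucene versions; the four-argument form is from 7.4+:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class EdgeNGramAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream sink = new LowerCaseFilter(source);
            // Index prefixes of length 1..3 of every token; 'true' preserves
            // the original full token as well.
            sink = new EdgeNGramTokenFilter(sink, 1, 3, true);
            return new TokenStreamComponents(source, sink);
        }
    }

At search time the user's partial last word is then looked up literally in this field, with no trailing wildcard, trading a larger index for cheaper queries.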