Thanks Robert and Uwe for your feedback! I am not sure what you mean that the slowness does not come from the iteration over the term index. I did a small profiling (screenshot attached) of sending repeatedly a TermRangeQuery that matches almost the whole index and I observe that 80% of the time is spend on IntersectTermsEnum.next() method. Isn't this the method which iterates over the index?
Στις Δευ, 1 Απρ 2019 στις 6:57 μ.μ., ο/η Uwe Schindler <u...@thetaphi.de> έγραψε: > Hi again, > > > The problem with TermRangeQueries is actually not the iteration over the > > term index. The slowness comes from the fact that all terms between start > > and end have to be iterated and their postings be fetched and those > postings > > be merged together. If the "source of terms" for doing this is just a > simple > > linear iteration of all terms from/to or the automaton intersection does > not > > really matter for the query execution. The change to prefer the automaton > > instead of a simple term iteration is just to allow further > optimizations, for > > more info see https://issues.apache.org/jira/browse/LUCENE-5879 > > So the above issue brings no further improvements (currently). So as said > before, a simple enumeration of all all terms and fetching their postings > will be as fast. By using an automaton, the idea here is that the term > dictionary may have some "percalculated" postings list for prefix terms. So > instead of iterating over all terms and merging their postings, the terms > dictionary could return a "virtual term" that contains all documents for a > whole prefix and store the merged postings list in the index file. This was > not yet implemented but is the place to hook into. You could create an > improved BlockTermsDict implementation, that allows to get a PostingsEnum > for a whole prefix of terms. > > Not sure how much of that was already implemented by LUCENE-5879, but it > allows to do this. So here is where you could step in and improve the terms > dictionary! > > Uwe > > > Uwe > > > > ----- > > Uwe Schindler > > Achterdiek 19, D-28357 Bremen > > http://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > > -----Original Message----- > > > From: Robert Muir <rcm...@gmail.com> > > > Sent: Monday, April 1, 2019 6:30 PM > > > To: java-user <java-user@lucene.apache.org> > > > Subject: Re: Clarification regarding BlockTree implementation of > > > IntersectTermsEnum > > > > > > The regular TermsEnum is really designed for walking terms in linear > order. > > > it does have some ability to seek/leapfrog. But this means paths in a > query > > > automaton that match no terms result in a wasted seek and cpu, because > > the > > > api is designed to return the next term after regardless. > > > > > > On the other hand the intersect() is for intersecting two automata: > query > > > and index. Presumably it can also remove more inefficiencies than just > the > > > wasted seeks for complex wildcards and fuzzies and stuff, since it can > > > "see" the whole input as an automaton. so for example it might be able > to > > > work on blocks of terms at a time and so on. > > > > > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis > > <zabe...@gmail.com> > > > wrote: > > > > > > > Yes it is used. > > > > > > > > I think there are simpler and possibly more efficient ways to > implement a > > > > TermRangeQuery and that is why I am looking into this. > > > > But I am also curious to understand what IntersectTermsEnum is > > supposed > > > to > > > > do. > > > > > > > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir < > rcm...@gmail.com> > > > > έγραψε: > > > > > > > > > Is this IntersectTermsEnum really being used for term range query? > > > Seems > > > > > like using a standard TermsEnum, seeking to the start of the > range, then > > > > > calling next until the end would be easier. > > > > > > > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis > > > <zabe...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > I am currently working on improving the performance of range > > queries > > > on > > > > > > strings. I've noticed that using TermRangeQuery with > low-selective > > > > > queries > > > > > > is a very bad idea in terms of performance but I cannot clearly > explain > > > > > why > > > > > > since it seems related with how the IntersectTermsEnum#next > > method > > > is > > > > > > implemented. > > > > > > > > > > > > The Javadoc of the class says that the terms index (the > burst-trie > > > > > > datastructure) is not used by this implementation of TermsEnum. > > > > However, > > > > > > when I see the implementation of the next method I get the > > impression > > > > > that > > > > > > this is not accurate. Aren't we using the trie structure to skip > parts > > > > of > > > > > > the data when the automaton states do not match? > > > > > > > > > > > > Can somebody provide a high-level intutition of what > > > > > > IntersectTermsEnum#next does? Initially, I thought that it is > > > > traversing > > > > > > the whole trie structure (skipping some branches when necessary) > but > > I > > > > > may > > > > > > be wrong. > > > > > > > > > > > > Thanks in advance, > > > > > > Stamatis > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
--------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org