I am mostly referring to keyset pagination so sorting the field is always necessary. For the same reason range queries are inevitable.
Στις Τρί, 2 Απρ 2019 στις 5:08 μ.μ., ο/η Robert Muir <rcm...@gmail.com> έγραψε: > I think instead of using range queries for custom pagination, it would > be best to simply sort by the field normally (with or without > searchAfter) > > On Tue, Apr 2, 2019 at 7:34 AM Stamatis Zampetakis <zabe...@gmail.com> > wrote: > > > > Thanks for the explanation regarding IntersectTermsEnum.next(). I > > understand better now. > > > > How about indexing a string field containing the firstname and lastname > of > > a person or possibly its address. > > Values for these fields have variable length and usually they have more > > than 16 bytes. If my understanding is correct then they should (can) not > be > > indexed as points. > > Range queries on this fields appear mostly when users implement custom > > pagination (without using searchAfter) on the result set. > > > > The alternative that I consider and seems to work well for the use-cases > I > > mentioned is using doc values and SortedDocValuesField.newSlowRangeQuery. > > > > > > > > > > > > Στις Τρί, 2 Απρ 2019 στις 3:01 μ.μ., ο/η Robert Muir <rcm...@gmail.com> > > έγραψε: > > > > > Can you explain a little more about your use-case? I think that's the > > > biggest problem here for term range query. Pretty much all range > > > use-cases are converted over to space partitioned data structures > > > (Points) so its unclear why this query would even be used for anything > > > serious. > > > > > > To answer your question, IntersectTermsEnum.next() doesn't iterate > > > over the whole index, just the term dictionary. But the lines are > > > blurred since if there's only one posting for a term, that single > > > posting is inlined directly into the term dictionary.... and often a > > > ton of terms fit into this category, so seeing 80% time in the term > > > dictionary wouldn't surprise me. > > > > > > On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis <zabe...@gmail.com> > > > wrote: > > > > > > > > Thanks Robert and Uwe for your feedback! > > > > > > > > I am not sure what you mean that the slowness does not come from the > > > iteration over the term index. > > > > I did a small profiling (screenshot attached) of sending repeatedly a > > > TermRangeQuery that matches almost the whole index and I observe that > 80% > > > of the time is spend on IntersectTermsEnum.next() method. > > > > Isn't this the method which iterates over the index? > > > > > > > > > > > > > > > > Στις Δευ, 1 Απρ 2019 στις 6:57 μ.μ., ο/η Uwe Schindler < > u...@thetaphi.de> > > > έγραψε: > > > >> > > > >> Hi again, > > > >> > > > >> > The problem with TermRangeQueries is actually not the iteration > over > > > the > > > >> > term index. The slowness comes from the fact that all terms > between > > > start > > > >> > and end have to be iterated and their postings be fetched and > those > > > postings > > > >> > be merged together. If the "source of terms" for doing this is > just a > > > simple > > > >> > linear iteration of all terms from/to or the automaton > intersection > > > does not > > > >> > really matter for the query execution. The change to prefer the > > > automaton > > > >> > instead of a simple term iteration is just to allow further > > > optimizations, for > > > >> > more info see https://issues.apache.org/jira/browse/LUCENE-5879 > > > >> > > > >> So the above issue brings no further improvements (currently). So as > > > said before, a simple enumeration of all all terms and fetching their > > > postings will be as fast. By using an automaton, the idea here is that > the > > > term dictionary may have some "percalculated" postings list for prefix > > > terms. So instead of iterating over all terms and merging their > postings, > > > the terms dictionary could return a "virtual term" that contains all > > > documents for a whole prefix and store the merged postings list in the > > > index file. This was not yet implemented but is the place to hook > into. You > > > could create an improved BlockTermsDict implementation, that allows to > get > > > a PostingsEnum for a whole prefix of terms. > > > >> > > > >> Not sure how much of that was already implemented by LUCENE-5879, > but > > > it allows to do this. So here is where you could step in and improve > the > > > terms dictionary! > > > >> > > > >> Uwe > > > >> > > > >> > Uwe > > > >> > > > > >> > ----- > > > >> > Uwe Schindler > > > >> > Achterdiek 19, D-28357 Bremen > > > >> > http://www.thetaphi.de > > > >> > eMail: u...@thetaphi.de > > > >> > > > > >> > > -----Original Message----- > > > >> > > From: Robert Muir <rcm...@gmail.com> > > > >> > > Sent: Monday, April 1, 2019 6:30 PM > > > >> > > To: java-user <java-user@lucene.apache.org> > > > >> > > Subject: Re: Clarification regarding BlockTree implementation of > > > >> > > IntersectTermsEnum > > > >> > > > > > >> > > The regular TermsEnum is really designed for walking terms in > > > linear order. > > > >> > > it does have some ability to seek/leapfrog. But this means > paths in > > > a query > > > >> > > automaton that match no terms result in a wasted seek and cpu, > > > because > > > >> > the > > > >> > > api is designed to return the next term after regardless. > > > >> > > > > > >> > > On the other hand the intersect() is for intersecting two > automata: > > > query > > > >> > > and index. Presumably it can also remove more inefficiencies > than > > > just the > > > >> > > wasted seeks for complex wildcards and fuzzies and stuff, since > it > > > can > > > >> > > "see" the whole input as an automaton. so for example it might > be > > > able to > > > >> > > work on blocks of terms at a time and so on. > > > >> > > > > > >> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis > > > >> > <zabe...@gmail.com> > > > >> > > wrote: > > > >> > > > > > >> > > > Yes it is used. > > > >> > > > > > > >> > > > I think there are simpler and possibly more efficient ways to > > > implement a > > > >> > > > TermRangeQuery and that is why I am looking into this. > > > >> > > > But I am also curious to understand what IntersectTermsEnum is > > > >> > supposed > > > >> > > to > > > >> > > > do. > > > >> > > > > > > >> > > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir < > > > rcm...@gmail.com> > > > >> > > > έγραψε: > > > >> > > > > > > >> > > > > Is this IntersectTermsEnum really being used for term range > > > query? > > > >> > > Seems > > > >> > > > > like using a standard TermsEnum, seeking to the start of the > > > range, then > > > >> > > > > calling next until the end would be easier. > > > >> > > > > > > > >> > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis > > > >> > > <zabe...@gmail.com> > > > >> > > > > wrote: > > > >> > > > > > > > >> > > > > > Hi all, > > > >> > > > > > > > > >> > > > > > I am currently working on improving the performance of > range > > > >> > queries > > > >> > > on > > > >> > > > > > strings. I've noticed that using TermRangeQuery with > > > low-selective > > > >> > > > > queries > > > >> > > > > > is a very bad idea in terms of performance but I cannot > > > clearly explain > > > >> > > > > why > > > >> > > > > > since it seems related with how the > IntersectTermsEnum#next > > > >> > method > > > >> > > is > > > >> > > > > > implemented. > > > >> > > > > > > > > >> > > > > > The Javadoc of the class says that the terms index (the > > > burst-trie > > > >> > > > > > datastructure) is not used by this implementation of > > > TermsEnum. > > > >> > > > However, > > > >> > > > > > when I see the implementation of the next method I get the > > > >> > impression > > > >> > > > > that > > > >> > > > > > this is not accurate. Aren't we using the trie structure > to > > > skip parts > > > >> > > > of > > > >> > > > > > the data when the automaton states do not match? > > > >> > > > > > > > > >> > > > > > Can somebody provide a high-level intutition of what > > > >> > > > > > IntersectTermsEnum#next does? Initially, I thought that > it is > > > >> > > > traversing > > > >> > > > > > the whole trie structure (skipping some branches when > > > necessary) but > > > >> > I > > > >> > > > > may > > > >> > > > > > be wrong. > > > >> > > > > > > > > >> > > > > > Thanks in advance, > > > >> > > > > > Stamatis > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > >> > > > >> > --------------------------------------------------------------------- > > > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > >> For additional commands, e-mail: java-user-h...@lucene.apache.org > > > >> > > > > > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >