Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Stamatis Zampetakis Thu, 04 Apr 2019 09:37:47 -0700

I am mostly referring to keyset pagination so sorting the field is always
necessary.
For the same reason range queries are inevitable.


Στις Τρί, 2 Απρ 2019 στις 5:08 μ.μ., ο/η Robert Muir <[email protected]>
έγραψε:

> I think instead of using range queries for custom pagination, it would
> be best to simply sort by the field normally (with or without
> searchAfter)
>
> On Tue, Apr 2, 2019 at 7:34 AM Stamatis Zampetakis <[email protected]>
> wrote:
> >
> > Thanks for the explanation regarding IntersectTermsEnum.next(). I
> > understand better now.
> >
> > How about indexing a string field containing the firstname and lastname
> of
> > a person or possibly its address.
> > Values for these fields have variable length and usually they have more
> > than 16 bytes. If my understanding is correct then they should (can) not
> be
> > indexed as points.
> > Range queries on this fields appear mostly when users implement custom
> > pagination (without using searchAfter) on the result set.
> >
> > The alternative that I consider and seems to work well for the use-cases
> I
> > mentioned is using doc values and SortedDocValuesField.newSlowRangeQuery.
> >
> >
> >
> >
> >
> > Στις Τρί, 2 Απρ 2019 στις 3:01 μ.μ., ο/η Robert Muir <[email protected]>
> > έγραψε:
> >
> > > Can you explain a little more about your use-case? I think that's the
> > > biggest problem here for term range query. Pretty much all range
> > > use-cases are converted over to space partitioned data structures
> > > (Points) so its unclear why this query would even be used for anything
> > > serious.
> > >
> > > To answer your question, IntersectTermsEnum.next() doesn't iterate
> > > over the whole index, just the term dictionary. But the lines are
> > > blurred since if there's only one posting for a term, that single
> > > posting is inlined directly into the term dictionary.... and often a
> > > ton of terms fit into this category, so seeing 80% time in the term
> > > dictionary wouldn't surprise me.
> > >
> > > On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis <[email protected]>
> > > wrote:
> > > >
> > > > Thanks Robert and Uwe for your feedback!
> > > >
> > > > I am not sure what you mean that the slowness does not come from the
> > > iteration over the term index.
> > > > I did a small profiling (screenshot attached) of sending repeatedly a
> > > TermRangeQuery that matches almost the whole index and I observe that
> 80%
> > > of the time is spend on IntersectTermsEnum.next() method.
> > > > Isn't this the method which iterates over the index?
> > > >
> > > >
> > > >
> > > > Στις Δευ, 1 Απρ 2019 στις 6:57 μ.μ., ο/η Uwe Schindler <
> [email protected]>
> > > έγραψε:
> > > >>
> > > >> Hi again,
> > > >>
> > > >> > The problem with TermRangeQueries is actually not the iteration
> over
> > > the
> > > >> > term index. The slowness comes from the fact that all terms
> between
> > > start
> > > >> > and end have to be iterated and their postings be fetched and
> those
> > > postings
> > > >> > be merged together. If the "source of terms" for doing this is
> just a
> > > simple
> > > >> > linear iteration of all terms from/to or the automaton
> intersection
> > > does not
> > > >> > really matter for the query execution. The change to prefer the
> > > automaton
> > > >> > instead of a simple term iteration is just to allow further
> > > optimizations, for
> > > >> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
> > > >>
> > > >> So the above issue brings no further improvements (currently). So as
> > > said before, a simple enumeration of all all terms and fetching their
> > > postings will be as fast. By using an automaton, the idea here is that
> the
> > > term dictionary may have some "percalculated" postings list for prefix
> > > terms. So instead of iterating over all terms and merging their
> postings,
> > > the terms dictionary could return a "virtual term" that contains all
> > > documents for a whole prefix and store the merged postings list in the
> > > index file. This was not yet implemented but is the place to hook
> into. You
> > > could create an improved BlockTermsDict implementation, that allows to
> get
> > > a PostingsEnum for a whole prefix of terms.
> > > >>
> > > >> Not sure how much of that was already implemented by LUCENE-5879,
> but
> > > it allows to do this. So here is where you could step in and improve
> the
> > > terms dictionary!
> > > >>
> > > >> Uwe
> > > >>
> > > >> > Uwe
> > > >> >
> > > >> > -----
> > > >> > Uwe Schindler
> > > >> > Achterdiek 19, D-28357 Bremen
> > > >> > http://www.thetaphi.de
> > > >> > eMail: [email protected]
> > > >> >
> > > >> > > -----Original Message-----
> > > >> > > From: Robert Muir <[email protected]>
> > > >> > > Sent: Monday, April 1, 2019 6:30 PM
> > > >> > > To: java-user <[email protected]>
> > > >> > > Subject: Re: Clarification regarding BlockTree implementation of
> > > >> > > IntersectTermsEnum
> > > >> > >
> > > >> > > The regular TermsEnum is really designed for walking terms in
> > > linear order.
> > > >> > > it does have some ability to seek/leapfrog. But this means
> paths in
> > > a query
> > > >> > > automaton that match no terms result in a wasted seek and cpu,
> > > because
> > > >> > the
> > > >> > > api is designed to return the next term after regardless.
> > > >> > >
> > > >> > > On the other hand the intersect() is for intersecting two
> automata:
> > > query
> > > >> > > and index. Presumably it can also remove more inefficiencies
> than
> > > just the
> > > >> > > wasted seeks for complex wildcards and fuzzies and stuff, since
> it
> > > can
> > > >> > > "see" the whole input as an automaton. so for example it might
> be
> > > able to
> > > >> > > work on blocks of terms at a time and so on.
> > > >> > >
> > > >> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
> > > >> > <[email protected]>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Yes it is used.
> > > >> > > >
> > > >> > > > I think there are simpler and possibly more efficient ways to
> > > implement a
> > > >> > > > TermRangeQuery and that is why I am looking into this.
> > > >> > > > But I am also curious to understand what IntersectTermsEnum is
> > > >> > supposed
> > > >> > > to
> > > >> > > > do.
> > > >> > > >
> > > >> > > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <
> > > [email protected]>
> > > >> > > > έγραψε:
> > > >> > > >
> > > >> > > > > Is this IntersectTermsEnum really being used for term range
> > > query?
> > > >> > > Seems
> > > >> > > > > like using a standard TermsEnum, seeking to the start of the
> > > range, then
> > > >> > > > > calling next until the end would be easier.
> > > >> > > > >
> > > >> > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
> > > >> > > <[email protected]>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi all,
> > > >> > > > > >
> > > >> > > > > > I am currently working on improving the performance of
> range
> > > >> > queries
> > > >> > > on
> > > >> > > > > > strings. I've noticed that using TermRangeQuery with
> > > low-selective
> > > >> > > > > queries
> > > >> > > > > > is a very bad idea in terms of performance but I cannot
> > > clearly explain
> > > >> > > > > why
> > > >> > > > > > since it seems related with how the
> IntersectTermsEnum#next
> > > >> > method
> > > >> > > is
> > > >> > > > > > implemented.
> > > >> > > > > >
> > > >> > > > > > The Javadoc of the class says that the terms index (the
> > > burst-trie
> > > >> > > > > > datastructure) is not used by this implementation of
> > > TermsEnum.
> > > >> > > > However,
> > > >> > > > > > when I see the implementation of the next method I get the
> > > >> > impression
> > > >> > > > > that
> > > >> > > > > > this is not accurate. Aren't we using the trie structure
> to
> > > skip parts
> > > >> > > > of
> > > >> > > > > > the data when  the automaton states do not match?
> > > >> > > > > >
> > > >> > > > > > Can somebody provide a high-level intutition of what
> > > >> > > > > > IntersectTermsEnum#next does? Initially, I thought that
> it is
> > > >> > > > traversing
> > > >> > > > > > the whole trie structure (skipping some branches when
> > > necessary) but
> > > >> > I
> > > >> > > > > may
> > > >> > > > > > be wrong.
> > > >> > > > > >
> > > >> > > > > > Thanks in advance,
> > > >> > > > > > Stamatis
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >>
> > > >>
> > > >>
> ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: [email protected]
> > > >> For additional commands, e-mail: [email protected]
> > > >>
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Reply via email to