Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Stamatis Zampetakis Tue, 02 Apr 2019 05:04:21 -0700

Thanks Robert and Uwe for your feedback!

I am not sure what you mean that the slowness does not come from the
iteration over the term index.
I did a small profiling (screenshot attached) of sending repeatedly a
TermRangeQuery that matches almost the whole index and I observe that 80%
of the time is spend on IntersectTermsEnum.next() method.
Isn't this the method which iterates over the index?




Στις Δευ, 1 Απρ 2019 στις 6:57 μ.μ., ο/η Uwe Schindler <u...@thetaphi.de>
έγραψε:

> Hi again,
>
> > The problem with TermRangeQueries is actually not the iteration over the
> > term index. The slowness comes from the fact that all terms between start
> > and end have to be iterated and their postings be fetched and those
> postings
> > be merged together. If the "source of terms" for doing this is just a
> simple
> > linear iteration of all terms from/to or the automaton intersection does
> not
> > really matter for the query execution. The change to prefer the automaton
> > instead of a simple term iteration is just to allow further
> optimizations, for
> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
>
> So the above issue brings no further improvements (currently). So as said
> before, a simple enumeration of all all terms and fetching their postings
> will be as fast. By using an automaton, the idea here is that the term
> dictionary may have some "percalculated" postings list for prefix terms. So
> instead of iterating over all terms and merging their postings, the terms
> dictionary could return a "virtual term" that contains all documents for a
> whole prefix and store the merged postings list in the index file. This was
> not yet implemented but is the place to hook into. You could create an
> improved BlockTermsDict implementation, that allows to get a PostingsEnum
> for a whole prefix of terms.
>
> Not sure how much of that was already implemented by LUCENE-5879, but it
> allows to do this. So here is where you could step in and improve the terms
> dictionary!
>
> Uwe
>
> > Uwe
> >
> > -----
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > > -----Original Message-----
> > > From: Robert Muir <rcm...@gmail.com>
> > > Sent: Monday, April 1, 2019 6:30 PM
> > > To: java-user <java-user@lucene.apache.org>
> > > Subject: Re: Clarification regarding BlockTree implementation of
> > > IntersectTermsEnum
> > >
> > > The regular TermsEnum is really designed for walking terms in linear
> order.
> > > it does have some ability to seek/leapfrog. But this means paths in a
> query
> > > automaton that match no terms result in a wasted seek and cpu, because
> > the
> > > api is designed to return the next term after regardless.
> > >
> > > On the other hand the intersect() is for intersecting two automata:
> query
> > > and index. Presumably it can also remove more inefficiencies than just
> the
> > > wasted seeks for complex wildcards and fuzzies and stuff, since it can
> > > "see" the whole input as an automaton. so for example it might be able
> to
> > > work on blocks of terms at a time and so on.
> > >
> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
> > <zabe...@gmail.com>
> > > wrote:
> > >
> > > > Yes it is used.
> > > >
> > > > I think there are simpler and possibly more efficient ways to
> implement a
> > > > TermRangeQuery and that is why I am looking into this.
> > > > But I am also curious to understand what IntersectTermsEnum is
> > supposed
> > > to
> > > > do.
> > > >
> > > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <
> rcm...@gmail.com>
> > > > έγραψε:
> > > >
> > > > > Is this IntersectTermsEnum really being used for term range query?
> > > Seems
> > > > > like using a standard TermsEnum, seeking to the start of the
> range, then
> > > > > calling next until the end would be easier.
> > > > >
> > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
> > > <zabe...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I am currently working on improving the performance of range
> > queries
> > > on
> > > > > > strings. I've noticed that using TermRangeQuery with
> low-selective
> > > > > queries
> > > > > > is a very bad idea in terms of performance but I cannot clearly
> explain
> > > > > why
> > > > > > since it seems related with how the IntersectTermsEnum#next
> > method
> > > is
> > > > > > implemented.
> > > > > >
> > > > > > The Javadoc of the class says that the terms index (the
> burst-trie
> > > > > > datastructure) is not used by this implementation of TermsEnum.
> > > > However,
> > > > > > when I see the implementation of the next method I get the
> > impression
> > > > > that
> > > > > > this is not accurate. Aren't we using the trie structure to skip
> parts
> > > > of
> > > > > > the data when  the automaton states do not match?
> > > > > >
> > > > > > Can somebody provide a high-level intutition of what
> > > > > > IntersectTermsEnum#next does? Initially, I thought that it is
> > > > traversing
> > > > > > the whole trie structure (skipping some branches when necessary)
> but
> > I
> > > > > may
> > > > > > be wrong.
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Stamatis
> > > > > >
> > > > >
> > > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Reply via email to