RE: Payloads and TrieRangeQuery

Uwe Schindler Wed, 10 Jun 2009 12:07:49 -0700

> Ooh that sounds compelling!
> 
> So you would not need to use payloads for the "inside" brackets,
> right?  Only for the edges?


Exactly.

> I wonder how performance would compare.  Without payloads, there are
> many more terms (for the tiny ranges) in the index, and your OR query
> will have lots of these tiny terms.  But then these tiny terms don't
> hit many docs, and with BooleanScorer (which we should switch to for
> OR queries) ought not be very costly. 

That ist true. The main idea was to limit also seeking during the query.
When splitting the range, you need to often start new TermEnums and iterate
over lot of term. By catching many docs with less terms, you only need to
scan forward in the payloads.

> Vs w/ payloads having to use
> TermPositions, having to load, decode & check the payload, and I guess
> assuming on average that 1/2 the docs are filtered out.

Maybe decoding the payload is not needed, I would encode the bounds as
byte[] and compare the arrays. But you would filter about half of the docs
out.

My problem with all this is how to optimize after which shift value to
switch between terms and payloads. And this information about the trie
structure and where payloads are should be stored in FieldInfos.

As we now search on each segment separately, this information can be stored
per segment and also used for each per-segment Filter/Scorer.

The whole thing works out of the box with TrieRangeFilter (its just
iterating over terms, getting TermDocs/TermPositions and setting bits, when
payloads available after checking these), for TrieRangeQuery using
BooleanQuery it is more complicated (MTQ cannot simply add the terms from
the FilteredTermEnum to a BooleanQuery).

Until now I had no time to think about it in detail, but with maybe the
possibility to have TrieRange in Core and store trie-specific FieldInfos per
segment, I will get clearer how to manage this in the API.

Uwe

> On Wed, Jun 10, 2009 at 2:28 PM, Uwe Schindler<u...@thetaphi.de> wrote:
> > Hi, sorry I missed the first mail.
> >
> >
> >
> > The idea we discussed in Amsterdam during ApacheCon was:
> >
> >
> >
> > Instead of indexing all trie precisions from e.g. the leftmost 8 bits
> downto
> > all 64 bits, the TrieTokenStream only creates terms from e.g. precisions
> 8
> > to 56. The last precision is left out. Instead the last term (precision
> 56)
> > contains the highest precision as payload.
> >
> > On the query side, TrieRangeQuery would create the filter bitmap as
> before
> > until it reaches the lowest available precision with the payloads.
> Instead
> > of further splitting this precision into terms, all TermPositions
> instead of
> > just TermDocs are listed, but only those set in the result BitSet, that
> have
> > the payload inside the range bounds. By this the trie query first
> selects
> > large ranges in the middle like before, but uses the highest (but not
> full
> > precision term) to select more docids than needed but filters them with
> the
> > payload.
> >
> >
> >
> > With String Dates (the simplified example Michael Busch shows in his
> talk):
> >
> > Searching all docs from 2005-11-10 to 2008-03-11 with current trierange
> > variant would select terms 2005-11-10 to 2005-11-30, then the whole
> > December, the whole years 2006 and 2007 and so on. With payloads,
> trierange
> > would select only whole months (November, December, 2006, 2007, Jan,
> Feb,
> > Mar). At the ends the payloads are used to filter out the days in Nov
> 2005
> > and Mar 2008.
> >
> >
> >
> > With the latest TrieRange impl this would be possible to implement
> (because
> > the TrieTokenStreams now used for indexing could create the payloads).
> Only
> > the searching side would no longer so simple implemented as yet. My
> > biggest problem is how to configure this optimal and make the API clean.
> >
> >
> >
> > Was it understandable? (Its complicated, I know)
> >
> >
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > ________________________________
> >
> > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
> > Sent: Wednesday, June 10, 2009 7:59 PM
> > To: java-dev@lucene.apache.org
> > Subject: Re: Payloads and TrieRangeQuery
> >
> >
> >
> > I think instead of ORing postings (trie range, rangequery, etc), have a
> > custom Query + Scorer that examines the payload (somehow)?  It could
> encode
> > the multiple levels of trie bits in it?  (I'm just guessing here).
> >
> > On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >
> > Use them how?  (Sounds interesting...).
> >
> > Mike
> >
> > On Tue, Jun 9, 2009 at 10:32 PM, Jason
> > Rutherglen<jason.rutherg...@gmail.com> wrote:
> >> At the SF Lucene User's group, Michael Busch mentioned using
> >> payloads with TrieRangeQueries. Is this something that's being
> >> worked on? I'm interested in what sort performance benefits
> >> there would be to this method?
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

RE: Payloads and TrieRangeQuery

Reply via email to