I see. So I'm most likely rarely skipping a block's worth of docs, so using advance() vs nextDoc() doesn't make much of a difference. All good to know. Thank you.
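[Editor's note: the advance()/nextDoc() contract discussed above can be sketched with a toy iterator. SortedDocsIterator below is a simplified stand-in whose method names mirror Lucene's DocIdSetIterator (advance(target) returns the first doc >= target); it is not the real Lucene class.]

```java
import java.util.Arrays;

// Toy illustration of the nextDoc()/advance() contract on a postings-style
// iterator. The names mimic Lucene's DocIdSetIterator, but this stand-in is
// not the real class: advance(target) returns the first doc ID >= target,
// so pruned doc IDs can be stepped over without visiting each one.
class AdvanceSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  static class SortedDocsIterator {
    private final int[] docs; // sorted doc IDs for one term
    private int pos = -1;

    SortedDocsIterator(int... docs) {
      this.docs = docs;
    }

    int nextDoc() {
      pos++;
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    // First doc >= target; the binary search stands in for what skip data
    // enables in a real postings list.
    int advance(int target) {
      int i = Arrays.binarySearch(docs, Math.max(pos + 1, 0), docs.length, target);
      pos = i >= 0 ? i : -i - 1;
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
  }

  public static void main(String[] args) {
    SortedDocsIterator it = new SortedDocsIterator(2, 5, 9, 14, 30, 31, 47);
    System.out.println(it.advance(10)); // 14: steps over 2, 5, 9 in one call
    System.out.println(it.nextDoc());   // 30
    System.out.println(it.advance(48)); // NO_MORE_DOCS
  }
}
```

When matches are dense, each advance(target) lands on the very next doc anyway, which is why it performs about the same as nextDoc() unless whole blocks can be skipped, as discussed above.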
On Mon, Oct 12, 2020 at 11:42 AM Adrien Grand <jpou...@gmail.com> wrote:

> advanceShallow is indeed faster than advance because it does less:
> advanceShallow only advances the cursor for the block-max metadata, which
> allows reasoning about maximum scores without actually advancing the doc
> ID. advanceShallow is implicitly called via advance.
>
> If your optimization rarely helps skip entire blocks, then it's expected
> that advance doesn't help much over nextDoc. advanceShallow is rarely a
> drop-in replacement for advance, since it's unable to tell whether a
> document matches or not; it can only be used to reason about maximum
> scores for a range of doc IDs when combined with ImpactsSource#getImpacts.
>
> On Mon, Oct 12, 2020 at 5:21 PM Alex K <aklib...@gmail.com> wrote:
>
> > Thanks Adrien. Very helpful.
> > The doc for ImpactsSource#advanceShallow says it's more efficient than
> > DocIdSetIterator#advance.
> > Is that because advanceShallow skips entire blocks at a time, whereas
> > advance does not?
> > One possible optimization I've explored involves skipping pruned doc
> > IDs. I tried this using advance() instead of nextDoc(), but found the
> > improvement was negligible. I'm thinking maybe advanceShallow() would
> > let me get that speedup.
> > - AK
> >
> > On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > > Hi Alex,
> > >
> > > The entry point for block-max metadata is TermsEnum#impacts
> > > (https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int)),
> > > which returns a view of the postings lists that includes block-max
> > > metadata. In particular, see the documentation for
> > > ImpactsSource#advanceShallow and ImpactsSource#getImpacts
> > > (https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html).
> > > You can look at ImpactsDISI to see how this metadata is leveraged in
> > > practice to turn it into score upper bounds, which are in turn used
> > > to skip irrelevant documents.
> > >
> > > On Mon, Oct 12, 2020 at 2:45 AM Alex K <aklib...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > > There was some fairly recent work in Lucene to introduce Block-Max
> > > > WAND scoring (
> > > > https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf,
> > > > https://issues.apache.org/jira/browse/LUCENE-8135).
> > > >
> > > > I've been working on a use case where I need very efficient top-k
> > > > scoring for 100s of query terms (usually between 300 and 600 terms,
> > > > k between 100 and 10000; each term contributes a simple TF-IDF
> > > > score). There's some discussion here:
> > > > https://github.com/alexklibisz/elastiknn/issues/160.
> > > >
> > > > Now that block-based metadata is presumably available in Lucene,
> > > > how would I access this metadata?
> > > >
> > > > I've read the WANDScorer.java code, but I couldn't quite understand
> > > > how exactly it leverages a block-max codec or block-based
> > > > statistics. In my own code, I'm exploring some ways to prune
> > > > low-quality docs, and I figured there might be some block-max
> > > > metadata I could access to improve the pruning. I'm iterating over
> > > > the docs matching each term using the advance() and nextDoc()
> > > > methods on a PostingsEnum. I don't see any block-related methods on
> > > > the PostingsEnum interface. I feel like I'm missing something...
> > > > hopefully something simple!
> > > >
> > > > I appreciate any tips or examples!
> > > >
> > > > Thanks,
> > > > Alex
> > >
> > > --
> > > Adrien
>
> --
> Adrien
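[Editor's note: the advanceShallow/getImpacts pattern described in the thread can be sketched with simplified stand-in types. Impact, Block, and the TF-only score bound below merely mirror the shape of Lucene's Impacts API (Impact's freq/norm pair, Impacts#getDocIdUpTo); they are illustrative assumptions, not the real Lucene classes.]

```java
import java.util.List;

// Self-contained sketch of the block-max skipping idea: consult per-block
// max impacts (cheap, like advanceShallow + getImpacts) to find the first
// doc-ID range whose best possible score could still be competitive, before
// doing any real per-document work.
class BlockMaxSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // A (freq, norm) pair, shaped like org.apache.lucene.index.Impact.
  record Impact(int freq, long norm) {}

  // One block of postings with its max impacts; docIdUpTo is the last doc ID
  // the block covers, like Impacts#getDocIdUpTo for one level.
  record Block(int docIdUpTo, List<Impact> impacts) {}

  // Upper-bound score for a block. This TF-only bound is a toy assumption;
  // a real scorer would plug each (freq, norm) pair into SimScorer#score.
  static float maxScore(Block block) {
    float max = 0f;
    for (Impact i : block.impacts()) {
      max = Math.max(max, i.freq());
    }
    return max;
  }

  // Given blocks sorted by docIdUpTo, return the first target doc ID at or
  // after `target` whose block might still beat `minCompetitive`.
  static int firstCompetitiveTarget(List<Block> blocks, int target, float minCompetitive) {
    for (Block block : blocks) {
      if (block.docIdUpTo() < target) continue;          // block ends before target
      if (maxScore(block) >= minCompetitive) return target; // must check its docs
      if (block.docIdUpTo() == NO_MORE_DOCS) break;      // nothing after this block
      target = block.docIdUpTo() + 1;                    // skip the whole block
    }
    return NO_MORE_DOCS;
  }

  public static void main(String[] args) {
    List<Block> blocks = List.of(
        new Block(127, List.of(new Impact(2, 1))),          // max freq 2
        new Block(255, List.of(new Impact(9, 1))),          // max freq 9
        new Block(NO_MORE_DOCS, List.of(new Impact(1, 1))));
    // With a threshold of 5, docs 0..127 are skipped without decoding them.
    System.out.println(firstCompetitiveTarget(blocks, 0, 5f)); // prints 128
  }
}
```

In the real API the iteration and the bound both live behind ImpactsDISI, which maintains the minimum competitive score and only calls the expensive advance when a block survives this check.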