Thanks Adrien, very helpful. The docs for ImpactsSource.advanceShallow say it's more efficient than DocIdSetIterator.advance. Is that because advanceShallow skips entire blocks at a time, whereas advance does not? One optimization I've explored is skipping pruned doc IDs: I tried using .advance() instead of .nextDoc(), but the improvement was negligible. I'm hoping advanceShallow() might give me that speedup.
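In case it makes the question more concrete, here's a rough, untested sketch of what I have in mind (against Lucene 8.x; the minFreq threshold and the method name are just stand-ins for my pruning logic, and I only look at level 0 of the impacts):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.Impact;
import org.apache.lucene.index.Impacts;
import org.apache.lucene.index.ImpactsEnum;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;

public class ShallowSkipSketch {

  /**
   * Iterates the postings for the term currently positioned on termsEnum, but
   * consults block-max metadata (advanceShallow/getImpacts) to skip whole
   * blocks whose maximum term frequency is below minFreq. minFreq stands in
   * for my actual pruning threshold.
   */
  static void scoreWithShallowSkips(TermsEnum termsEnum, int minFreq) throws IOException {
    // ImpactsEnum is a PostingsEnum that also exposes block-max metadata.
    ImpactsEnum postings = termsEnum.impacts(PostingsEnum.FREQS);

    int doc = postings.nextDoc();
    while (doc != DocIdSetIterator.NO_MORE_DOCS) {
      // Cheap: only moves the metadata (block) pointers, not the doc iterator.
      postings.advanceShallow(doc);
      Impacts impacts = postings.getImpacts();

      // Level 0 is the most fine-grained level; getDocIdUpTo(0) is the last
      // doc ID covered by the current block of metadata.
      int blockUpTo = impacts.getDocIdUpTo(0);
      int maxFreqInBlock = 0;
      List<Impact> blockImpacts = impacts.getImpacts(0);
      for (Impact impact : blockImpacts) {
        maxFreqInBlock = Math.max(maxFreqInBlock, impact.freq);
      }

      if (maxFreqInBlock < minFreq) {
        // No doc in this block can pass the pruning threshold: jump past it.
        doc = blockUpTo == DocIdSetIterator.NO_MORE_DOCS
            ? DocIdSetIterator.NO_MORE_DOCS  // last block: nothing left to visit
            : postings.advance(blockUpTo + 1);
      } else {
        // The block might contain competitive docs: visit them one by one.
        while (doc != DocIdSetIterator.NO_MORE_DOCS && doc <= blockUpTo) {
          // ... score postings.docID() using postings.freq() ...
          doc = postings.nextDoc();
        }
      }
    }
  }
}

Does that look like a sensible way to use advanceShallow()/getImpacts(), or am I misreading the contract?

- AK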
On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand <jpou...@gmail.com> wrote:

> Hi Alex,
>
> The entry point for block-max metadata is TermsEnum#impacts
> (https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int)),
> which returns a view of the postings lists that includes block-max
> metadata. In particular, see the documentation for ImpactsSource#advanceShallow
> and ImpactsSource#getImpacts
> (https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html).
>
> You can look at ImpactsDISI to see how this metadata is leveraged in
> practice to compute score upper bounds, which are in turn used to skip
> irrelevant documents.
>
> On Mon, Oct 12, 2020 at 2:45 AM Alex K <aklib...@gmail.com> wrote:
>
> > Hi all,
> >
> > There was some fairly recent work in Lucene to introduce Block-Max WAND
> > scoring
> > (https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf,
> > https://issues.apache.org/jira/browse/LUCENE-8135).
> >
> > I've been working on a use case where I need very efficient top-k scoring
> > for hundreds of query terms (usually between 300 and 600 terms, k between
> > 100 and 10000, and each term contributes a simple TF-IDF score). There's
> > some discussion here: https://github.com/alexklibisz/elastiknn/issues/160.
> >
> > Now that block-based metadata is presumably available in Lucene, how
> > would I access it?
> >
> > I've read the WANDScorer.java code, but I couldn't quite understand how
> > exactly it leverages a block-max codec or block-based statistics. In my
> > own code, I'm exploring some ways to prune low-quality docs, and I figured
> > there might be some block-max metadata I could access to improve the
> > pruning. I'm iterating over the docs matching each term using the
> > .advance() and .nextDoc() methods on a PostingsEnum, but I don't see any
> > block-related methods on the PostingsEnum interface. I feel like I'm
> > missing something... hopefully something simple!
> >
> > I appreciate any tips or examples!
> >
> > Thanks,
> > Alex
>
> --
> Adrien