advanceShallow is indeed faster than advance because it does less: it only advances the cursor for the block-max metadata, which allows reasoning about maximum scores without actually advancing the doc ID. advanceShallow is implicitly called via advance.
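For concreteness, here is a rough (uncompiled) sketch of how the two calls fit together on an ImpactsEnum. The field name "body", the term "lucene", and the isCompetitive(...) threshold check are placeholders, not part of any Lucene API; only level 0 of the impacts is inspected for simplicity:

```java
import java.io.IOException;

import org.apache.lucene.index.Impact;
import org.apache.lucene.index.Impacts;
import org.apache.lucene.index.ImpactsEnum;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

static void scanTerm(LeafReader leafReader) throws IOException {
  Terms terms = leafReader.terms("body");                  // placeholder field
  TermsEnum termsEnum = terms.iterator();
  if (termsEnum.seekExact(new BytesRef("lucene"))) {       // placeholder term
    ImpactsEnum postings = termsEnum.impacts(PostingsEnum.FREQS);
    int doc = postings.nextDoc();
    while (doc != DocIdSetIterator.NO_MORE_DOCS) {
      // Only moves the block-max cursor; the current doc ID does not change.
      postings.advanceShallow(doc);
      Impacts impacts = postings.getImpacts();
      // Level 0 is the finest level; it covers doc IDs up to getDocIdUpTo(0).
      int blockUpTo = impacts.getDocIdUpTo(0);
      int maxFreq = 0;
      for (Impact impact : impacts.getImpacts(0)) {
        maxFreq = Math.max(maxFreq, impact.freq);
      }
      if (isCompetitive(maxFreq)) {                        // placeholder check
        // The block may contain competitive docs: score them as usual.
        doc = postings.nextDoc();
      } else {
        // The whole block's best possible score is too low: skip past it.
        doc = postings.advance(blockUpTo + 1);
      }
    }
  }
}
```

Note that advance() still does the actual doc-ID movement here; advanceShallow()/getImpacts() only provide the evidence that the skip is safe.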
If your optimization rarely helps skip entire blocks, then it's expected that advance doesn't help much over nextDoc. advanceShallow is rarely a drop-in replacement for advance, since it's unable to tell whether a document matches or not; it can only be used to reason about maximum scores for a range of doc IDs when combined with ImpactsSource#getImpacts.

On Mon, Oct 12, 2020 at 5:21 PM Alex K <aklib...@gmail.com> wrote:

> Thanks Adrien. Very helpful.
> The doc for ImpactsSource.advanceShallow says it's more efficient than
> DocIdSetIterator.advance.
> Is that because advanceShallow is skipping entire blocks at a time,
> whereas advance is not?
> One possible optimization I've explored involves skipping pruned doc IDs.
> I tried this using .advance() instead of .nextDoc(), but found the
> improvement was negligible. I'm thinking maybe advanceShallow() would let
> me get that speedup.
> - AK
>
> On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand <jpou...@gmail.com> wrote:
>
> > Hi Alex,
> >
> > The entry point for block-max metadata is TermsEnum#impacts (
> > https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int)
> > ), which returns a view of the postings lists that includes block-max
> > metadata. In particular, see the documentation for
> > ImpactsSource#advanceShallow and ImpactsSource#getImpacts (
> > https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html
> > ).
> >
> > You can look at ImpactsDISI to see how this metadata is leveraged in
> > practice to produce score upper bounds, which are in turn used to skip
> > irrelevant documents.
> >
> > On Mon, Oct 12, 2020 at 2:45 AM Alex K <aklib...@gmail.com> wrote:
> >
> > > Hi all,
> > > There was some fairly recent work in Lucene to introduce Block-Max
> > > WAND scoring (
> > > https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf,
> > > https://issues.apache.org/jira/browse/LUCENE-8135).
> > >
> > > I've been working on a use case where I need very efficient top-k
> > > scoring for 100s of query terms (usually between 300 and 600 terms,
> > > k between 100 and 10000; each term contributes a simple TF-IDF
> > > score). There's some discussion here:
> > > https://github.com/alexklibisz/elastiknn/issues/160.
> > >
> > > Now that block-based metadata is presumably available in Lucene, how
> > > would I access this metadata?
> > >
> > > I've read the WANDScorer.java code, but I couldn't quite understand
> > > how exactly it leverages a block-max codec or block-based statistics.
> > > In my own code, I'm exploring some ways to prune low-quality docs,
> > > and I figured there might be some block-max metadata that I could
> > > access to improve the pruning. I'm iterating over the docs matching
> > > each term using the .advance() and .nextDoc() methods on a
> > > PostingsEnum. I don't see any block-related methods on the
> > > PostingsEnum interface. I feel like I'm missing something...
> > > hopefully something simple!
> > >
> > > I appreciate any tips or examples!
> > >
> > > Thanks,
> > > Alex
> >
> > --
> > Adrien
>

--
Adrien