I see. So I'm most likely rarely skipping a block's worth of docs, so using advance() vs nextDoc() doesn't make much of a difference. All good to know. Thank you.
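[Editor's note: the advance()/nextDoc() contract discussed above can be sketched with a toy iterator. SortedDocsIterator below is a simplified stand-in whose method names mirror Lucene's DocIdSetIterator (advance(target) returns the first doc >= target); it is not the real Lucene class.]

```java
import java.util.Arrays;

// Toy illustration of the nextDoc()/advance() contract on a postings-style
// iterator. The names mimic Lucene's DocIdSetIterator, but this stand-in is
// not the real class: advance(target) returns the first doc ID >= target,
// so pruned doc IDs can be stepped over without visiting each one.
class AdvanceSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  static class SortedDocsIterator {
    private final int[] docs; // sorted doc IDs for one term
    private int pos = -1;

    SortedDocsIterator(int... docs) {
      this.docs = docs;
    }

    int nextDoc() {
      pos++;
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    // First doc >= target; the binary search stands in for what skip data
    // enables in a real postings list.
    int advance(int target) {
      int i = Arrays.binarySearch(docs, Math.max(pos + 1, 0), docs.length, target);
      pos = i >= 0 ? i : -i - 1;
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
  }

  public static void main(String[] args) {
    SortedDocsIterator it = new SortedDocsIterator(2, 5, 9, 14, 30, 31, 47);
    System.out.println(it.advance(10)); // 14: steps over 2, 5, 9 in one call
    System.out.println(it.nextDoc());   // 30
    System.out.println(it.advance(48)); // NO_MORE_DOCS
  }
}
```

When matches are dense, each advance(target) lands on the very next doc anyway, which is why it performs about the same as nextDoc() unless whole blocks can be skipped, as discussed above.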
On Mon, Oct 12, 2020 at 11:42 AM Adrien Grand <jpou...@gmail.com> wrote:

> advanceShallow is indeed faster than advance because it does less:
> advanceShallow only advances the cursor for the block-max metadata, which
> allows reasoning about maximum scores without actually advancing the doc
> ID. advanceShallow is implicitly called via advance.
>
> If your optimization rarely helps skip entire blocks, then it's expected
> that advance doesn't help much over nextDoc. advanceShallow is rarely a
> drop-in replacement for advance, since it's unable to tell whether a
> document matches or not; it can only be used to reason about maximum
> scores for a range of doc IDs when combined with ImpactsSource#getImpacts.
>
> On Mon, Oct 12, 2020 at 5:21 PM Alex K <aklib...@gmail.com> wrote:
>
> > Thanks Adrien. Very helpful.
> > The doc for ImpactsSource#advanceShallow says it's more efficient than
> > DocIdSetIterator#advance.
> > Is that because advanceShallow skips entire blocks at a time, whereas
> > advance does not?
> > One possible optimization I've explored involves skipping pruned doc
> > IDs. I tried this using advance() instead of nextDoc(), but found the
> > improvement was negligible. I'm thinking maybe advanceShallow() would
> > let me get that speedup.
> > - AK
> >
> > On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > > Hi Alex,
> > >
> > > The entry point for block-max metadata is TermsEnum#impacts
> > > (https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int)),
> > > which returns a view of the postings lists that includes block-max
> > > metadata. In particular, see the documentation for
> > > ImpactsSource#advanceShallow and ImpactsSource#getImpacts
> > > (https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html).
> > > You can look at ImpactsDISI to see how this metadata is leveraged in
> > > practice to turn it into score upper bounds, which are in turn used
> > > to skip irrelevant documents.
> > >
> > > On Mon, Oct 12, 2020 at 2:45 AM Alex K <aklib...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > > There was some fairly recent work in Lucene to introduce Block-Max
> > > > WAND scoring (
> > > > https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf,
> > > > https://issues.apache.org/jira/browse/LUCENE-8135).
> > > >
> > > > I've been working on a use case where I need very efficient top-k
> > > > scoring for 100s of query terms (usually between 300 and 600 terms,
> > > > k between 100 and 10000; each term contributes a simple TF-IDF
> > > > score). There's some discussion here:
> > > > https://github.com/alexklibisz/elastiknn/issues/160.
> > > >
> > > > Now that block-based metadata is presumably available in Lucene,
> > > > how would I access this metadata?
> > > >
> > > > I've read the WANDScorer.java code, but I couldn't quite understand
> > > > how exactly it leverages a block-max codec or block-based
> > > > statistics. In my own code, I'm exploring some ways to prune
> > > > low-quality docs, and I figured there might be some block-max
> > > > metadata I could access to improve the pruning. I'm iterating over
> > > > the docs matching each term using the advance() and nextDoc()
> > > > methods on a PostingsEnum. I don't see any block-related methods on
> > > > the PostingsEnum interface. I feel like I'm missing something...
> > > > hopefully something simple!
> > > >
> > > > I appreciate any tips or examples!
> > > >
> > > > Thanks,
> > > > Alex
> > >
> > > --
> > > Adrien
>
> --
> Adrien
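[Editor's note: the advanceShallow/getImpacts pattern described in the thread can be sketched with simplified stand-in types. Impact, Block, and the TF-only score bound below merely mirror the shape of Lucene's Impacts API (Impact's freq/norm pair, Impacts#getDocIdUpTo); they are illustrative assumptions, not the real Lucene classes.]

```java
import java.util.List;

// Self-contained sketch of the block-max skipping idea: consult per-block
// max impacts (cheap, like advanceShallow + getImpacts) to find the first
// doc-ID range whose best possible score could still be competitive, before
// doing any real per-document work.
class BlockMaxSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // A (freq, norm) pair, shaped like org.apache.lucene.index.Impact.
  record Impact(int freq, long norm) {}

  // One block of postings with its max impacts; docIdUpTo is the last doc ID
  // the block covers, like Impacts#getDocIdUpTo for one level.
  record Block(int docIdUpTo, List<Impact> impacts) {}

  // Upper-bound score for a block. This TF-only bound is a toy assumption;
  // a real scorer would plug each (freq, norm) pair into SimScorer#score.
  static float maxScore(Block block) {
    float max = 0f;
    for (Impact i : block.impacts()) {
      max = Math.max(max, i.freq());
    }
    return max;
  }

  // Given blocks sorted by docIdUpTo, return the first target doc ID at or
  // after `target` whose block might still beat `minCompetitive`.
  static int firstCompetitiveTarget(List<Block> blocks, int target, float minCompetitive) {
    for (Block block : blocks) {
      if (block.docIdUpTo() < target) continue;          // block ends before target
      if (maxScore(block) >= minCompetitive) return target; // must check its docs
      if (block.docIdUpTo() == NO_MORE_DOCS) break;      // nothing after this block
      target = block.docIdUpTo() + 1;                    // skip the whole block
    }
    return NO_MORE_DOCS;
  }

  public static void main(String[] args) {
    List<Block> blocks = List.of(
        new Block(127, List.of(new Impact(2, 1))),          // max freq 2
        new Block(255, List.of(new Impact(9, 1))),          // max freq 9
        new Block(NO_MORE_DOCS, List.of(new Impact(1, 1))));
    // With a threshold of 5, docs 0..127 are skipped without decoding them.
    System.out.println(firstCompetitiveTarget(blocks, 0, 5f)); // prints 128
  }
}
```

In the real API the iteration and the bound both live behind ImpactsDISI, which maintains the minimum competitive score and only calls the expensive advance when a block survives this check.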