Hey,

Full disclosure -- I sit at a desk next to Aniketh, so we chatted about this one in the real world.
Our current working theory is as follows:

We're using the TOP_SCORES score mode on both the old and new code paths.

On the old code path, we were returning a BlockImpactsDocsEnum, even though we didn't have frequencies. I'm guessing that the impacts were constant across all blocks, since we have no frequencies and no norms. It was fast because, once the collector had filled its priority queue, we'd check the (constant) impacts to find the first block that's strictly better than the min competitive score. Since all scores are equal, that would quickly skip to the end.

On the new code path, we always return the SlowImpactsEnum, since we don't have frequencies. Once we fill up the collector's priority queue, we're not able to do any impact checks to find a competitive block, so we iterate through all the doc IDs for the term.

The best solution would probably be to flip the score mode from TOP_SCORES to TOP_DOCS and avoid looking at impacts altogether, since we don't have frequencies anyway. Then the early-termination logic will kick in, which is arguably even better than the previous fast-skipping logic. We can address this in OpenSearch by just wrapping things in a ConstantScoreQuery if frequencies are not enabled for the field (a rough sketch of that wrapping follows below the quoted message).

I'm wondering if it would make sense to push a scoreMode override into TermQuery's createWeight. If the field doesn't have frequencies, would it make sense to do a scoreMode override similar to ConstantScoreQuery's (i.e. COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)?

Thanks,
Froh

On Wed, Apr 2, 2025 at 11:29 AM ANIKETH JAIN <checkanik...@gmail.com> wrote:

> Hey folks,
>
> While investigating a regression between OpenSearch 2.17.1 (Lucene 9.11.1) and 2.18.0 (Lucene 9.12.0) for a simple Term Query over the process.name field in the Big5 workload, I noticed that the new Lucene912PostingsReader creates the ImpactsEnum by wrapping SlowImpactsEnum over postings when a field only has IndexOptions.DOCS:
>
> curl -X POST "http://localhost:9200/big5/_search" -H "Content-Type: application/json" -d '{ "query": { "term": { "process.name": "kernel" } } }'
>
> Lucene912PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo, BlockTermState state, int flags) has an extra check on *indexHasFreqs*:
>
> if (state.docFreq >= BLOCK_SIZE
>     && indexHasFreqs
>     && (indexHasPositions == false
>         || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == false)) {
>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
> }
>
> Whereas Lucene99PostingsReader creates the faster BlockImpactsDocsEnum for fields with IndexOptions.DOCS and only creates the SlowImpactsEnum when the document frequency is at most 128 (the block size):
>
> Lucene99PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo, BlockTermState state, int flags)
>
> if (state.docFreq <= BLOCK_SIZE) {
>   // no skip data
>   return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
> }
>
> if (indexHasPositions == false
>     || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == false) {
>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
> }
>
> Since Lucene 9.12.0 wraps a SlowImpactsEnum, whose advanceShallow method is a no-op, the Term Query is never able to skip data when called from the bulk scorer via DISI#nextDoc(), whereas advanceShallow gets used in Lucene 9.11.1 and skips over a lot of docs, resulting in faster completion.
> The difference on the 116-million-doc Big5 index is >200ms in Lucene 9.12.0 vs. <=5ms in Lucene 9.11.1.
>
> I tried reindexing process.name into another index, but with docs_and_freqs enabled, and the query latency came back to normal since it uses BlockImpactsDocsEnum as its ImpactsEnum.
>
> Is this a bug in the 912 postings reader? Or is it not possible to use the BlockImpactsDocsEnum with the new postings format?
>
> Thanks,
> Aniketh
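For concreteness, here's a rough sketch of the OpenSearch-side wrapping I have in mind, written against plain Lucene APIs. The class and helper name are made up, and the real change would live wherever OpenSearch builds the term query:

import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public final class FreqAwareTermQueries {

  // Hypothetical helper: if the field was indexed with DOCS only (no
  // frequencies, so every hit scores the same anyway), wrap the TermQuery in
  // a ConstantScoreQuery. The inner term query then no longer needs per-block
  // impacts, and early termination can kick in once the collector's priority
  // queue is full, instead of walking every posting through SlowImpactsEnum.
  public static Query termQuery(IndexReader reader, String field, String value) {
    Term term = new Term(field, value);
    FieldInfo fieldInfo = FieldInfos.getMergedFieldInfos(reader).fieldInfo(field);
    if (fieldInfo != null && fieldInfo.getIndexOptions() == IndexOptions.DOCS) {
      return new ConstantScoreQuery(new TermQuery(term));
    }
    return new TermQuery(term);
  }
}

The scoreMode override in TermQuery's createWeight would get to the same place inside Lucene itself, without callers having to check the field's IndexOptions first.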