Hey,

Full disclosure -- I sit at a desk next to Aniketh, so we chatted about
this one in the real world.

Our current working theory is as follows:

We're using the TOP_SCORES score mode on both the old and new code paths.
On the old code path, we were returning a BlockImpactsDocsEnum even though
we didn't have frequencies. I'm guessing that the impacts were constant
across all blocks, since we have no frequencies and no norms. It was fast
because, once the collector had filled its priority queue, we'd check the
(constant) impacts to find the first block that's strictly better than the
min competitive score. Since all scores are equal, that would quickly skip
to the end. On the new code path, we always return the SlowImpactsEnum,
since we don't have frequencies. Once we fill up the collector's priority
queue, we can't do any impact checks to find a competitive block, so we
iterate through all the doc IDs for the term.
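
To make that concrete, here's a rough sketch of the skipping that the
impacts API enables. This is simplified from what Lucene's ImpactsDISI and
MaxScoreCache actually do, and maxScoreOf() is a stand-in for the
SimScorer-derived bound, not a real method:

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.Impact;
import org.apache.lucene.index.Impacts;
import org.apache.lucene.index.ImpactsEnum;

// Once the priority queue is full, widen the skip window while the best
// possible score in it still can't beat the min competitive score. With
// constant impacts every level looks the same, so this jumps straight to
// the end of the postings.
static int firstCompetitiveTarget(ImpactsEnum impactsEnum,
    float minCompetitiveScore) throws IOException {
  int target = impactsEnum.docID() + 1;
  impactsEnum.advanceShallow(target); // position skip data, no doc decoding
  Impacts impacts = impactsEnum.getImpacts();
  int level = 0;
  while (level < impacts.numLevels()
      && maxScoreOf(impacts.getImpacts(level)) < minCompetitiveScore) {
    // everything up to this level's boundary is non-competitive
    target = impacts.getDocIdUpTo(level) + 1;
    level++;
  }
  return target;
}

// Stand-in bound: the real one asks the SimScorer what each (freq, norm)
// pair can score at most and takes the maximum over the level.
static float maxScoreOf(List<Impact> impacts) {
  float max = 0;
  for (Impact impact : impacts) {
    max = Math.max(max, impact.freq);
  }
  return max;
}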

The best solution would probably be to flip the score mode from TOP_SCORES
to TOP_DOCS, and avoid looking at impacts altogether, since we don't have
frequencies anyway. Then early termination logic will kick in, which is
arguably even better than the previous fast skipping logic. We can address
this in OpenSearch by just wrapping things in a ConstantScoreQuery if
frequencies are not enabled for the field.
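
Concretely, the OpenSearch-side workaround could be as simple as this
sketch (the helper and the way the index options get looked up are
illustrative, not existing OpenSearch code):

import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// If the field was indexed without frequencies, every hit scores the same,
// so a ConstantScoreQuery wrapper loses nothing and keeps the scorer off
// the impacts path.
static Query termQueryFor(String field, String value, IndexOptions options) {
  Query query = new TermQuery(new Term(field, value));
  if (options == IndexOptions.DOCS) {
    query = new ConstantScoreQuery(query);
  }
  return query;
}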

I'm wondering if it would make sense to push a scoreMode override into
TermQuery's createWeight: if the field doesn't have frequencies, would an
override similar to ConstantScoreQuery's make sense (i.e. COMPLETE ->
COMPLETE_NO_SCORES, everything else -> TOP_DOCS)?
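
As a sketch of that mapping (the helper is hypothetical, not an existing
Lucene method):

import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.search.ScoreMode;

// Hypothetical override for TermQuery#createWeight: with no frequencies,
// all scores are equal, so COMPLETE only needs COMPLETE_NO_SCORES and
// every other mode can drop to TOP_DOCS, enabling early termination.
static ScoreMode overrideScoreMode(ScoreMode requested, IndexOptions options) {
  if (options != IndexOptions.DOCS) {
    return requested; // field has frequencies: keep the caller's mode
  }
  return requested == ScoreMode.COMPLETE
      ? ScoreMode.COMPLETE_NO_SCORES
      : ScoreMode.TOP_DOCS;
}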

Thanks,
Froh


On Wed, Apr 2, 2025 at 11:29 AM ANIKETH JAIN <checkanik...@gmail.com> wrote:

> Hey folks,
>
> While investigating a regression in OpenSearch 2.17.1 (Lucene 9.11.1) and
> 2.18.0 (Lucene 9.12.0) for a simple Term Query on the process.name field
> in the Big5 workload, I noticed that the new Lucene912PostingsReader
> creates the ImpactsEnum by wrapping SlowImpactsEnum over the postings when
> a field only has IndexOptions.DOCS:
>
> curl -X POST "http://localhost:9200/big5/_search" -H "Content-Type:
> application/json" -d '{ "query": { "term": { "process.name": "kernel" } } }'
>
>
> Lucene912PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo,
> BlockTermState state, int flags) has an extra check on *indexHasFreqs*:
>
> if (state.docFreq >= BLOCK_SIZE
>     && indexHasFreqs
>     && (indexHasPositions == false
>         || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == false)) {
>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
> }
>
>
> Whereas Lucene99PostingsReader creates the faster BlockImpactsDocsEnum for
> fields with IndexOptions.DOCS, and only creates the SlowImpactsEnum when
> the document frequency is at most 128 (the block size):
>
>
> Lucene99PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo,
> BlockTermState state, int flags)
>
> if (state.docFreq <= BLOCK_SIZE) {
>   // no skip data
>   return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
> }
>
>
> if (indexHasPositions == false
>     || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == false) {
>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
> }
>
>
>
> Since Lucene 9.12.0 wraps a SlowImpactsEnum, whose advanceShallow method
> is a no-op, the Term Query is never able to skip data when called from the
> bulk scorer via DISI#nextDoc(), whereas advanceShallow gets used in Lucene
> 9.11.1 and skips over a lot of docs, resulting in faster completion.
> On the 116-million-doc Big5 index, the query takes >200ms on Lucene 9.12.0
> versus <=5ms on Lucene 9.11.1.
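>
> For reference, this is roughly what SlowImpactsEnum does (paraphrased from
> the Lucene sources, not quoted exactly): advanceShallow is a no-op and
> getImpacts reports a single level with an unbounded impact, so the scorer
> can never rule a block out.
>
> @Override
> public void advanceShallow(int target) {
>   // no-op: there is no skip data to consult
> }
>
> @Override
> public Impacts getImpacts() {
>   // a single level covering all docs with freq = Integer.MAX_VALUE,
>   // i.e. an unbounded max score, so nothing is ever skippable
>   return DUMMY_IMPACTS;
> }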
>
> I tried reindexing process.name into another index with docs_and_freqs
> enabled, and the query latency came back to normal, since the query then
> uses BlockImpactsDocsEnum as its ImpactsEnum.
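>
> (At the Lucene level, that reindex boils down to indexing the field with
> frequencies enabled; a minimal sketch, with the rest of the field setup
> illustrative:)
>
> import org.apache.lucene.document.FieldType;
> import org.apache.lucene.index.IndexOptions;
>
> FieldType keywordWithFreqs = new FieldType();
> keywordWithFreqs.setTokenized(false);                          // keyword-style field
> keywordWithFreqs.setIndexOptions(IndexOptions.DOCS_AND_FREQS); // enable frequencies
> keywordWithFreqs.freeze();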
>
> Is this a bug in the 912 postings reader? Or is it not possible to use
> the BlockImpactsDocsEnum with the new postings format?
>
>
> Thanks,
> Aniketh
>
