Dear Adrien,

We found that the match-all regression is not caused by the posting list format; it is caused by the MaxScoreBulkScorer class. I will start a new email thread about it, since the title of this thread no longer applies.

On Wed, Sep 11, 2024 at 6:24 PM Rui Wu <rui...@mongodb.com> wrote:

> Thanks for your prompt reply!
>
> On Tue, Sep 10, 2024 at 1:38 PM Adrien Grand <jpou...@gmail.com> wrote:
>
>> Can you clarify what you refer to by match-all and match-many queries?
>> Lucene's MatchAllDocsQuery should not be impacted since it doesn't use
>> postings for evaluation.
>>
> match-all refers to a query that hits all docs, e.g. a term query with
> term of "A", and every doc has a term "A". match-many refers to a query
> that hits a high percentage of the total docs.
>
>>
>> Since FOR is a bit less space-efficient than PFOR, I guess it could be a
>> bit slower if your Directory abstraction was a bit slow at reading data.
>> Are you using Lucene's MMapDirectory?
>>
> Yes, we use mmap for posting list index files.
>
>>
>> Elasticsearch indeed only retained PFOR for space-efficiency reasons. We
>> have many indexes that use IndexOptions.DOCS where the move from PFOR to
>> FOR significantly increased disk usage (unlike indexes that use
>> IndexOptions.DOCS_AND_FREQS_AND_POSITIONS where space is typically
>> dominated by positions anyway).
>>
> Got it. Thanks!
>
>>
>> On Tue, Sep 10, 2024 at 9:31 PM Rui Wu <rui...@mongodb.com.invalid>
>> wrote:
>>
>> > Dear experts,
>> >
>> > I have a question about the following change:
>> > The Lucene9.11 changed the Posting list format
>> > (Lucene GITHUB#12696 <https://github.com/apache/lucene/pull/12696>: Change
>> > Postings back to using FOR in Lucene99PostingsFormat. Freqs, positions and
>> > offset keep using PFOR)
>> >
>> > However, in our (Mongodb Atlas Search) internal performance testing, we saw
>> > an increase of query latency up to 32% on match-all and match-many inverted
>> > index based queries, e.g. query.phrase-slop-0 and
>> > query.date-facet-match-all.
>> >
>> >
>> > I wonder if the community sees similar performance regressions on some
>> > queries for the Lucene99PostingsFormat.
>> >
>> > This ES PR <https://github.com/elastic/elasticsearch/pull/103601> diverged
>> > from Lucene. Lucene 9.9 has introduced a new posting format that uses FOR
>> > instead of PFOR. Elasticsearch prefers the former format, therefore they
>> > introduce it as their own posting format here
>> > <https://github.com/elastic/elasticsearch/tree/main/server/src/main/java/org/elasticsearch/index/codec/postings>.
>> > However, ES cited the reason as only being index size increase.
>> >
>> > Thank you very much!
>> >
>>
>>
>> --
>> Adrien
>>
>