Dear Adrien,

We found that the match-all regression is not caused by the postings list
format; it is caused by the MaxScoreBulkScorer class instead. I will start
a new email thread about it, since the title of this thread no longer
applies.

On Wed, Sep 11, 2024 at 6:24 PM Rui Wu <rui...@mongodb.com> wrote:

> Thanks for your prompt reply!
>
> On Tue, Sep 10, 2024 at 1:38 PM Adrien Grand <jpou...@gmail.com> wrote:
>
>> Can you clarify what you refer to by match-all and match-many queries?
>> Lucene's MatchAllDocsQuery should not be impacted since it doesn't use
>> postings for evaluation.
>>
> match-all refers to a query that hits all docs, e.g. a term query with
> term of "A", and every doc has a term "A". match-many refers to a query
> that hits a high percentage of the total docs.
>
>>
>> Since FOR is a bit less space-efficient than PFOR, I guess it could be a
>> bit slower if your Directory abstraction was a bit slow at reading data.
>> Are you using Lucene's MMapDirectory?
>>
> Yes, we use mmap for posting list index files.
>
>>
>> Elasticsearch indeed only retained PFOR for space-efficiency reasons. We
>> have many indexes that use IndexOptions.DOCS where the move from PFOR to
>> FOR significantly increased disk usage (unlike indexes that use
>> IndexOptions.DOCS_AND_FREQS_AND_POSITIONS where space is typically
>> dominated by positions anyway).
>>
> Got it. Thanks!
>
>>
>> On Tue, Sep 10, 2024 at 9:31 PM Rui Wu <rui...@mongodb.com.invalid>
>> wrote:
>>
>> > Dear experts,
>> >
>> > I have a question about the following change:
>> > The Lucene9.11 changed the Posting list format
>> > (Lucene GITHUB#12696 <https://github.com/apache/lucene/pull/12696>:
>> > Change Postings back to using FOR in Lucene99PostingsFormat. Freqs,
>> > positions and offset keep using PFOR)
>> >
>> > However, in our (Mongodb Atlas Search) internal performance testing, we
>> > saw an increase of query latency up to 32% on match-all and match-many
>> > inverted index based queries, e.g. query.phrase-slop-0 and
>> > query.date-facet-match-all.
>> >
>> >
>> > I wonder if the community sees similar performance regressions on some
>> > queries for the Lucene99PostingsFormat.
>> >
>> > This ES PR <https://github.com/elastic/elasticsearch/pull/103601>
>> > diverged from Lucene. Lucene 9.9 has introduced a new posting format
>> > that uses FOR instead of PFOR. Elasticsearch prefers the former format,
>> > therefore they introduce it as their own posting format here
>> > <https://github.com/elastic/elasticsearch/tree/main/server/src/main/java/org/elasticsearch/index/codec/postings>.
>> > However, ES cited the reason as only being index size increase.
>> >
>> > Thank you very much!
>> >
>>
>>
>> --
>> Adrien
>>
>
