Re: 30% query performance degradation for documents with small stored fields

Adrien Grand Mon, 13 Jun 2022 09:44:45 -0700

> Is my understanding correct that changing only block size and disabling
> preset dictionaries are the changes that won't likely require re-indexing
> and could be as easily carried over to the next Lucene versions? I
> understand there is no guarantee, but curious to know your opinion because
> it introduces additional risks to us.


This assessment looks correct to me.
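
To make the block-size side of this trade-off concrete, here is a minimal, self-contained sketch. It is a toy model and not Lucene's stored fields format: DEFLATE from `java.util.zip` stands in for Lucene's compression modes, preset dictionaries are not modeled, and the class name `BlockFetchSketch`, its methods, and the synthetic documents are all hypothetical. It only shows that fetching one small document means inflating the whole block that contains it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy model of block-compressed stored fields. This is NOT Lucene's file format. */
public class BlockFetchSketch {

  /** One compressed block plus the per-document lengths needed to slice it. */
  static final class Block {
    final byte[] compressed;
    final int rawLength;
    final int[] docLengths;

    Block(byte[] compressed, int rawLength, int[] docLengths) {
      this.compressed = compressed;
      this.rawLength = rawLength;
      this.docLengths = docLengths;
    }
  }

  /** Concatenates every docsPerBlock documents and deflates each group as one block. */
  static List<Block> write(List<byte[]> docs, int docsPerBlock) {
    List<Block> blocks = new ArrayList<>();
    for (int start = 0; start < docs.size(); start += docsPerBlock) {
      int end = Math.min(start + docsPerBlock, docs.size());
      ByteArrayOutputStream raw = new ByteArrayOutputStream();
      int[] lengths = new int[end - start];
      for (int i = start; i < end; i++) {
        byte[] doc = docs.get(i);
        lengths[i - start] = doc.length;
        raw.write(doc, 0, doc.length);
      }
      blocks.add(new Block(deflate(raw.toByteArray()), raw.size(), lengths));
    }
    return blocks;
  }

  static byte[] deflate(byte[] raw) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    while (!deflater.finished()) {
      out.write(buffer, 0, deflater.deflate(buffer));
    }
    deflater.end();
    return out.toByteArray();
  }

  /** Number of uncompressed bytes that must be inflated to read a single document. */
  static int inflatedBytesPerFetch(List<Block> blocks, int docsPerBlock, int docId) {
    return blocks.get(docId / docsPerBlock).rawLength;
  }

  /** Returns one document; the whole enclosing block is inflated before slicing. */
  static byte[] fetch(List<Block> blocks, int docsPerBlock, int docId) {
    Block block = blocks.get(docId / docsPerBlock);
    byte[] raw = new byte[block.rawLength];
    Inflater inflater = new Inflater();
    inflater.setInput(block.compressed);
    try {
      int read = 0;
      while (read < raw.length && !inflater.finished()) {
        int n = inflater.inflate(raw, read, raw.length - read);
        if (n == 0 && inflater.needsInput()) {
          throw new IllegalStateException("truncated block");
        }
        read += n;
      }
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt block", e);
    } finally {
      inflater.end();
    }
    int slot = docId % docsPerBlock;
    int offset = 0;
    for (int i = 0; i < slot; i++) {
      offset += block.docLengths[i];
    }
    return Arrays.copyOfRange(raw, offset, offset + block.docLengths[slot]);
  }

  public static void main(String[] args) {
    // Synthetic stand-in for the workload in this thread: 40k documents, under 100 bytes each.
    List<byte[]> docs = new ArrayList<>();
    for (int i = 0; i < 40_000; i++) {
      docs.add(("id=" + i + ";status=active;owner=user" + (i % 97)).getBytes(StandardCharsets.UTF_8));
    }
    for (int docsPerBlock : new int[] {16, 1024}) {
      List<Block> blocks = write(docs, docsPerBlock);
      long compressed = 0;
      for (Block b : blocks) {
        compressed += b.compressed.length;
      }
      System.out.println("docsPerBlock=" + docsPerBlock + " compressedBytes=" + compressed
          + " bytesInflatedPerFetch=" + inflatedBytesPerFetch(blocks, docsPerBlock, 12_345));
    }
  }
}
```

Running `main` prints the total compressed size and the bytes inflated per lookup for a small and a large block size. Larger blocks would be expected to compress better while costing more work per fetched document, which matters most when each document is small.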

On Tue, Jun 7, 2022 at 7:25 PM Alexander Lukyanchikov <
[email protected]> wrote:

> Hi Adrien, Michael
> Thank you, your responses are very helpful.
>
> > We're trying to have sensible defaults for the performance/compression
> trade-off in the default codec
> Sure, the compression improvement achieved with these changes is amazing
> and the fetch speed tradeoff makes a lot of sense since it's likely
> unnoticeable for a general use case with larger stored fields payload.
>
> > One approach that is supported consists of rewriting indexes to the
> default codec to perform upgrades using
> `IndexWriter#addIndexes(CodecReader)`
>
> That indeed could be really useful, although an ability to upgrade from
> the previous Lucene version without re-indexing is very important for us. *Is
> my understanding correct that changing only block size and disabling preset
> dictionaries are the changes that won't likely require re-indexing and
> could be as easily carried over to the next Lucene versions? I understand
> there is no guarantee, but curious to know your opinion because it
> introduces additional risks to us.*
>
> > I wonder whether it would be worth trying switching from stored fields
> to doc values
>
> Yes, that is something we considered before, but discarded due to the
> specifics of our access patterns and the fact that payload size can also be large in some
> cases. Although in the future we will likely need to use doc values for a
> less generic feature, where small size is guaranteed.
>
> Regards,
> Alex
>
>
> On Tue, Jun 7, 2022 at 5:45 AM Michael Sokolov <[email protected]> wrote:
>
>> I wonder whether it would be worth trying switching from stored fields
>> to doc values. The access patterns are different, so the change would
>> not be trivial, but you might be able to achieve gains this way - I
>> really am not sure whether or not you would, the storage model is
>> completely different, but if you have a small number of fields, it
>> could be better?
>>
>> On Tue, Jun 7, 2022 at 3:16 AM Adrien Grand <[email protected]> wrote:
>> >
>> > Hi Alexander,
>> >
>> > Sorry that these changes impacted your workload negatively. We're
>> trying to have sensible defaults for the performance/compression trade-off
>> in the default codec, and indeed our guidance is to write a custom codec
>> when it doesn't work. As you identified, Lucene only guarantees backward
>> compatibility of file formats for the default codec, so if you write a
>> custom codec you will have to maintain backward compatibility on your own.
>> >
>> > > Are there any less obvious ways to improve the situation for this use
>> case?
>> >
>> > I can't think of other workarounds.
>> >
>> > One approach that is supported consists of rewriting indexes to the
>> default codec to perform upgrades using
>> `IndexWriter#addIndexes(CodecReader)`. Say you have a custom codec, you
>> could rewrite it to the default codec, then upgrade to a new Lucene
>> version, and rewrite the index again using your custom codec. This doesn't
>> remove the maintenance overhead entirely, but it means you don't have to worry
>> about backward compatibility of file formats.
>> >
>> > > does it make sense to expose related settings so users can tune the
>> compression without copying several internal classes?
>> >
>> > Lucene exposes ways to customize stored fields, look at the constructor
>> of `Lucene90CompressingStoredFieldsFormat` for instance, which allows
>> configuring block sizes, compression strategies, etc. These classes are
>> considered internal so the API is not stable, but they could be used to
>> avoid copying lots of code from Lucene's stored fields format.
>> >
>> > The consensus is that stored fields of the default codec shouldn't
>> expose more tuning options than BEST_SPEED/BEST_COMPRESSION. This is
>> already quite a burden in terms of testing and backward compatibility. The
>> idea of exposing more tuning options has been brought up a few times and
>> rejected.
>> >
>> > Not directly related to your question, but possibly still of interest
>> to you:
>> >  - We're now tracking the performance of stored fields on small
>> documents nightly:
>> http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
>> >  - If you're seeing a 30% performance degradation with recent changes
>> to stored fields, there are good chances that you could improve the
>> performance of this workload significantly with a custom codec that is
>> lighter on compression.
>> >
>> >
>> > On Tue, Jun 7, 2022 at 1:32 AM Alexander Lukyanchikov <
>> [email protected]> wrote:
>> >>
>> >> Hello everyone,
>> >> We are in the process of upgrading from Lucene 8.5.0 and on the latest
>> version our query performance tests show significant latency degradation
>> for one of the important use cases. In this test, each query retrieves a
>> relatively large dataset of 40k documents with a small stored fields
>> payload (< 100 bytes per doc).
>> >> It looks like the change which affects this use case was introduced in
>> LUCENE-9486 (Lucene 8.7), on this version our tests show almost 3 times
>> higher latency. Later in LUCENE-9917 block size for BEST_SPEED was reduced
>> and since Lucene 8.10 we see about 30% degradation.
>> >>
>> >> It is still a significant performance regression, and in our case
>> query latency is more important than index size. Unless I'm missing
>> something, the only way to fix that today is to introduce our own Codec,
>> StoredFieldsFormat and CompressionMode - an experiment with disabled preset
>> dict and lower block size showed that these changes let us achieve the query
>> latency we need on Lucene 9.2. While it can solve the problem, there is a
>> concern about maintaining our own version of the codec and having more
>> complicated upgrades in the future.
>> >>
>> >> Are there any less obvious ways to improve the situation for this use
>> case? If not, does it make sense to expose related settings so users can
>> tune the compression without copying several internal classes?
>> >>
>> >> Thank you,
>> >> Alex
>> >
>> >
>> >
>> > --
>> > Adrien
>>

-- 
Adrien
