Hi Adrien, Michael
Thank you, your responses are very helpful.

> We're trying to have sensible defaults for the performance/compression
> trade-off in the default codec
Sure, the compression improvement achieved with these changes is impressive,
and the fetch-speed trade-off makes a lot of sense, since it is likely
unnoticeable for the general use case with larger stored-fields payloads.
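For context, the kind of tuning we experimented with might look roughly like the sketch below. The class and format names and the tuning values are illustrative only, and `Lucene90CompressingStoredFieldsFormat` and its constructor are internal, unstable APIs:

```java
// Illustrative sketch of a codec that trades compression for stored-fields
// fetch speed. CompressionMode.FAST is plain LZ4 (no preset dictionaries),
// and smaller chunks mean less data to decompress per document fetch.
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.lucene92.Lucene92Codec;

public final class LightStoredFieldsCodec extends FilterCodec {

  private final StoredFieldsFormat storedFields =
      new Lucene90CompressingStoredFieldsFormat(
          "LightStoredFields",   // format name, written into the segment files
          CompressionMode.FAST,  // plain LZ4, lighter than the default mode
          16 * 1024,             // chunk size in bytes (illustrative value)
          128,                   // max docs per chunk (illustrative value)
          10);                   // block shift (illustrative value)

  public LightStoredFieldsCodec() {
    // Delegate everything except stored fields to the 9.2 default codec.
    super("LightStoredFieldsCodec", new Lucene92Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}
```

Reading such an index back requires registering the codec via SPI (a `META-INF/services/org.apache.lucene.codecs.Codec` entry), which is exactly the custom-codec maintenance burden discussed in this thread.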

> One approach that is supported consists of rewriting indexes to the
> default codec to perform upgrades using
> `IndexWriter#addIndexes(CodecReader)`

That could indeed be really useful, although the ability to upgrade from the
previous Lucene version without re-indexing is very important for us. *Is my
understanding correct that changing only the block size and disabling preset
dictionaries are changes that are unlikely to require re-indexing, and could
be carried over to subsequent Lucene versions just as easily? I understand
there is no guarantee, but I'm curious to hear your opinion, because this
introduces additional risk for us.*
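To make sure I understand the suggested approach, here is a minimal sketch (error handling and merge-policy details omitted; the cast assumes the leaves of an on-disk index are `SegmentReader`s, which extend `CodecReader`):

```java
// Sketch of the rewrite-on-upgrade approach: copy every segment of a source
// index into a target index whose IndexWriterConfig specifies the desired
// (e.g. default) codec, re-encoding all data along the way.
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

public final class CodecRewrite {

  /** Rewrites {@code source} into {@code target} using the codec from {@code config}. */
  public static void rewrite(Directory source, Directory target, IndexWriterConfig config)
      throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(source);
         IndexWriter writer = new IndexWriter(target, config)) {
      // Leaves of an on-disk index are SegmentReaders, which are CodecReaders.
      CodecReader[] leaves = reader.leaves().stream()
          .map(ctx -> (CodecReader) ctx.reader())
          .toArray(CodecReader[]::new);
      // addIndexes(CodecReader...) re-encodes the data with the writer's codec.
      writer.addIndexes(leaves);
      writer.commit();
    }
  }
}
```

The idea, if I follow it, would be to run this once with the default codec in the `IndexWriterConfig` before the version upgrade, and again with the custom codec afterwards.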

> I wonder whether it would be worth trying switching from stored fields to
> doc values

Yes, that is something we considered before but discarded due to the
specifics of our access patterns and the fact that the payload size can also
be large in some cases. That said, in the future we will likely need to use
doc values for a less generic feature where a small size is guaranteed.
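For completeness, a minimal sketch of the two options side by side (field names are illustrative; stored fields are row-oriented and block-compressed, while binary doc values are stored column-stride per field):

```java
// Sketch: index one small payload both as a stored field and as binary doc
// values, then read it back through each API. Field names are illustrative.
import java.nio.charset.StandardCharsets;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.util.BytesRef;

public final class PayloadStorageDemo {

  /** Indexes one payload both ways and reads it back; returns {stored, docValues}. */
  public static String[] roundTrip(byte[] payload) throws Exception {
    try (ByteBuffersDirectory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        Document doc = new Document();
        doc.add(new StoredField("payload_stored", payload));
        doc.add(new BinaryDocValuesField("payload_dv", new BytesRef(payload)));
        writer.addDocument(doc);
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        // Stored fields: fetching decompresses the block holding the document.
        BytesRef stored = reader.document(0).getBinaryValue("payload_stored");
        // Doc values: column-stride access on the leaf, no shared-block decompression.
        BinaryDocValues dv =
            reader.leaves().get(0).reader().getBinaryDocValues("payload_dv");
        dv.advanceExact(0);
        return new String[] {stored.utf8ToString(), dv.binaryValue().utf8ToString()};
      }
    }
  }
}
```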

Regards,
Alex


On Tue, Jun 7, 2022 at 5:45 AM Michael Sokolov <[email protected]> wrote:

> I wonder whether it would be worth trying switching from stored fields
> to doc values. The access patterns are different, so the change would
> not be trivial, but you might be able to achieve gains this way - I
> really am not sure whether or not you would, the storage model is
> completely different, but if you have a small number of fields, it
> could be better?
>
> On Tue, Jun 7, 2022 at 3:16 AM Adrien Grand <[email protected]> wrote:
> >
> > Hi Alexander,
> >
> > Sorry that these changes impacted your workload negatively. We're trying
> to have sensible defaults for the performance/compression trade-off in the
> default codec, and indeed our guidance is to write a custom codec when it
> doesn't work. As you identified, Lucene only guarantees backward
> compatibility of file formats for the default codec, so if you write a
> custom codec you will have to maintain backward compatibility on your own.
> >
> > > Are there any less obvious ways to improve the situation for this use
> case?
> >
> > I can't think of other workarounds.
> >
> > One approach that is supported consists of rewriting indexes to the
> default codec to perform upgrades using
> `IndexWriter#addIndexes(CodecReader)`. Say you have a custom codec, you
> could rewrite it to the default codec, then upgrade to a new Lucene
> version, and rewrite the index again using your custom codec. This doesn't
> remove the maintenance overhead entirely, but it helps avoid having to worry
> about backward compatibility of file formats.
> >
> > > does it make sense to expose related settings so users can tune the
> compression without copying several internal classes?
> >
> > Lucene exposes ways to customize stored fields, look at the constructor
> of `Lucene90CompressingStoredFieldsFormat` for instance, which allows
> configuring block sizes, compression strategies, etc. These classes are
> considered internal so the API is not stable, but they could be used to
> avoid copying lots of code from Lucene's stored fields format.
> >
> > The consensus is that stored fields of the default codec shouldn't
> expose more tuning options than BEST_SPEED/BEST_COMPRESSION. This is
> already quite a burden in terms of testing and backward compatibility. The
> idea of exposing more tuning options has been brought up a few times and
> rejected.
> >
> > Not directly related to your question, but possibly still of interest to
> you:
> >  - We're now tracking the performance of stored fields on small
> documents nightly:
> http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
> .
> >  - If you're seeing a 30% performance degradation with recent changes to
> stored fields, there are good chances that you could improve the
> performance of this workload significantly with a custom codec that is
> lighter on compression.
> >
> >
> > On Tue, Jun 7, 2022 at 1:32 AM Alexander Lukyanchikov <
> [email protected]> wrote:
> >>
> >> Hello everyone,
> >> We are in the process of upgrading from Lucene 8.5.0 and on the latest
> version our query performance tests show significant latency degradation
> for one of the important use cases. In this test, each query retrieves a
> relatively large dataset of 40k documents with a small stored fields
> payload (< 100 bytes per doc).
> >> It looks like the change which affects this use case was introduced in
> LUCENE-9486 (Lucene 8.7), on this version our tests show almost 3 times
> higher latency. Later in LUCENE-9917 block size for BEST_SPEED was reduced
> and since Lucene 8.10 we see about 30% degradation.
> >>
> >> It is still a significant performance regression, and in our case query
> latency is more important than index size. Unless I'm missing something,
> the only way to fix that today is to introduce our own Codec,
> StoredFieldsFormat and CompressionMode - an experiment with disabled preset
> dict and a lower block size showed that these changes let us achieve the query
> latency we need on Lucene 9.2. While it can solve the problem, there is a
> concern about maintaining our own version of the codec and having more
> complicated upgrades in the future.
> >>
> >> Are there any less obvious ways to improve the situation for this use
> case? If not, does it make sense to expose related settings so users can
> tune the compression without copying several internal classes?
> >>
> >> Thank you,
> >> Alex
> >
> >
> >
> > --
> > Adrien
>
