I wonder whether it would be worth trying to switch from stored fields to doc values. The access patterns are different, so the change would not be trivial, but you might see gains this way. I really am not sure whether you would, since the storage model is completely different, but if you only have a small number of fields, it could be better.
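To make the difference in storage models a bit more concrete, here is a toy sketch in plain Java. It uses no Lucene classes at all: `BlockRowStore`, `ColumnStore`, `makeDocs` and `price` are made-up names, and the code is not how Lucene implements either format (in particular, the toy inflates a whole block on every lookup and uses `java.util.zip` rather than Lucene's compression modes). It only illustrates the general idea that a row store compressed in blocks pays a per-lookup decompression cost that grows with block size, while a columnar layout addresses a single field's value by doc ID.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy model only. None of these classes exist in Lucene; all names are hypothetical. */
public class StorageModelSketch {

    /** Row-oriented: whole documents are packed into blocks, each block compressed as a unit. */
    static final class BlockRowStore {
        private final byte[][] blocks;   // compressed bytes of each block
        private final int[] rawLengths;  // uncompressed size of each block
        private final int docsPerBlock;
        long bytesInflated = 0;          // total bytes decompressed by get()

        BlockRowStore(String[] docs, int docsPerBlock) {
            this.docsPerBlock = docsPerBlock;
            int numBlocks = (docs.length + docsPerBlock - 1) / docsPerBlock;
            blocks = new byte[numBlocks][];
            rawLengths = new int[numBlocks];
            for (int b = 0; b < numBlocks; b++) {
                int from = b * docsPerBlock;
                int to = Math.min(docs.length, from + docsPerBlock);
                // Documents must not contain '\n', which is used as the separator here.
                byte[] raw = String.join("\n", Arrays.copyOfRange(docs, from, to))
                        .getBytes(StandardCharsets.UTF_8);
                rawLengths[b] = raw.length;
                blocks[b] = deflate(raw);
            }
        }

        /** Simplification: every lookup inflates the entire block that holds the document. */
        String get(int docId) {
            int b = docId / docsPerBlock;
            byte[] raw = inflate(blocks[b], rawLengths[b]);
            bytesInflated += raw.length;
            return new String(raw, StandardCharsets.UTF_8).split("\n", -1)[docId % docsPerBlock];
        }
    }

    /** Column-oriented: one value per document for a single field, addressed by doc ID. */
    static final class ColumnStore {
        private final long[] values;
        ColumnStore(long[] values) { this.values = values; }
        long get(int docId) { return values[docId]; }
    }

    static long price(int i) { return (i * 7L) % 1000; }

    /** Builds small synthetic documents, well under 100 bytes each. */
    static String[] makeDocs(int n) {
        String[] docs = new String[n];
        for (int i = 0; i < n; i++) docs[i] = "id=" + i + ";price=" + price(i);
        return docs;
    }

    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed, int rawLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[rawLength];
        try {
            int off = 0;
            while (off < rawLength && !inflater.finished()) {
                int n = inflater.inflate(raw, off, rawLength - off);
                if (n == 0 && inflater.needsInput()) break;
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return raw;
    }

    public static void main(String[] args) {
        int numDocs = 40_000;
        String[] docs = makeDocs(numDocs);
        long[] prices = new long[numDocs];
        for (int i = 0; i < numDocs; i++) prices[i] = price(i);
        BlockRowStore largeBlocks = new BlockRowStore(docs, 1024);
        BlockRowStore smallBlocks = new BlockRowStore(docs, 16);
        ColumnStore priceColumn = new ColumnStore(prices);
        long sum = 0;
        for (int docId = 0; docId < numDocs; docId += 97) { // scattered lookups
            largeBlocks.get(docId);
            smallBlocks.get(docId);
            sum += priceColumn.get(docId);
        }
        System.out.println("bytes inflated, 1024 docs/block: " + largeBlocks.bytesInflated);
        System.out.println("bytes inflated, 16 docs/block:   " + smallBlocks.bytesInflated);
        System.out.println("column sum (no decompression in this toy): " + sum);
    }
}
```

Whether real doc values would actually beat stored fields for this workload is something only a benchmark on the real index can tell; the toy says nothing about Lucene's actual costs.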

On Tue, Jun 7, 2022 at 3:16 AM Adrien Grand <[email protected]> wrote:
>
> Hi Alexander,
>
> Sorry that these changes impacted your workload negatively. We're trying to
> have sensible defaults for the performance/compression trade-off in the
> default codec, and indeed our guidance is to write a custom codec when it
> doesn't work. As you identified, Lucene only guarantees backward
> compatibility of file formats for the default codec, so if you write a custom
> codec you will have to maintain backward compatibility on your own.
>
> > Are there any less obvious ways to improve the situation for this use case?
>
> I can't think of other work arounds.
>
> One approach that is supported consists of rewriting indexes to the default
> codec to perform upgrades using `IndexWriter#addIndexes(CodecReader)`. Say
> you have a custom codec, you could rewrite it to the default codec, then
> upgrade to a new Lucene version, and rewrite the index again using your
> custom codec. This doesn't remove the maintenance overhead entirely, but it
> helps not have to worry about backward compatibility of file formats.
>
> > does it make sense to expose related settings so users can tune the
> > compression without copying several internal classes?
>
> Lucene exposes ways to customize stored fields, look at the constructor of
> `Lucene90CompressingStoredFieldsFormat` for instance, which allows
> configuring block sizes, compression strategies, etc. These classes are
> considered internal so the API is not stable, but they could be used to avoid
> copying lots of code from Lucene's stored fields format.
>
> The consensus is that stored fields of the default codec shouldn't expose
> more tuning options than BEST_SPEED/BEST_COMPRESSION. This is already quite a
> burden in terms of testing and backward compatibility. The idea of exposing
> more tuning options has been brought up a few times and rejected.
>
> Not directly related to your question, but possibly still of interest to you:
> - We're now tracking the performance of stored fields on small documents
> nightly:
> http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html.
> - If you're seeing a 30% performance degradation with recent changes to
> stored fields, there are good chances that you could improve the performance
> of this workload significantly with a custom codec that is lighter on
> compression.
>
>
> On Tue, Jun 7, 2022 at 1:32 AM Alexander Lukyanchikov
> <[email protected]> wrote:
>>
>> Hello everyone,
>> We are in the process of upgrading from Lucene 8.5.0 and on the latest
>> version our query performance tests show significant latency degradation for
>> one of the important use cases. In this test, each query retrieves a
>> relatively large dataset of 40k documents with a small stored fields payload
>> (< 100 bytes per doc).
>> It looks like the change which affects this use case was introduced in
>> LUCENE-9486 (Lucene 8.7), on this version our tests show almost 3 times
>> higher latency. Later in LUCENE-9917 block size for BEST_SPEED was reduced
>> and since Lucene 8.10 we see about 30% degradation.
>>
>> It is still a significant performance regression, and in our case query
>> latency is more important than index size. Unless I'm missing something, the
>> only way to fix that today is to introduce our own Codec, StoredFieldsFormat
>> and CompressionMode - an experiment with disabled preset dict and lower
>> block size showed that these changes allow to achieve query latency we need
>> on Lucene 9.2. While it can solve the problem, there is a concern about
>> maintaining our own version of the codec and having more complicated
>> upgrades in the future.
>>
>> Are there any less obvious ways to improve the situation for this use case?
>> If not, does it make sense to expose related settings so users can tune the
>> compression without copying several internal classes?
>>
>> Thank you,
>> Alex
>
>
>
> --
> Adrien
