> Is my understanding correct that changing only block size and disabling preset dictionaries are the changes that won't likely require re-indexing and could be as easily carried over to the next Lucene versions? I understand there is no guarantee, but curious to know your opinion because it introduces additional risks to us.
This assessment looks correct to me.

On Tue, Jun 7, 2022 at 7:25 PM Alexander Lukyanchikov <[email protected]> wrote:

> Hi Adrien, Michael
> Thank you, your responses are very helpful.
>
> > We're trying to have sensible defaults for the performance/compression
> > trade-off in the default codec
> Sure, the compression improvement achieved with these changes is amazing
> and the fetch speed tradeoff makes a lot of sense since it's likely
> unnoticeable for a general use case with larger stored fields payload.
>
> > One approach that is supported consists of rewriting indexes to the
> > default codec to perform upgrades using
> > `IndexWriter#addIndexes(CodecReader)`
>
> That indeed could be really useful, although an ability to upgrade from
> the previous Lucene version without re-indexing is very important for us.
> *Is my understanding correct that changing only block size and disabling
> preset dictionaries are the changes that won't likely require re-indexing
> and could be as easily carried over to the next Lucene versions? I
> understand there is no guarantee, but curious to know your opinion because
> it introduces additional risks to us.*
>
> > I wonder whether it would be worth trying switching from stored fields
> > to doc values
>
> Yes, that is something we considered before, but discarded due to access
> patterns specifics and the fact that payload size can also be large in some
> cases. Although in the future we will likely need to use doc values for a
> less generic feature, where small size is guaranteed.
>
> Regards,
> Alex
>
>
> On Tue, Jun 7, 2022 at 5:45 AM Michael Sokolov <[email protected]> wrote:
>
>> I wonder whether it would be worth trying switching from stored fields
>> to doc values. The access patterns are different, so the change would
>> not be trivial, but you might be able to achieve gains this way - I
>> really am not sure whether or not you would, the storage model is
>> completely different, but if you have a small number of fields, it
>> could be better?
>>
>> On Tue, Jun 7, 2022 at 3:16 AM Adrien Grand <[email protected]> wrote:
>> >
>> > Hi Alexander,
>> >
>> > Sorry that these changes impacted your workload negatively. We're
>> > trying to have sensible defaults for the performance/compression
>> > trade-off in the default codec, and indeed our guidance is to write a
>> > custom codec when it doesn't work. As you identified, Lucene only
>> > guarantees backward compatibility of file formats for the default
>> > codec, so if you write a custom codec you will have to maintain
>> > backward compatibility on your own.
>> >
>> > > Are there any less obvious ways to improve the situation for this
>> > > use case?
>> >
>> > I can't think of other work arounds.
>> >
>> > One approach that is supported consists of rewriting indexes to the
>> > default codec to perform upgrades using
>> > `IndexWriter#addIndexes(CodecReader)`. Say you have a custom codec, you
>> > could rewrite it to the default codec, then upgrade to a new Lucene
>> > version, and rewrite the index again using your custom codec. This
>> > doesn't remove the maintenance overhead entirely, but it helps not have
>> > to worry about backward compatibility of file formats.
>> >
>> > > does it make sense to expose related settings so users can tune the
>> > > compression without copying several internal classes?
>> >
>> > Lucene exposes ways to customize stored fields, look at the constructor
>> > of `Lucene90CompressingStoredFieldsFormat` for instance, which allows
>> > configuring block sizes, compression strategies, etc. These classes are
>> > considered internal so the API is not stable, but they could be used to
>> > avoid copying lots of code from Lucene's stored fields format.
>> >
>> > The consensus is that stored fields of the default codec shouldn't
>> > expose more tuning options than BEST_SPEED/BEST_COMPRESSION. This is
>> > already quite a burden in terms of testing and backward compatibility.
>> > The idea of exposing more tuning options has been brought up a few
>> > times and rejected.
>> >
>> > Not directly related to your question, but possibly still of interest
>> > to you:
>> > - We're now tracking the performance of stored fields on small
>> >   documents nightly:
>> >   http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html .
>> > - If you're seeing a 30% performance degradation with recent changes
>> >   to stored fields, there are good chances that you could improve the
>> >   performance of this workload significantly with a custom codec that
>> >   is lighter on compression.
>> >
>> >
>> > On Tue, Jun 7, 2022 at 1:32 AM Alexander Lukyanchikov <[email protected]> wrote:
>> >>
>> >> Hello everyone,
>> >> We are in the process of upgrading from Lucene 8.5.0 and on the
>> >> latest version our query performance tests show significant latency
>> >> degradation for one of the important use cases. In this test, each
>> >> query retrieves a relatively large dataset of 40k documents with a
>> >> small stored fields payload (< 100 bytes per doc).
>> >> It looks like the change which affects this use case was introduced
>> >> in LUCENE-9486 (Lucene 8.7), on this version our tests show almost 3
>> >> times higher latency. Later in LUCENE-9917 block size for BEST_SPEED
>> >> was reduced and since Lucene 8.10 we see about 30% degradation.
>> >>
>> >> It is still a significant performance regression, and in our case
>> >> query latency is more important than index size. Unless I'm missing
>> >> something, the only way to fix that today is to introduce our own
>> >> Codec, StoredFieldsFormat and CompressionMode - an experiment with
>> >> disabled preset dict and lower block size showed that these changes
>> >> allow to achieve query latency we need on Lucene 9.2. While it can
>> >> solve the problem, there is a concern about maintaining our own
>> >> version of the codec and having more complicated upgrades in the
>> >> future.
>> >>
>> >> Are there any less obvious ways to improve the situation for this
>> >> use case? If not, does it make sense to expose related settings so
>> >> users can tune the compression without copying several internal
>> >> classes?
>> >>
>> >> Thank you,
>> >> Alex
>> >
>> >
>> >
>> > --
>> > Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

--
Adrien
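
For readers wondering why block size matters so much for a workload that fetches many small documents, here is a rough, self-contained sketch. It is not Lucene code and does not use Lucene's stored fields format: it uses `java.util.zip` DEFLATE as a stand-in compressor, and the class name `BlockFetchCost`, the 64-byte document size and the document contents are all made up for illustration. It only shows the general effect discussed in this thread: when documents are compressed together in blocks, reading one document means decompressing its whole block, so larger blocks mean more bytes decoded per fetch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy model of block-compressed documents; NOT Lucene's implementation. */
final class BlockFetchCost {
    static final int DOC_SIZE = 64;   // hypothetical fixed document size in bytes
    static long inflatedBytes = 0;    // running total of bytes decompressed by fetch()

    /** Builds n fixed-size documents with distinct, recognizable content. */
    static byte[][] makeDocs(int n) {
        byte[][] docs = new byte[n][];
        for (int i = 0; i < n; i++) {
            byte[] doc = new byte[DOC_SIZE];
            Arrays.fill(doc, (byte) ' ');
            byte[] text = ("id=" + i + ";status=active").getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(text, 0, doc, 0, text.length);
            docs[i] = doc;
        }
        return docs;
    }

    /** Compresses consecutive groups of docsPerBlock documents as independent blocks. */
    static byte[][] compress(byte[][] docs, int docsPerBlock) {
        int blockCount = (docs.length + docsPerBlock - 1) / docsPerBlock;
        byte[][] blocks = new byte[blockCount][];
        byte[] buf = new byte[4096];
        for (int b = 0; b < blockCount; b++) {
            int first = b * docsPerBlock;
            int last = Math.min(docs.length, first + docsPerBlock);
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            for (int d = first; d < last; d++) {
                raw.write(docs[d], 0, DOC_SIZE);
            }
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            deflater.setInput(raw.toByteArray());
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            blocks[b] = out.toByteArray();
        }
        return blocks;
    }

    /** Returns document docId; the whole containing block is inflated first. */
    static byte[] fetch(byte[][] blocks, int docsPerBlock, int docCount, int docId) {
        int b = docId / docsPerBlock;
        int docsInBlock = Math.min(docsPerBlock, docCount - b * docsPerBlock);
        byte[] raw = new byte[docsInBlock * DOC_SIZE];
        Inflater inflater = new Inflater();
        inflater.setInput(blocks[b]);
        int off = 0;
        try {
            while (off < raw.length && !inflater.finished()) {
                off += inflater.inflate(raw, off, raw.length - off);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block " + b, e);
        } finally {
            inflater.end();
        }
        inflatedBytes += off;
        int start = (docId % docsPerBlock) * DOC_SIZE;
        return Arrays.copyOfRange(raw, start, start + DOC_SIZE);
    }

    public static void main(String[] args) {
        byte[][] docs = makeDocs(40_000);
        for (int docsPerBlock : new int[] {8, 128, 1024}) {
            byte[][] blocks = compress(docs, docsPerBlock);
            long compressed = 0;
            for (byte[] block : blocks) {
                compressed += block.length;
            }
            inflatedBytes = 0;
            byte[] doc = fetch(blocks, docsPerBlock, docs.length, 12_345);
            System.out.println(docsPerBlock + " docs/block: total compressed bytes=" + compressed
                + ", bytes inflated for one fetch=" + inflatedBytes
                + ", round trip ok=" + Arrays.equals(doc, docs[12_345]));
        }
    }
}
```

Running `main` prints, for each block size, the total compressed size and the number of bytes that had to be inflated to read a single 64-byte document. The per-fetch figure grows linearly with the block size, which is the cost that a smaller block size in a custom stored fields format is meant to reduce, at the expense of compression ratio.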
