Hi Alexander,

Sorry that these changes impacted your workload negatively. We're trying to
have sensible defaults for the performance/compression trade-off in the
default codec, and indeed our guidance is to write a custom codec when it
doesn't work. As you identified, Lucene only guarantees backward
compatibility of file formats for the default codec, so if you write a
custom codec you will have to maintain backward compatibility on your own.

> Are there any less obvious ways to improve the situation for this use
case?

I can't think of other workarounds.

One supported approach is to rewrite indexes to the default codec when
performing upgrades, using `IndexWriter#addIndexes(CodecReader)`. Say you
have a custom codec: you could rewrite the index to the default codec,
upgrade to the new Lucene version, then rewrite the index again using your
custom codec. This doesn't remove the maintenance overhead entirely, but it
means you don't have to worry about backward compatibility of file formats.
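To make the rewrite step concrete, here is a rough sketch (not from the
original thread) of copying an index over to a writer configured with the
default codec; the directory paths and the use of StandardAnalyzer are
placeholder assumptions, and this requires the Lucene core jar on the
classpath:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RewriteToDefaultCodec {
  public static void main(String[] args) throws Exception {
    try (Directory src = FSDirectory.open(Paths.get("index-custom"));
         Directory dst = FSDirectory.open(Paths.get("index-default"));
         DirectoryReader reader = DirectoryReader.open(src)) {
      // Leaving the codec unset on the config means the rewritten index
      // uses the default codec, whose file formats Lucene keeps
      // backward compatible.
      IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
      try (IndexWriter writer = new IndexWriter(dst, cfg)) {
        // Each leaf of a DirectoryReader is a segment reader, which
        // implements CodecReader; addIndexes re-encodes the data with the
        // destination writer's codec.
        CodecReader[] leaves = reader.leaves().stream()
            .map(ctx -> (CodecReader) ctx.reader())
            .toArray(CodecReader[]::new);
        writer.addIndexes(leaves);
      }
    }
  }
}
```

The same shape works in reverse after the upgrade, with the custom codec set
on the destination writer's `IndexWriterConfig`.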

> does it make sense to expose related settings so users can tune the
compression without copying several internal classes?

Lucene exposes ways to customize stored fields; look at the constructor of
`Lucene90CompressingStoredFieldsFormat`, for instance, which allows
configuring block sizes, compression strategies, etc. These classes are
considered internal, so the API is not stable, but they can be used to
avoid copying lots of code from Lucene's stored fields format.
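As an illustration of what that could look like (my sketch, not an official
recipe): a `FilterCodec` that delegates everything to the current default
codec except stored fields, which it builds from the internal constructor
with lighter settings. The format name, class name, and all the numeric
parameters below are illustrative assumptions to tune for your workload,
and the codec must be registered via SPI (META-INF/services) so indexes
written with it can be read back:

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.lucene92.Lucene92Codec;

public final class LightStoredFieldsCodec extends FilterCodec {

  public LightStoredFieldsCodec() {
    // Delegate all other formats to the default codec for Lucene 9.2.
    super("LightStoredFieldsCodec", new Lucene92Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    // Smaller chunks and FAST compression trade index size for retrieval
    // latency on small-document, large-result-set workloads.
    return new Lucene90CompressingStoredFieldsFormat(
        "LightStoredFields", // illustrative format name
        CompressionMode.FAST,
        16 * 1024,           // chunk size in bytes (assumed value)
        128,                 // max docs per chunk (assumed value)
        10);                 // block shift for the chunk index (assumed)
  }
}
```

Since these classes are internal, expect to revisit this code on every
Lucene upgrade; the constructor arguments are not a stable API.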

The consensus is that stored fields of the default codec shouldn't expose
more tuning options than BEST_SPEED/BEST_COMPRESSION. This is already quite
a burden in terms of testing and backward compatibility. The idea of
exposing more tuning options has been brought up a few times and rejected.

Not directly related to your question, but possibly still of interest to
you:
 - We're now tracking the performance of stored fields on small documents
nightly:
http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
 - If you're seeing a 30% performance degradation with recent changes to
stored fields, there is a good chance that you could improve the
performance of this workload significantly with a custom codec that is
lighter on compression.


On Tue, Jun 7, 2022 at 1:32 AM Alexander Lukyanchikov <
[email protected]> wrote:

> Hello everyone,
> We are in the process of upgrading from Lucene 8.5.0 and on the latest
> version our query performance tests show significant latency degradation
> for one of the important use cases. In this test, each query retrieves a
> relatively large dataset of 40k documents with a small stored fields
> payload (< 100 bytes per doc).
> It looks like the change which affects this use case was introduced in
> LUCENE-9486 <https://issues.apache.org/jira/browse/LUCENE-9486> (Lucene
> 8.7), on this version our tests show almost 3 times higher latency. Later
> in LUCENE-9917 <https://issues.apache.org/jira/browse/LUCENE-9917> block
> size for BEST_SPEED was reduced and since Lucene 8.10 we see about 30%
> degradation.
>
> It is still a significant performance regression, and in our case query
> latency is more important than index size. Unless I'm missing something,
> the only way to fix that today is to introduce our own Codec,
> StoredFieldsFormat and CompressionMode - an experiment with disabled preset
> dict and lower block size showed that these changes allow to achieve query
> latency we need on Lucene 9.2. While it can solve the problem, there is a
> concern about maintaining our own version of the codec and having more
> complicated upgrades in the future.
>
> Are there any less obvious ways to improve the situation for this use
> case? If not, does it make sense to expose related settings so users can
> tune the compression without copying several internal classes?
>
> Thank you,
> Alex
>


-- 
Adrien
