I think we could do this at the Codec level?

For example, for stored fields, the current default format
(Lucene50StoredFieldsFormat) has two modes, Mode.BEST_SPEED and
Mode.BEST_COMPRESSION, that are easy for the user to pick.  Both modes use
compression, just at varying levels.

I think for the (new) Lucene84DocValuesFormat, which looks like it will
always compress binary DVs, we could similarly add a Mode, maybe with two
options, COMPRESSED and UNCOMPRESSED?
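
To make the idea concrete, here is a minimal sketch of what such a Mode
switch could look like.  It is plain JDK Java, not Lucene code: BinaryDVMode
and BinaryValueStore are hypothetical names invented for illustration, and
java.util.zip's Deflater only stands in for whatever compression the real
format uses.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only: none of these names exist in Lucene.
enum BinaryDVMode {
  COMPRESSED,   // smaller on disk / in page cache, pays CPU on every read
  UNCOMPRESSED  // larger, but reads are a plain byte copy
}

final class BinaryValueStore {
  private final BinaryDVMode mode;

  BinaryValueStore(BinaryDVMode mode) {
    this.mode = mode;
  }

  // Encode one binary value according to the configured mode.
  byte[] encode(byte[] value) {
    if (mode == BinaryDVMode.UNCOMPRESSED) {
      return value.clone();
    }
    Deflater deflater = new Deflater();
    try {
      deflater.setInput(value);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[256];
      while (!deflater.finished()) {
        int n = deflater.deflate(buf);
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } finally {
      deflater.end(); // release native zlib resources
    }
  }

  // Decode a value previously produced by encode() in the same mode.
  byte[] decode(byte[] stored) {
    if (mode == BinaryDVMode.UNCOMPRESSED) {
      return stored.clone();
    }
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(stored);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[256];
      while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new IllegalStateException("truncated or corrupt input");
        }
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt compressed value", e);
    } finally {
      inflater.end();
    }
  }
}
```

The point of the sketch is only that the choice is made once, when the
store is constructed, so the read path has no per-value decision to make
beyond the one branch on the mode.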

This way it is fairly simple for users to create a custom Codec subclassing
the default Codec and pick the format they want.  And we can try to figure
out which mode should be the default.  Our usage (Amazon's customer-facing
product search) is admittedly unusual, relying heavily on BINARY doc values
performance per hit collected during matching.  Other search applications
might not see a 40% hit to their red-line throughput :)

Viral, could you please open a Jira issue to find a way to make this
configurable?  We can hash out the details on the issue ...

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 20, 2020 at 5:38 PM Michael Sokolov <[email protected]> wrote:

> I guess the compression we added to binary doc values, and for
> postings, seems to have hurt performance in a way that wasn't detected
> in testing when those changes were made, or if it was detected, I
> don't recall any discussion about the tradeoff being made. Now that we
> do see there is a tradeoff, I think we need to have that discussion
> though. I can see that having compression can be a nice win for
> indexes that are huge and may be memory bound, since it can help avoid
> I/O, but for a low-latency case where the index is already memory
> resident, we are willing to pay the price of a larger index to avoid
> the cost of decompression. I think we need to find some way of
> handling both cases. I think our design principle should be to expose
> as few knobs as we can, but in this case I don't see how the code can
> make the decision whether to compress or not, since it really depends
> on external design considerations (how big will the index grow? how
> much RAM will the servers have? what query latency is tolerable?)
> Given that, I think we should find a way to expose some kind of
> configurability. Maybe as a first step, rather than making this
> configurable for each DocValuesType, we could offer a global
> configuration in IndexWriterConfig (compressFields=true/false)?
>
> On Tue, May 19, 2020 at 1:05 AM David Smiley <[email protected]> wrote:
> >
> > I don't have a direct answer for you, but your message causes me to
> > reflect on how Lucene does *not* give users choice of format on a per-type
> > basis (e.g. BinaryDocValues vs NumericDocValues vs etc.), which is
> > annoying.  Ideally the previous simple format would be available for you to
> > choose, but it is not.  Lucene lets you mix & match PostingsFormats, stored
> > fields formats, term vectors formats, points format.  But when it comes to
> > DocValues, it's an all-encompassing format for five different structures.
> > So you take it or leave it; all or nothing.  My colleague filed
> > https://issues.apache.org/jira/browse/LUCENE-9236 on this matter; feel
> > free to comment there with your opinion if you have one.
> >
> > ~ David
> >
> >
> > On Mon, May 18, 2020 at 7:52 PM Viral Gandhi <[email protected]> wrote:
> >>
> >> Hi,
> >> I tried upgrading to Lucene 8.5.1 from 8.4 and ran our internal
> >> benchmarks. We noticed that with this upgrade our QPS dropped by more
> >> than 40%, and latencies were affected as well. After doing some profiling
> >> and reverting the LUCENE-9211 commit related to BinaryDocValues
> >> compression, we recovered ~30% of the loss. Has anyone encountered a
> >> similar situation?
> >>
> >> We rely on BinaryDocValues very heavily. Should this newly introduced
> >> compression be opt-in?
> >>
> >> Also, are there any other pointers on recovering the remaining 10% loss?
> >> When I run the benchmark on an 8.4 index with 8.5.1 code, performance is
> >> very similar to 8.4.
> >>
> >> Thanks,
> >> Viral Gandhi