Re: BinaryDocValues compression with 8.5.1

Michael McCandless Thu, 21 May 2020 11:39:32 -0700

Thanks Viral!

Mike McCandless


http://blog.mikemccandless.com


On Thu, May 21, 2020 at 2:21 PM Viral Gandhi <[email protected]> wrote:

> Thank you! Opened https://issues.apache.org/jira/browse/LUCENE-9378 to
> address this.
>
> Viral Gandhi
>
> On Wed, 20 May 2020 at 15:27, Michael McCandless <
> [email protected]> wrote:
>
>> I think we could do this at the Codec level?
>>
>> For example, for stored fields, the current default format
>> (Lucene50StoredFieldsFormat) has two modes, Mode.BEST_SPEED and
>> Mode.BEST_COMPRESSION, that are easy for the user to pick.  Both modes use
>> compression, just at varying levels.
>>
>> I think for the (new) Lucene84DocValuesFormat, which looks like it will
>> always compress binary DVs, we could similarly add a Mode, maybe with two
>> options, COMPRESSED and UNCOMPRESSED?
>>
>> This way it is fairly simple for users to create a custom Codec
>> subclassing the default Codec and pick the format they want.  And we can
>> try to figure out which way it should default.  Our (Amazon's
>> customer-facing product search) usage is admittedly unusual, heavily relying on
>> BINARY doc values performance per hit collected during matching.  Other
>> search applications might not see a 40% hit to their red-line throughput :)
>>
>> Viral, could you please open a Jira issue to find a way to make this
>> configurable?  We can hash out the details on the issue ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, May 20, 2020 at 5:38 PM Michael Sokolov <[email protected]>
>> wrote:
>>
>>> I guess the compression we added to binary doc values, and for
>>> postings, seems to have hurt performance in a way that wasn't detected
>>> in testing when those changes were made, or if it was detected, I
>>> don't recall any discussion about the tradeoff being made. Now that we
>>> do see there is a tradeoff, I think we need to have that discussion
>>> though. I can see that having compression can be a nice win for
>>> indexes that are huge and may be memory bound, since it can help avoid
>>> I/O, but for a low-latency case where the index is already memory
>>> resident, we are willing to pay the price of a larger index to avoid
>>> the cost of decompression. I think we need to find some way of
>>> handling both cases. I think our design principle should be to expose
>>> as few knobs as we can, but in this case I don't see how the code can
>>> make the decision whether to compress or not, since it really depends
>>> on external design considerations (how big will the index grow? how
>>> much RAM will the servers have? what query latency is tolerable?)
>>> Given that, I think we should find a way to expose some kind of
>>> configurability. Maybe as a first step, rather than making this
>>> configurable for each DocValuesType, we could offer a global
>>> configuration in IndexWriterConfig (compressFields=true/false)?
>>>
>>> On Tue, May 19, 2020 at 1:05 AM David Smiley <[email protected]>
>>> wrote:
>>> >
>>> > I don't have a direct answer for you, but your message causes me to
>>> reflect on how Lucene does *not* give users a choice of format on a per-type
>>> basis (e.g. BinaryDocValues vs NumericDocValues vs etc.), which is
>>> annoying.  Ideally the previous simple format would be available for you to
>>> choose, but it is not.  Lucene lets you mix & match PostingsFormats, stored
>>> fields formats, term vectors formats, points format.  But when it comes to
>>> DocValues, it's an all-encompassing format for five different structures.
>>> So you take it or leave it; all or nothing.  My colleague filed
>>> https://issues.apache.org/jira/browse/LUCENE-9236 on this matter; feel
>>> free to comment there with your opinion if you have one.
>>> >
>>> > ~ David
>>> >
>>> >
>>> > On Mon, May 18, 2020 at 7:52 PM Viral Gandhi <[email protected]>
>>> wrote:
>>> >>
>>> >> Hi,
>>> >> I tried upgrading from Lucene 8.4 to 8.5.1 and ran our internal
>>> benchmarks. We noticed that with this upgrade our QPS dropped by more than
>>> 40%, and latencies were affected as well. After some profiling, we reverted
>>> the LUCENE-9211 commit (BinaryDocValues compression) and recovered ~30% of
>>> the loss. Has anyone encountered a similar situation?
>>> >>
>>> >> We rely on BinaryDocValues very heavily. Should this newly introduced
>>> compression be opt-in?
>>> >>
>>> >> Also, any pointers on recovering the remaining 10% loss? When I run
>>> the benchmark on an 8.4 index with 8.5.1 code, performance is very similar
>>> to 8.4.
>>> >>
>>> >> Thanks,
>>> >> Viral Gandhi
>>>
>>>
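
The Mode idea discussed in this thread (COMPRESSED vs. UNCOMPRESSED for binary doc values) can be sketched in plain Java. This is only an illustration of the trade-off, not Lucene code: the class BinaryValueModeSketch, the Mode enum, and the encode/decode helpers are hypothetical names, and java.util.zip.Deflater is used as a stand-in compressor so the example runs with the JDK alone.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical illustration only; none of these names exist in Lucene. */
public class BinaryValueModeSketch {

    /** Hypothetical switch mirroring the COMPRESSED / UNCOMPRESSED proposal. */
    enum Mode { COMPRESSED, UNCOMPRESSED }

    /** Encodes one binary value; UNCOMPRESSED stores a plain copy. */
    static byte[] encode(byte[] value, Mode mode) {
        if (mode == Mode.UNCOMPRESSED) {
            return value.clone();
        }
        // Stand-in compressor (assumption): DEFLATE at its fastest level.
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(value);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decodes a stored value; only COMPRESSED pays the decompression cost. */
    static byte[] decode(byte[] stored, int originalLength, Mode mode) {
        if (mode == Mode.UNCOMPRESSED) {
            return stored.clone();
        }
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        byte[] result = new byte[originalLength];
        int off = 0;
        try {
            while (off < originalLength && !inflater.finished()) {
                int n = inflater.inflate(result, off, originalLength - off);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("stored value is truncated");
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("stored value is corrupt", e);
        } finally {
            inflater.end();
        }
        return result;
    }

    public static void main(String[] args) {
        // A highly repetitive value, chosen so that compression clearly helps.
        byte[] value = new byte[1024];
        Arrays.fill(value, (byte) 'a');
        for (Mode mode : Mode.values()) {
            byte[] stored = encode(value, mode);
            byte[] back = decode(stored, value.length, mode);
            System.out.println(mode + ": stored " + stored.length
                + " bytes, round trip ok = " + Arrays.equals(value, back));
        }
    }
}
```

The sketch shows why the choice is hard to automate: UNCOMPRESSED reads are a plain copy, while COMPRESSED trades CPU on every read for a smaller stored size, and which one wins depends on index size, available RAM, and latency targets, as discussed above. A real format would presumably also need to record the chosen mode alongside the data so that readers know how to decode it.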
