Thanks Viral!

Mike McCandless

http://blog.mikemccandless.com


On Thu, May 21, 2020 at 2:21 PM Viral Gandhi <[email protected]> wrote:

> Thank you! Opened https://issues.apache.org/jira/browse/LUCENE-9378 to
> address this.
>
> Viral Gandhi
>
> On Wed, 20 May 2020 at 15:27, Michael McCandless <
> [email protected]> wrote:
>
>> I think we could do this at the Codec level?
>>
>> For example, for stored fields, the current default format
>> (Lucene50StoredFieldsFormat) has two modes, Mode.BEST_SPEED and
>> Mode.BEST_COMPRESSION, that are easy for the user to pick. Both modes use
>> compression, just at varying levels.
>>
>> I think for the (new) Lucene84DocValuesFormat, which looks like it will
>> always compress binary DVs, we could similarly add a Mode, maybe with two
>> options, COMPRESSED and UNCOMPRESSED?
>>
>> This way it is fairly simple for users to create a custom Codec
>> subclassing the default Codec and pick the format they want. And we can
>> try to figure out which way it should default. Our (Amazon's customer
>> facing product search) usage is admittedly unusual, heavily relying on
>> BINARY doc values performance per hit collected during matching. Other
>> search applications might not see a 40% hit to their red-line throughput :)
>>
>> Viral could you please open a Jira issue to find a way to make this
>> configurable? We can hash out the details on the issue ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, May 20, 2020 at 5:38 PM Michael Sokolov <[email protected]>
>> wrote:
>>
>>> I guess the compression we added to binary doc values, and for
>>> postings, seems to have hurt performance in a way that wasn't detected
>>> in testing when those changes were made, or if it was detected, I
>>> don't recall any discussion about the tradeoff being made. Now that we
>>> do see there is a tradeoff, I think we need to have that discussion
>>> though.
>>> I can see that having compression can be a nice win for
>>> indexes that are huge and may be memory bound, since it can help avoid
>>> I/O, but for a low-latency case where the index is already memory
>>> resident, we are willing to pay the price of a larger index to avoid
>>> the cost of decompression. I think we need to find some way of
>>> handling both cases. I think our design principle should be to expose
>>> as few knobs as we can, but in this case I don't see how the code can
>>> make the decision whether to compress or not, since it really depends
>>> on external design considerations (how big will the index grow? how
>>> much RAM will the servers have? what query latency is tolerable?).
>>> Given that, I think we should find a way to expose some kind of
>>> configurability. Maybe as a first step, rather than making this
>>> configurable for each DocValuesType, we could offer a global
>>> configuration in IndexWriterConfig (compressFields=true/false)?
>>>
>>> On Tue, May 19, 2020 at 1:05 AM David Smiley <[email protected]>
>>> wrote:
>>> >
>>> > I don't have a direct answer for you, but your message causes me to
>>> > reflect on how Lucene does *not* give users a choice of format on a
>>> > per-type basis (e.g. BinaryDocValues vs NumericDocValues vs etc.),
>>> > which is annoying. Ideally the previous simple format would be
>>> > available for you to choose, but it is not. Lucene lets you mix &
>>> > match PostingsFormats, stored fields formats, term vectors formats,
>>> > points format. But when it comes to DocValues, it's an
>>> > all-encompassing format for five different structures. So you take
>>> > it or leave it; all or nothing. My colleague filed
>>> > https://issues.apache.org/jira/browse/LUCENE-9236 on this matter;
>>> > feel free to comment there with your opinion if you have one.
>>> >
>>> > ~ David
>>> >
>>> >
>>> > On Mon, May 18, 2020 at 7:52 PM Viral Gandhi <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi,
>>> >> I tried upgrading to Lucene 8.5.1 from 8.4 and ran our internal
>>> >> benchmarking. We noticed that with this upgrade our QPS dropped by
>>> >> more than 40% and latencies were also affected. After doing some
>>> >> profiling and reverting the LUCENE-9211 commit related to
>>> >> BinaryDocValues compression, we recovered ~30% of the loss. Has
>>> >> anyone encountered a similar situation?
>>> >>
>>> >> We rely on BinaryDocValues very heavily. Should this newly
>>> >> introduced compression be opt-in?
>>> >>
>>> >> Also, are there any pointers on recovering the remaining 10% loss?
>>> >> When I run the benchmark on an 8.4 index with 8.5.1 code,
>>> >> performance is very similar to 8.4.
>>> >>
>>> >> Thanks,
>>> >> Viral Gandhi
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>

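The Mode idea floated in the thread (a COMPRESSED / UNCOMPRESSED switch for binary doc values) can be illustrated with a small plain-Java sketch. This is only an illustration under stated assumptions: `BinaryValuesFormatSketch`, `encode` and `decode` are hypothetical names and not Lucene classes or methods, and `java.util.zip.Deflater` is a stand-in for whatever compression the real format applies. It shows the tradeoff discussed above: the uncompressed mode stores more bytes but needs no decompression work per value read.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch (not a Lucene class): a binary value format with a Mode switch. */
final class BinaryValuesFormatSketch {

    /** The two options proposed in the thread. */
    enum Mode { COMPRESSED, UNCOMPRESSED }

    private final Mode mode;

    BinaryValuesFormatSketch(Mode mode) {
        this.mode = mode;
    }

    /** Encodes one value; UNCOMPRESSED simply stores a copy of the raw bytes. */
    byte[] encode(byte[] value) {
        if (mode == Mode.UNCOMPRESSED) {
            return value.clone();
        }
        // Assumption: Deflater stands in for the real format's compression.
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            deflater.setInput(value);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Decodes one value; only COMPRESSED pays a decompression cost per read. */
    byte[] decode(byte[] stored) {
        if (mode == Mode.UNCOMPRESSED) {
            return stored.clone();
        }
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(stored);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported compressed value");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed value", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] value = new byte[1024]; // all zeros, so highly compressible
        for (Mode m : Mode.values()) {
            BinaryValuesFormatSketch format = new BinaryValuesFormatSketch(m);
            byte[] stored = format.encode(value);
            boolean roundTrips = Arrays.equals(value, format.decode(stored));
            System.out.println(m + ": stored " + stored.length + " bytes, round trip ok = " + roundTrips);
        }
    }
}
```

In Lucene itself the choice would be made when constructing the format inside a custom Codec, as suggested in the thread; the details were left to be worked out on LUCENE-9378.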