Thank you! Opened https://issues.apache.org/jira/browse/LUCENE-9378 to address this.
Viral Gandhi

On Wed, 20 May 2020 at 15:27, Michael McCandless wrote:

> I think we could do this at the Codec level?
>
> For example, for stored fields, the current default format
> (Lucene50StoredFieldsFormat) has two modes, Mode.BEST_SPEED and
> Mode.BEST_COMPRESSION, that are easy for the user to pick. Both modes use
> compression, just at varying levels.
>
> I think for the (new) Lucene84DocValuesFormat, which looks like it will
> always compress binary DVs, we could similarly add a Mode, maybe with two
> options, COMPRESSED and UNCOMPRESSED?
>
> This way it is fairly simple for users to create a custom Codec
> subclassing the default Codec and pick the format they want. And we can
> try to figure out which way it should default. Our (Amazon's customer
> facing product search) usage is admittedly unusual, heavily relying on
> BINARY doc values performance per hit collected during matching. Other
> search applications might not see a 40% hit to their red-line throughput :)
>
> Viral could you please open a Jira issue to find a way to make this
> configurable? We can hash out the details on the issue ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, May 20, 2020 at 5:38 PM Michael Sokolov wrote:
>
>> I guess the compression we added to binary doc values, and for
>> postings, seems to have hurt performance in a way that wasn't detected
>> in testing when those changes were made, or if it was detected, I
>> don't recall any discussion about the tradeoff being made. Now that we
>> do see there is a tradeoff, I think we need to have that discussion
>> though. I can see that having compression can be a nice win for
>> indexes that are huge and may be memory bound, since it can help avoid
>> I/O, but for a low-latency case where the index is already memory
>> resident, we are willing to pay the price of a larger index to avoid
>> the cost of decompression. I think we need to find some way of
>> handling both cases. I think our design principle should be to expose
>> as few knobs as we can, but in this case I don't see how the code can
>> make the decision whether to compress or not, since it really depends
>> on external design considerations (how big will the index grow? how
>> much RAM will the servers have? what query latency is tolerable?)
>> Given that, I think we should find a way to expose some kind of
>> configurability. Maybe as a first step, rather than making this
>> configurable for each DocValuesType, we could offer a global
>> configuration in IndexWriterConfig (compressFields=true/false)?
>>
>> On Tue, May 19, 2020 at 1:05 AM David Smiley wrote:
>> >
>> > I don't have a direct answer for you, but your message causes me to
>> > reflect on how Lucene does *not* give users choice of format on a
>> > per-type basis (e.g. BinaryDocValues vs NumericDocValues vs etc.),
>> > which is annoying. Ideally the previous simple format would be
>> > available for you to choose, but it is not. Lucene lets you mix &
>> > match PostingsFormats, stored fields formats, term vectors formats,
>> > points format. But when it comes to DocValues, it's an
>> > all-encompassing format for five different structures. So you take it
>> > or leave it; all or nothing. My colleague filed
>> > https://issues.apache.org/jira/browse/LUCENE-9236 on this matter; feel
>> > free to comment there with your opinion if you have one.
>> >
>> > ~ David
>> >
>> >
>> > On Mon, May 18, 2020 at 7:52 PM Viral Gandhi wrote:
>> >>
>> >> Hi,
>> >> I tried upgrading to Lucene 8.5.1 from 8.4 and ran our internal
>> >> benchmarking. We noticed that with this upgrade our QPS dropped by
>> >> more than 40%, and latencies were affected as well. After doing some
>> >> profiling and reverting the LUCENE-9211 commit related to
>> >> BinaryDocValues compression, we recovered ~30% of the loss. Has
>> >> anyone encountered a similar situation?
>> >>
>> >> We rely on BinaryDocValues very heavily. Should this newly introduced
>> >> compression be opt-in?
>> >>
>> >> Also, any other pointers on recovering the remaining 10% loss? When I
>> >> run the benchmark on an 8.4 index with 8.5.1 code, performance is
>> >> very similar to 8.4.
>> >>
>> >> Thanks,
>> >> Viral Gandhi
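
For readers following the thread, here is a rough illustration of the tradeoff behind the proposed COMPRESSED / UNCOMPRESSED switch. It is a self-contained sketch in plain Java and does not use any Lucene classes: `BinaryDocValuesModeSketch`, its `Mode` enum, `encode` and `decode` are all hypothetical names invented for this example, and `java.util.zip.Deflater` is only a stand-in for whatever compression the real doc values format applies. The point is simply that one mode stores fewer bytes but has to do work on every read, while the other stores the value verbatim.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch only; none of these names are Lucene APIs. */
public class BinaryDocValuesModeSketch {

    /** Hypothetical switch mirroring the COMPRESSED / UNCOMPRESSED idea from the thread. */
    public enum Mode { COMPRESSED, UNCOMPRESSED }

    /** Encodes one binary value according to the chosen mode. */
    public static byte[] encode(byte[] value, Mode mode) {
        if (mode == Mode.UNCOMPRESSED) {
            return value.clone(); // stored verbatim: larger, but nothing to undo at read time
        }
        Deflater deflater = new Deflater(Deflater.BEST_SPEED); // stand-in compressor (assumption)
        deflater.setInput(value);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decodes a stored value; the compressed mode pays a decompression cost on every call. */
    public static byte[] decode(byte[] stored, Mode mode, int originalLength) {
        if (mode == Mode.UNCOMPRESSED) {
            return stored.clone();
        }
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        byte[] result = new byte[originalLength];
        int offset = 0;
        try {
            while (offset < originalLength) {
                int n = inflater.inflate(result, offset, originalLength - offset);
                if (n == 0) {
                    break; // no progress: input exhausted or stream finished
                }
                offset += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed value", e);
        } finally {
            inflater.end();
        }
        if (offset != originalLength) {
            throw new IllegalStateException("decoded " + offset + " bytes, expected " + originalLength);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] value = new byte[1024];
        Arrays.fill(value, (byte) 'a'); // highly repetitive payload, so it compresses well
        for (Mode mode : Mode.values()) {
            byte[] stored = encode(value, mode);
            byte[] restored = decode(stored, mode, value.length);
            System.out.println(mode + ": stored " + stored.length
                + " bytes, round trip ok = " + Arrays.equals(value, restored));
        }
    }
}
```

In a real codec the mode would presumably be fixed when the format is constructed and recorded with the segment, so readers know how to decode; how that should look in Lucene is what LUCENE-9378 is meant to work out.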
