[ https://issues.apache.org/jira/browse/LUCENE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140163#comment-17140163 ]
Alex Klibisz commented on LUCENE-9378:
--------------------------------------

[~jpountz] It's about 2x slower. I re-ran a benchmark to be sure. Here is the setup:

* Storing a corpus of 18K binary vectors in a single shard.
* Each vector contains ~500 ints denoting the positive indices, so each one stores a byte array of 500 * 4 = 2000 bytes in the binary doc values.
* Running 2000 serial searches against these vectors. Each search reads, deserializes, and computes the Jaccard similarity against every vector in the corpus, for a total of 18K * 2K reads from the shard (a simplified sketch of this loop follows at the end of this comment).
* The read order is defined by Elasticsearch. Internally I'm using a FunctionScoreQuery, code here: [https://github.com/alexklibisz/elastiknn/blob/5246a26f76791362482a98066e31071cb03e0a74/plugin/src/main/scala/com/klibisz/elastiknn/query/ExactQuery.scala#L22-L29]
* Ubuntu 20 on an Intel i7-8750H 2.20GHz x 12 cores
* Running Oracle JDK 14:
```
$ java -version
java version "14" 2020-03-17
Java(TM) SE Runtime Environment (build 14+36-1461)
Java HotSpot(TM) 64-Bit Server VM (build 14+36-1461, mixed mode, sharing)
```
* Running all 2000 searches once, then again, and reporting the time from the second run (JVM warmup, etc.).

Results:

* Using Elasticsearch 7.6.2 w/ Lucene 8.4.0:
** 212 seconds for 2000 searches
** Search threads spend 95.5% of their time computing similarities and 0.2% in the LZ4.decompress() method.
* Using Elasticsearch 7.7.1 w/ Lucene 8.5.1:
** 445 seconds for 2000 searches
** Search threads spend 56% of their time computing similarities and 42% in the decompress method.

VisualVM screenshot for 7.6.x: !hotspots-v76x.png!
VisualVM screenshot for 7.7.x: !hotspots-v77x.png!

Attaching snapshots from VisualVM: [^snapshots-v76x.nps] [^snapshot-v77x.nps]

Thank you all for your help! :)
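For context on what each of those 18K * 2K reads involves, here is a simplified Java sketch of the per-document scoring loop described above. It is illustrative only, not the actual elastiknn code linked earlier: the field name ("vec"), the int encoding, and the class and method names are assumptions made for the sketch.

```
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

// Illustrative sketch only. Assumes each doc stores its ~500 sorted positive
// indices as big-endian ints in a binary doc values field named "vec"
// (a made-up field name).
final class ExactJaccardSketch {

  /** Decode one document's stored bytes back into its int indices. */
  static int[] readIndices(BinaryDocValues dv, int docId) throws IOException {
    if (!dv.advanceExact(docId)) {
      return new int[0];
    }
    // On Lucene 8.5.x this read goes through the compressed doc values path.
    BytesRef ref = dv.binaryValue();
    ByteBuffer buf = ByteBuffer.wrap(ref.bytes, ref.offset, ref.length);
    int[] indices = new int[ref.length / Integer.BYTES];
    for (int i = 0; i < indices.length; i++) {
      indices[i] = buf.getInt();
    }
    return indices;
  }

  /** Jaccard similarity of two sorted index arrays: |A ∩ B| / |A ∪ B|. */
  static double jaccard(int[] a, int[] b) {
    int i = 0, j = 0, intersection = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) { intersection++; i++; j++; }
      else if (a[i] < b[j]) { i++; }
      else { j++; }
    }
    int union = a.length + b.length - intersection;
    return union == 0 ? 0.0 : (double) intersection / union;
  }

  /** Score every document in one segment against the query vector, as each search does. */
  static double[] scoreAll(LeafReader reader, int[] queryIndices) throws IOException {
    BinaryDocValues dv = reader.getBinaryDocValues("vec");
    double[] scores = new double[reader.maxDoc()];
    if (dv == null) {
      return scores;
    }
    for (int docId = 0; docId < reader.maxDoc(); docId++) {
      scores[docId] = jaccard(queryIndices, readIndices(dv, docId));
    }
    return scores;
  }
}
```

The binaryValue() read in this loop is the code path where the extra time in LZ4.decompress() shows up in the 7.7.1 profile above.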
> Configurable compression for BinaryDocValues
> ---------------------------------------------
>
>                 Key: LUCENE-9378
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9378
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Viral Gandhi
>            Priority: Minor
>         Attachments: hotspots-v76x.png, hotspots-v77x.png, image-2020-06-12-22-17-30-339.png, image-2020-06-12-22-17-53-961.png, image-2020-06-12-22-18-24-527.png, image-2020-06-12-22-18-48-919.png, snapshot-v77x.nps, snapshots-v76x.nps
>
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Lucene 8.5.1 includes a change to always [compress BinaryDocValues|https://issues.apache.org/jira/browse/LUCENE-9211]. This caused a ~30% reduction in our red-line QPS (throughput).
> We think users should be given some way to opt in to this compression feature instead of it always being enabled, which can have a substantial query-time cost, as we saw during our upgrade. [~mikemccand] suggested one possible approach: introducing a *mode* in Lucene80DocValuesFormat (COMPRESSED and UNCOMPRESSED) and allowing users to create a custom Codec, subclassing the default Codec, to pick the format they want.
> The idea is similar to Lucene50StoredFieldsFormat, which has two modes, Mode.BEST_SPEED and Mode.BEST_COMPRESSION.
>
> Here's a related issue for adding a benchmark covering BINARY doc values query-time performance: [https://github.com/mikemccand/luceneutil/issues/61]
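To make the quoted proposal slightly more concrete, below is a minimal Java sketch of the existing extension point it builds on: subclassing the default codec and overriding the per-field doc values format. The class name is made up, and the commented-out Mode constructor is the proposed, not-yet-existing API; only the subclassing pattern and the Lucene50StoredFieldsFormat analogy (whose BEST_SPEED/BEST_COMPRESSION mode is already selected via the Lucene84Codec constructor) exist today.

```
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat;
import org.apache.lucene.codecs.lucene84.Lucene84Codec;
import org.apache.lucene.index.IndexWriterConfig;

// Sketch of the extension point the proposal builds on: subclass the default
// codec and override the per-field doc values format. Today this can only
// return the (always-compressed) Lucene80DocValuesFormat; the proposal would
// add a mode argument, mirroring Lucene50StoredFieldsFormat's modes.
public class PerFieldDocValuesCodec extends Lucene84Codec {

  private final DocValuesFormat dvFormat = new Lucene80DocValuesFormat();
  // Proposed instead (does not exist yet):
  // private final DocValuesFormat dvFormat =
  //     new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.UNCOMPRESSED);

  @Override
  public DocValuesFormat getDocValuesFormatForField(String field) {
    // Could also switch on the field name to opt out of compression per field.
    return dvFormat;
  }

  public static IndexWriterConfig newConfig() {
    IndexWriterConfig config = new IndexWriterConfig();
    config.setCodec(new PerFieldDocValuesCodec());
    return config;
  }
}
```

With something along these lines, users hurt by the query-time cost of the LUCENE-9211 change could opt back into uncompressed binary doc values without patching Lucene.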