kaivalnp commented on PR #14863:
URL: https://github.com/apache/lucene/pull/14863#issuecomment-3273523849
Sorry for the delay here!
I ran the following benchmarks on 768d Cohere vectors for all vector
similarities, with 4-bit (compressed) and 7-bit quantization. I had to run
10k queries to get reliable results (I saw some variance with the default of
1k queries).
### `cosine`
`main`
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.544        3.103   3.102        0.999  200000   100      50       32        200     4 bits     14.45      13842.75             4          670.05       659.943       74.005       HNSW
 0.505        4.499   4.497        1.000  200000   100      50       32        200     7 bits     14.03      14257.20             4          745.36       733.185      147.247       HNSW
```
This PR
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.543        2.854   2.852        0.999  200000   100      50       32        200     4 bits     14.57      13724.95             4          670.06       659.943       74.005       HNSW
 0.506        3.978   3.976        0.999  200000   100      50       32        200     7 bits     13.41      14912.02             4          745.09       733.185      147.247       HNSW
```
### `dot_product`
`main`
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.528        3.522   3.520        1.000  200000   100      50       32        200     4 bits     14.03      14258.22             4          674.69       659.943       74.005       HNSW
 0.881        4.303   4.301        1.000  200000   100      50       32        200     7 bits     14.41      13880.21             4          746.41       733.185      147.247       HNSW
```
This PR
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.528        3.218   3.217        1.000  200000   100      50       32        200     4 bits     13.60      14706.96             4          674.64       659.943       74.005       HNSW
 0.882        3.915   3.913        1.000  200000   100      50       32        200     7 bits     15.15      13205.68             4          746.44       733.185      147.247       HNSW
```
### `euclidean`
`main`
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.550        7.581   7.579        1.000  200000   100      50       32        200     4 bits     13.09      15284.68             4          667.46       659.943       74.005       HNSW
 0.936        3.938   3.937        1.000  200000   100      50       32        200     7 bits     12.88      15532.77             4          739.76       733.185      147.247       HNSW
```
This PR
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.550        2.422   2.420        0.999  200000   100      50       32        200     4 bits     13.27      15070.45             4          667.45       659.943       74.005       HNSW
 0.936        3.666   3.664        0.999  200000   100      50       32        200     7 bits     12.66      15796.54             4          739.73       733.185      147.247       HNSW
```
### `mip`
`main`
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.529        3.537   3.536        1.000  200000   100      50       32        200     4 bits     14.30      13988.95             4          674.69       659.943       74.005       HNSW
 0.882        4.280   4.278        1.000  200000   100      50       32        200     7 bits     14.18      14109.35             4          746.41       733.185      147.247       HNSW
```
This PR
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.529        3.332   3.330        0.999  200000   100      50       32        200     4 bits     13.89      14401.96             4          674.65       659.943       74.005       HNSW
 0.882        3.876   3.874        0.999  200000   100      50       32        200     7 bits     13.87      14423.77             4          746.43       733.185      147.247       HNSW
```
The speedup in vector search time for 4-bit `euclidean` (\~68% lower latency)
seems amazing: we used to decompress the packed bits into a `byte` and reuse
the same
[`squareDistance`](https://github.com/apache/lucene/blob/50a4f1864ef98f48abcfdd5202bd96693ee8b098/lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L789-L792)
function, which did not take into account that the inputs are confined to the
\[0, 15\] range, and knowing this enables some optimizations (a rough sketch
of the idea follows below).
We see a \~10% speedup in search time for everything else, while indexing is
largely unaffected.
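To illustrate the point, here is a minimal scalar sketch (not the PR's actual
code) of a square distance over 4-bit values stored one per byte; the comments
note the headroom a vectorized implementation gains once it knows the inputs
are in \[0, 15\]:
```java
// Hypothetical sketch, not Lucene's implementation: square distance over
// 4-bit quantized values stored one per byte, each guaranteed to be in [0, 15].
static int int4SquareDistance(byte[] a, byte[] b) {
  int sum = 0;
  for (int i = 0; i < a.length; i++) {
    int diff = a[i] - b[i]; // in [-15, 15]
    sum += diff * diff;     // at most 15 * 15 = 225 per element
  }
  return sum;
}
```
Since each term is at most 225, a SIMD version can accumulate partial sums in
16-bit lanes for well over a hundred elements (e.g. 145 * 225 = 32625 still
fits in a signed `short`) before widening to 32 bits, which the generic `byte`
`squareDistance` (terms up to 255 * 255) cannot assume.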
Sharing JMH benchmarks (these also check the functions for correctness):
```
java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh "VectorUtilBenchmark.binaryHalfByte*" -p size=1024
```
```
Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedScalar      1024  thrpt   15   2.378 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15   0.472 ± 0.002  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductScalar                1024  thrpt   15   2.378 ± 0.002  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   2.448 ± 0.005  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  16.180 ± 0.082  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.947 ± 0.045  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedScalar          1024  thrpt   15   1.642 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  14.142 ± 0.031  ops/us
VectorUtilBenchmark.binaryHalfByteSquareScalar                    1024  thrpt   15   2.463 ± 0.003  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   2.022 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  16.340 ± 0.039  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.749 ± 0.055  ops/us
```
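For context on the `SinglePacked` / `BothPacked` variants above: compressed
4-bit vectors store two values per byte, one in each nibble. Assuming a simple
low/high-nibble layout for illustration (the layout Lucene actually writes may
differ), a scalar dot product of an unpacked query against a packed document
vector looks roughly like:
```java
// Illustrative only -- assumes packedDoc[i] holds two 4-bit values, low nibble
// first; not necessarily the layout Lucene's compressed format uses.
static int halfByteDotProductSinglePacked(byte[] unpackedQuery, byte[] packedDoc) {
  int sum = 0;
  for (int i = 0; i < packedDoc.length; i++) {
    int lo = packedDoc[i] & 0x0F;        // first value in the pair
    int hi = (packedDoc[i] >> 4) & 0x0F; // second value in the pair
    sum += unpackedQuery[2 * i] * lo + unpackedQuery[2 * i + 1] * hi;
  }
  return sum;
}
```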