benwtrent commented on PR #15564:
URL: https://github.com/apache/lucene/pull/15564#issuecomment-3744384663

   > Would it be worth adding benchmarks for int4DibitDotProduct? I think it 
would be interesting to compare this to both 1 and 4 bit representations. I 
think we're right on the edge where it might be worth comparing a 2 bit doc and 
an 8 bit query -- using more bits doesn't help much at 1 bit but I feel like 
there's a chance it might be different at 2 bits.
   
   Here is a quick JMH. The 8-bit query is also transposed. Honestly, it seems the main reason this transposition gives us benefits in Panama Vector land is that masking and summing have been stupid slow on Panama in the past. Maybe we should revisit this for larger bit-sized queries, but for now I kept parity. I am looking into running an end-to-end, recall-focused test to see how it stacks up against dibit-nibble.
   
   ```
   Benchmark                                                         (size)   Mode  Cnt   Score   Error   Units
   ScalarQuantizationDotProductBenchmark.int4BitDotProductScalar       1024  thrpt   15  13.507 ± 0.564  ops/us
   ScalarQuantizationDotProductBenchmark.int4BitDotProductVector       1024  thrpt   15  51.669 ± 1.192  ops/us
   ScalarQuantizationDotProductBenchmark.int4DibitDotProductScalar     1024  thrpt   15   6.949 ± 0.113  ops/us
   ScalarQuantizationDotProductBenchmark.int4DibitDotProductVector     1024  thrpt   15  25.942 ± 1.250  ops/us
   ScalarQuantizationDotProductBenchmark.int4DotProductPackedScalar    1024  thrpt   15   2.875 ± 0.070  ops/us
   ScalarQuantizationDotProductBenchmark.int4DotProductPackedVector    1024  thrpt   15   2.440 ± 0.075  ops/us
   ScalarQuantizationDotProductBenchmark.int7DotProductScalar          1024  thrpt   15   2.875 ± 0.035  ops/us
   ScalarQuantizationDotProductBenchmark.int7DotProductVector          1024  thrpt   15   6.224 ± 0.137  ops/us
   ScalarQuantizationDotProductBenchmark.int8DibitDotProductScalar     1024  thrpt   15   3.490 ± 0.060  ops/us
   ScalarQuantizationDotProductBenchmark.int8DibitDotProductVector     1024  thrpt   15  12.860 ± 0.531  ops/us
   ScalarQuantizationDotProductBenchmark.uint8DotProductScalar         1024  thrpt   15   2.879 ± 0.060  ops/us
   ScalarQuantizationDotProductBenchmark.uint8DotProductVector         1024  thrpt   15   6.105 ± 0.299  ops/us
   ```
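   For anyone following along, the transposed bit-plane trick the dibit kernels rely on can be sketched in plain scalar Java like this. This is a minimal illustration, not Lucene's actual implementation; the class and method names here are made up, and the plane layout (one `long[]` per bit, one bit per dimension) is just an assumption for the sketch:

   ```java
   // Sketch: dot product between a 2-bit (dibit) doc vector and a 4-bit query,
   // both stored as transposed bit-planes. plane[i] holds bit i of every
   // dimension, packed one bit per dimension across longs.
   // Class/method names are illustrative, not from the Lucene codebase.
   public class DibitDotProductSketch {
     // d: 2 bit-planes of the doc; q: 4 bit-planes of the query.
     static long dibitInt4DotProduct(long[][] d, long[][] q) {
       long sum = 0;
       for (int i = 0; i < d.length; i++) {       // doc bit-planes (2)
         for (int j = 0; j < q.length; j++) {     // query bit-planes (4)
           long partial = 0;
           for (int w = 0; w < d[i].length; w++) {
             // AND selects dims where both bits are set; popcount sums them.
             partial += Long.bitCount(d[i][w] & q[j][w]);
           }
           sum += partial << (i + j);             // weight by combined bit position
         }
       }
       return sum;
     }
   }
   ```

   The point being that the whole kernel reduces to AND + popcount per plane pair, which is exactly where Panama's historically slow masking/summing would otherwise hurt.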
   
   I really feel like we are leaving perf on the table here. But maybe the administration and scaling of HNSW buys us enough (e.g., dibit-byte might explore less since graph quality and scores are better...).


