costin opened a new pull request, #16283:
URL: https://github.com/apache/lucene/pull/16283

   When the stored values have fixed cardinality and no encoding transforms (no 
gcd, delta, table, or block compression), the vectorization provider loads N 
values into a SIMD vector, performs a broadcast range check (`>= min` AND 
`<= max`), collapses per-lane results into a per-doc mask, and OR-writes matching 
docs into the bitset in one operation.
   
   The code falls back to the scalar path when `vectorLen % cardinality != 0` 
(e.g. 8 values per doc, vpd=8, on AVX2 with 4-lane vectors).
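
   The steps above can be modelled in plain Java to make the lane grouping 
concrete. This is only a sketch under stated assumptions: it emulates the lanes 
with a loop rather than using the Vector API, it assumes a doc-major `long[]` 
with docs numbered from 0, and it assumes a doc matches when any of its values 
falls in `[min, max]`. The class and method names are hypothetical and are not 
taken from the PR.

```java
import java.util.BitSet;

/** Plain-Java model of the lane-wise range check; it does not use the Vector API. */
final class FixedCardinalityRangeSketch {

  /**
   * values: doc-major array, `cardinality` longs per doc, docs numbered from 0 (assumption).
   * Sets bit d in `result` when any value of doc d lies in [min, max] (assumed semantics).
   */
  static void rangeMatch(long[] values, int cardinality, int vectorLen,
                         long min, long max, BitSet result) {
    int docCount = values.length / cardinality;
    int doc = 0;
    if (vectorLen % cardinality == 0) {
      int docsPerVector = vectorLen / cardinality;
      boolean[] laneHit = new boolean[vectorLen];
      for (; doc + docsPerVector <= docCount; doc += docsPerVector) {
        int base = doc * cardinality;
        // "Load" vectorLen values and compare each lane against the broadcast bounds.
        for (int lane = 0; lane < vectorLen; lane++) {
          long v = values[base + lane];
          laneHit[lane] = v >= min && v <= max;
        }
        // Collapse each group of `cardinality` lanes into one bit per doc.
        long docMask = 0L;
        for (int d = 0; d < docsPerVector; d++) {
          for (int k = 0; k < cardinality; k++) {
            if (laneHit[d * cardinality + k]) {
              docMask |= 1L << d;
            }
          }
        }
        // OR the per-doc mask into the result bitset.
        for (int d = 0; d < docsPerVector; d++) {
          if ((docMask & (1L << d)) != 0) {
            result.set(doc + d);
          }
        }
      }
    }
    // Scalar path: the whole input when vectorLen % cardinality != 0, otherwise just the tail.
    for (; doc < docCount; doc++) {
      for (int k = 0; k < cardinality; k++) {
        long v = values[doc * cardinality + k];
        if (v >= min && v <= max) {
          result.set(doc);
          break;
        }
      }
    }
  }
}
```

   For example, with `cardinality=2` and `vectorLen=4`, each emulated vector 
covers two docs; with `vectorLen=3` the modulo check fails and every doc goes 
through the scalar loop, which should produce the same bitset.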
   
   ### Benchmark
   
   SortedNumericDocValuesRangeQueryBenchmark, 1M docs, `cardinality=fixed`, 
`density=dense`, `queryShape=plain`. Branch vs `main`, JDK 25.0.3.
   
   #### AMD EPYC 7R32 (c5a.2xlarge) — AVX2, 256-bit (4 longs)
   
   SIMD: vpd=2 (2 docs/vec), vpd=4 (1 doc/vec). vpd=8 falls back to scalar.
   
   | vpd | pattern | selectivity | baseline (ops/s) | candidate (ops/s) | ratio |
   |---:|---|---:|---:|---:|---:|
   | 2 | clustered | 0.01 | 15426.5 | 15565.0 | 1.01x |
   | 2 | clustered | 0.1 | 15331.1 | 15431.7 | 1.01x |
   | 2 | clustered | 0.5 | 19083.7 | 19168.8 | 1.00x |
   | 2 | random | 0.01 | 65.2 | 134.1 | **2.06x** |
   | 2 | random | 0.1 | 58.7 | 95.8 | **1.63x** |
   | 2 | random | 0.5 | 65.1 | 86.7 | **1.33x** |
   | 4 | clustered | 0.01 | 9830.6 | 9843.3 | 1.00x |
   | 4 | clustered | 0.1 | 9676.3 | 9852.7 | 1.02x |
   | 4 | clustered | 0.5 | 18409.4 | 18256.0 | 0.99x |
   | 4 | random | 0.01 | 52.3 | 70.2 | **1.34x** |
   | 4 | random | 0.1 | 46.3 | 49.7 | **1.07x** |
   | 4 | random | 0.5 | 53.2 | 64.3 | **1.21x** |
   | 8 | clustered | 0.01 | 5925.1 | 5891.1 | 0.99x |
   | 8 | clustered | 0.1 | 5830.6 | 5845.9 | 1.00x |
   | 8 | clustered | 0.5 | 17669.9 | 17599.9 | 1.00x |
   | 8 | random | 0.01 | 42.1 | 43.7 | 1.04x |
   | 8 | random | 0.1 | 38.6 | 40.9 | 1.06x |
   | 8 | random | 0.5 | 44.2 | 50.2 | **1.14x** |
   
   #### Intel Xeon 8375C (c6i.2xlarge) — AVX-512, 512-bit (8 longs)
   
   SIMD: vpd=2 (4 docs/vec), vpd=4 (2 docs/vec), vpd=8 (1 doc/vec).
   
   | vpd | pattern | selectivity | baseline (ops/s) | candidate (ops/s) | ratio |
   |---:|---|---:|---:|---:|---:|
   | 2 | clustered | 0.01 | 19255.8 | 19439.4 | 1.01x |
   | 2 | clustered | 0.1 | 18689.0 | 19133.6 | 1.02x |
   | 2 | clustered | 0.5 | 22700.0 | 22745.1 | 1.00x |
   | 2 | random | 0.01 | 83.3 | 208.4 | **2.50x** |
   | 2 | random | 0.1 | 67.6 | 133.4 | **1.97x** |
   | 2 | random | 0.5 | 65.7 | 127.8 | **1.94x** |
   | 4 | clustered | 0.01 | 11715.4 | 11658.7 | 1.00x |
   | 4 | clustered | 0.1 | 11791.0 | 11727.0 | 0.99x |
   | 4 | clustered | 0.5 | 20998.0 | 21080.8 | 1.00x |
   | 4 | random | 0.01 | 63.5 | 104.1 | **1.64x** |
   | 4 | random | 0.1 | 50.9 | 65.9 | **1.29x** |
   | 4 | random | 0.5 | 61.5 | 94.7 | **1.54x** |
   | 8 | clustered | 0.01 | 7133.1 | 7202.5 | 1.01x |
   | 8 | clustered | 0.1 | 6956.8 | 6995.9 | 1.01x |
   | 8 | clustered | 0.5 | 20338.1 | 20369.2 | 1.00x |
   | 8 | random | 0.01 | 47.2 | 53.5 | **1.13x** |
   | 8 | random | 0.1 | 43.2 | 38.1 | **0.88x** |
   | 8 | random | 0.5 | 51.6 | 54.1 | 1.05x |
   
   Clustered data shows no change since sequential access is already at L1/L2 
cache speed; comparison cost is negligible. Wins appear on random data where 
per-doc cache misses dominate and SIMD batching amortizes comparison overhead. 
   
   Gains scale with `docsPerVector`: vpd=2 on AVX-512 processes 4 docs per 
vector (best), while vpd=8 on AVX2 falls back to scalar (no gain).
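
   The `docsPerVector` figures quoted for each machine follow from the lane 
count and the values per doc. A minimal sketch of that arithmetic, assuming the 
fallback rule `vectorLen % cardinality != 0` from the description (the class 
and method names are hypothetical, and returning 0 for the scalar fallback is 
a convention chosen here for illustration):

```java
final class DocsPerVectorSketch {
  /** Returns how many docs fit in one vector, or 0 when the scalar fallback applies. */
  static int docsPerVector(int vectorLen, int cardinality) {
    return vectorLen % cardinality == 0 ? vectorLen / cardinality : 0;
  }
}
```

   With 4 lanes (AVX2) this gives 2, 1, and 0 for vpd=2, 4, and 8; with 8 lanes 
(AVX-512) it gives 4, 2, and 1, matching the per-machine notes above the 
tables.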


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

