costin opened a new pull request, #16283: URL: https://github.com/apache/lucene/pull/16283
When the stored values have fixed cardinality and no encoding transforms (no gcd, delta, table, or block compression), the vectorization provider loads N values into a SIMD vector, performs a broadcast range check (`>= min` AND `<= max`), collapses per-lane results into a per-doc mask, and OR-writes matching docs into the bitset in one operation. It falls back to scalar when `vectorLen % cardinality != 0` (e.g. vpd=8, i.e. 8 values per doc, on AVX2 with 4-lane vectors).

### Benchmark

SortedNumericDocValuesRangeQueryBenchmark, 1M docs, `cardinality=fixed`, `density=dense`, `queryShape=plain`. Branch vs `main`, JDK 25.0.3.

#### AMD EPYC 7R32 (c5a.2xlarge): AVX2, 256-bit (4 longs)

SIMD: vpd=2 (2 docs/vec), vpd=4 (1 doc/vec). vpd=8 falls back to scalar.

| vpd | pattern | selectivity | baseline (ops/s) | candidate (ops/s) | ratio |
|---:|---|---:|---:|---:|---:|
| 2 | clustered | 0.01 | 15426.5 | 15565.0 | 1.01x |
| 2 | clustered | 0.1 | 15331.1 | 15431.7 | 1.01x |
| 2 | clustered | 0.5 | 19083.7 | 19168.8 | 1.00x |
| 2 | random | 0.01 | 65.2 | 134.1 | **2.06x** |
| 2 | random | 0.1 | 58.7 | 95.8 | **1.63x** |
| 2 | random | 0.5 | 65.1 | 86.7 | **1.33x** |
| 4 | clustered | 0.01 | 9830.6 | 9843.3 | 1.00x |
| 4 | clustered | 0.1 | 9676.3 | 9852.7 | 1.02x |
| 4 | clustered | 0.5 | 18409.4 | 18256.0 | 0.99x |
| 4 | random | 0.01 | 52.3 | 70.2 | **1.34x** |
| 4 | random | 0.1 | 46.3 | 49.7 | **1.07x** |
| 4 | random | 0.5 | 53.2 | 64.3 | **1.21x** |
| 8 | clustered | 0.01 | 5925.1 | 5891.1 | 0.99x |
| 8 | clustered | 0.1 | 5830.6 | 5845.9 | 1.00x |
| 8 | clustered | 0.5 | 17669.9 | 17599.9 | 1.00x |
| 8 | random | 0.01 | 42.1 | 43.7 | 1.04x |
| 8 | random | 0.1 | 38.6 | 40.9 | 1.06x |
| 8 | random | 0.5 | 44.2 | 50.2 | **1.14x** |

#### Intel Xeon 8375C (c6i.2xlarge): AVX-512, 512-bit (8 longs)

SIMD: vpd=2 (4 docs/vec), vpd=4 (2 docs/vec), vpd=8 (1 doc/vec).

| vpd | pattern | selectivity | baseline (ops/s) | candidate (ops/s) | ratio |
|---:|---|---:|---:|---:|---:|
| 2 | clustered | 0.01 | 19255.8 | 19439.4 | 1.01x |
| 2 | clustered | 0.1 | 18689.0 | 19133.6 | 1.02x |
| 2 | clustered | 0.5 | 22700.0 | 22745.1 | 1.00x |
| 2 | random | 0.01 | 83.3 | 208.4 | **2.50x** |
| 2 | random | 0.1 | 67.6 | 133.4 | **1.97x** |
| 2 | random | 0.5 | 65.7 | 127.8 | **1.94x** |
| 4 | clustered | 0.01 | 11715.4 | 11658.7 | 1.00x |
| 4 | clustered | 0.1 | 11791.0 | 11727.0 | 0.99x |
| 4 | clustered | 0.5 | 20998.0 | 21080.8 | 1.00x |
| 4 | random | 0.01 | 63.5 | 104.1 | **1.64x** |
| 4 | random | 0.1 | 50.9 | 65.9 | **1.29x** |
| 4 | random | 0.5 | 61.5 | 94.7 | **1.54x** |
| 8 | clustered | 0.01 | 7133.1 | 7202.5 | 1.01x |
| 8 | clustered | 0.1 | 6956.8 | 6995.9 | 1.01x |
| 8 | clustered | 0.5 | 20338.1 | 20369.2 | 1.00x |
| 8 | random | 0.01 | 47.2 | 53.5 | **1.13x** |
| 8 | random | 0.1 | 43.2 | 38.1 | **0.88x** |
| 8 | random | 0.5 | 51.6 | 54.1 | 1.05x |

Clustered data shows no change: sequential access already runs at L1/L2 cache speed, so comparison cost is negligible. Wins appear on random data, where per-doc cache misses dominate and SIMD batching amortizes comparison overhead. The one exception is vpd=8 at selectivity 0.1 on AVX-512, which measures 0.88x. Gains scale with `docsPerVector`: vpd=2 on AVX-512 processes 4 docs per vector (best), while vpd=8 on AVX2 falls back to scalar, so no SIMD gain is expected there.
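
For readers who want the batching logic spelled out, here is a minimal scalar sketch of the approach described at the top of this PR. It is not the code in the PR and does not use `jdk.incubator.vector`: one "vector" is emulated as a group of `lanes` consecutive longs. The class and method names (`FixedCardinalityRangeSketch`, `docsPerVector`, `matchDocs`, `orInto`) are hypothetical. It assumes values are stored contiguously with `valuesPerDoc` longs per doc, and that a doc matches when any of its values falls in `[min, max]`.

```java
final class FixedCardinalityRangeSketch {

    /** Docs covered by one vector, or 0 when the vector cannot hold whole docs (scalar fallback). */
    static int docsPerVector(int lanes, int valuesPerDoc) {
        return (lanes % valuesPerDoc == 0) ? lanes / valuesPerDoc : 0;
    }

    /** Returns a bitset (64 docs per long) with a bit set for each doc that has a value in [min, max]. */
    static long[] matchDocs(long[] values, int valuesPerDoc, int lanes, long min, long max) {
        int docCount = values.length / valuesPerDoc;
        long[] bits = new long[(docCount + 63) >>> 6];
        int dpv = docsPerVector(lanes, valuesPerDoc);
        int doc = 0;
        if (dpv > 0) {
            // Emulated vector loop: each iteration covers `lanes` values, i.e. `dpv` docs.
            for (; doc + dpv <= docCount; doc += dpv) {
                int base = doc * valuesPerDoc;
                long docMask = 0L;
                for (int lane = 0; lane < lanes; lane++) {
                    long v = values[base + lane];
                    if (v >= min && v <= max) {
                        // Collapse the per-lane result into the bit of the doc owning this lane.
                        docMask |= 1L << (lane / valuesPerDoc);
                    }
                }
                orInto(bits, doc, docMask, dpv);
            }
        }
        // Scalar path: the tail after the vector loop, or every doc when dpv == 0.
        for (; doc < docCount; doc++) {
            for (int i = 0; i < valuesPerDoc; i++) {
                long v = values[doc * valuesPerDoc + i];
                if (v >= min && v <= max) {
                    bits[doc >>> 6] |= 1L << doc; // Java uses only the low 6 bits of the shift count
                    break;
                }
            }
        }
        return bits;
    }

    /** ORs a `width`-bit mask into the bitset starting at bit `firstDoc`. */
    static void orInto(long[] bits, int firstDoc, long mask, int width) {
        int word = firstDoc >>> 6;
        int shift = firstDoc & 63;
        bits[word] |= mask << shift;
        if (shift + width > 64) {
            // The mask straddles a word boundary; spill the high bits into the next word.
            bits[word + 1] |= mask >>> (64 - shift);
        }
    }
}
```

With `lanes=4` this mirrors the AVX2 rows above: `valuesPerDoc=2` gives 2 docs per vector, `valuesPerDoc=4` gives 1, and `valuesPerDoc=8` makes `docsPerVector` return 0 so every doc goes through the scalar loop.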
