dmatth1 opened a new pull request, #50030:
URL: https://github.com/apache/arrow/pull/50030

   ### Rationale for this change
     
     `BlockSplitBloomFilter::FindHash` ships the scalar reference probe — an 
8-iteration short-circuit loop. The short-circuit blocks autovectorization, and 
on miss-heavy workloads (Parquet row-group skipping) the per-lane 
branch-mispredict dominates probe latency.
   
     Closes #50026. Dev list discussion: 
https://lists.apache.org/thread/omof0fq47tndfd80g5hwp2bvjmzvpb40. Sibling 
change in Rust: apache/arrow-rs#10011.
   
     ### What changes are included in this PR?
     
     - Rewrite `FindHash` as a branchless OR-accumulator reduction. The new 
shape autovectorizes to SSE on x86 and NEON on aarch64 at the baseline.
     - Add `bloom_filter_avx2.cc` (xsimd kernel built with `-mavx2`) behind 
`CpuInfo`-based `DynamicDispatch`, mirroring the existing 
`level_comparison_avx2` pattern. xsimd was a requirement from the dev thread; 
the AVX2 target spells the reduction explicitly because gcc/MSVC don't lower 
the autovec body to a single `vptest`.
     - No on-disk format change, no public API change, and bit-identical to the 
scalar reference.
     - Insert path uses the same loop shape and will land in a follow-up PR.
   
     ### Performance
   
     End-to-end `FindHash` via `parquet/benches/bloom_filter_benchmark.cc`.
     
     **M1** (Apple clang -O3, NEON via autovec, 10 reps, CV ≤ 0.4%):
   
     | Bench | upstream/main | this PR | Speedup |
     |---|---:|---:|---:|
     | `BM_FindExistingHash` (hit-heavy) | 3.85 ns/probe | 2.41 ns/probe | 
**1.60×** |
     | `BM_FindNonExistingHash` (miss-heavy) | 9.04 ns/probe | 2.41 ns/probe | 
**3.75×** |
   
     **x86-64** (gcc 13.3, -O2 -mavx2 via AVX2 dispatch TU, 5 reps, CV ≤ 0.6%):
   
     | Bench | upstream/main | this PR | Speedup |
     |---|---:|---:|---:|
     | `BM_FindExistingHash` (hit-heavy) | 8.62 ns/probe | 4.32 ns/probe | 
**2.00×** |
     | `BM_FindNonExistingHash` (miss-heavy) | 15.29 ns/probe | 4.33 ns/probe | 
**3.53×** |
   
     The scalar miss path stalls on the data-dependent early-exit (slower than 
its own hit path on both archs); the branchless reduction is constant-time 
across hit/miss. `InsertHash`, `BatchInsertHash`, `ComputeHash`, 
`BatchComputeHash` all
     unchanged (16 benches within ±0.6%, inside CV).
   
     ### Are these changes tested?
   
     Yes. New `BloomFilterProbeKernel` test calls both dispatch targets 
directly across 20K random blocks + 200 production-populated blocks per CI run, 
asserting bit-identical output. `DynamicDispatch` resolves once at static init, 
so without this
     test the un-picked target would never be exercised in CI.
   
     Existing `BasicTest`, `FPPTest`, and `CompatibilityTest` continue to pass 
on both the scalar baseline and the AVX2 dispatch path.
   
     ### Are there any user-facing changes?
   
     No. Read-path implementation change only.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to