dmatth1 opened a new pull request, #50030:
URL: https://github.com/apache/arrow/pull/50030
### Rationale for this change
`BlockSplitBloomFilter::FindHash` ships the scalar reference probe — an
8-iteration short-circuit loop. The short-circuit blocks autovectorization, and
on miss-heavy workloads (Parquet row-group skipping) the per-lane
branch-mispredict dominates probe latency.
Closes #50026. Dev list discussion:
https://lists.apache.org/thread/omof0fq47tndfd80g5hwp2bvjmzvpb40. Sibling
change in Rust: apache/arrow-rs#10011.
### What changes are included in this PR?
- Rewrite `FindHash` as a branchless OR-accumulator reduction. The new
shape autovectorizes to SSE on x86 and NEON on aarch64 at the baseline.
- Add `bloom_filter_avx2.cc` (xsimd kernel built with `-mavx2`) behind
`CpuInfo`-based `DynamicDispatch`, mirroring the existing
`level_comparison_avx2` pattern. xsimd was a requirement from the dev thread;
the AVX2 target spells the reduction explicitly because gcc/MSVC don't lower
the autovec body to a single `vptest`.
- No on-disk format change, no public API change, and bit-identical to the
scalar reference.
- Insert path uses the same loop shape and will land in a follow-up PR.
### Performance
End-to-end `FindHash` via `parquet/benches/bloom_filter_benchmark.cc`.
**M1** (Apple clang -O3, NEON via autovec, 10 reps, CV ≤ 0.4%):
| Bench | upstream/main | this PR | Speedup |
|---|---:|---:|---:|
| `BM_FindExistingHash` (hit-heavy) | 3.85 ns/probe | 2.41 ns/probe |
**1.60×** |
| `BM_FindNonExistingHash` (miss-heavy) | 9.04 ns/probe | 2.41 ns/probe |
**3.75×** |
**x86-64** (gcc 13.3, -O2 -mavx2 via AVX2 dispatch TU, 5 reps, CV ≤ 0.6%):
| Bench | upstream/main | this PR | Speedup |
|---|---:|---:|---:|
| `BM_FindExistingHash` (hit-heavy) | 8.62 ns/probe | 4.32 ns/probe |
**2.00×** |
| `BM_FindNonExistingHash` (miss-heavy) | 15.29 ns/probe | 4.33 ns/probe |
**3.53×** |
The scalar miss path stalls on the data-dependent early-exit (slower than
its own hit path on both archs); the branchless reduction is constant-time
across hit/miss. `InsertHash`, `BatchInsertHash`, `ComputeHash`,
`BatchComputeHash` all
unchanged (16 benches within ±0.6%, inside CV).
### Are these changes tested?
Yes. New `BloomFilterProbeKernel` test calls both dispatch targets
directly across 20K random blocks + 200 production-populated blocks per CI run,
asserting bit-identical output. `DynamicDispatch` resolves once at static init,
so without this
test the un-picked target would never be exercised in CI.
Existing `BasicTest`, `FPPTest`, and `CompatibilityTest` continue to pass
on both the scalar baseline and the AVX2 dispatch path.
### Are there any user-facing changes?
No. Read-path implementation change only.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]