PR is up: https://github.com/apache/arrow/pull/50030
As discussed: xsimd kernel for the AVX2 dispatch target, runtime dispatch through CpuInfo (level_comparison_avx2 pattern), branchless autovec body shared across SSE/NEON at the baseline. On-disk format unchanged. Insert path will follow. Numbers (probe latency, end-to-end through FindHash): - M1 NEON: 1.60x hit / 3.75x miss - x86 AVX2 (gcc 13.3): 2.00x hit / 3.53x miss Bit-identical to scalar and cross-target diff test gates drift in CI. - Dan El El sáb, may 23, 2026 a la(s) 1:57 p.m., Dan Mattheiss < [email protected]> escribió: > Hi all, > > Quick update! For this kernel we may not need intrinsics or xsimd. > Rewriting the scalar FindHash loop branchless - drop the early return > false, accumulate miss |= (~blk[i] & mask), return miss == 0 - gets gcc, > clang, and MSVC to autovectorize plain C++ into the same probe (clang fuses > to the exact vptest the intrinsics used; gcc/MSVC do a vpor reduce). It’s > bit-identical to the scalar reference, and on a same-machine A/B it’s > within noise of the hand-written AVX2 on clang in every regime; gcc matches > except ~0.79× in the out-of-L3 throughput regime. The same source also > autovectorizes to NEON (the compiler does the 4-lane block split), so it > drops straight into the CpuInfo-dispatched AVX2 TU with no per-arch kernel. > > Repo: > > https://github.com/dmatth1/quickbloom/tree/parquet-SBBF-autovec/investigations/autovectorize > > This addresses most of the concerns I had so I’ll probably go ahead and > open a PR barring any disagreement! > > - Dan > > El El dom, may 17, 2026 a la(s) 12:35 p.m., Dan Mattheiss < > [email protected]> escribió: > >> Hi Xuwei, >> >> Thanks for the response. I looked at bloom_filter.h, bloom_filter.cc, and >> bloom_filter_avx2.cc under cpp/src/arrow/acero/. The two are different >> algorithms so sharing the SIMD primitive doesn’t really work: >> >> | |Acero BlockedBloomFilter |Parquet SBBF (spec) | >> >> |--------------|----------------------------------------|----------------------------------------------| >> |Block size |64-bit |256-bit (one `__m256i`) | >> |K |~4 |8 (one per 32-bit lane) | >> |Mask |precomputed mask table + rotation |deterministic `(key * SALT[i]) >> >> 27` per lane| >> |SALT |none |spec-defined 8 × uint32 | >> |Format |in-memory only |on-disk, locked by Parquet spec | >> |AVX2 reduction|gather-based load + comparison reduction|`vptest` on the >> loaded block | >> >> The Acero primitive can’t really be wired into >> BlockSplitBloomFilter::FindHash without breaking the Parquet on-disk >> format. They’re different algorithms, different bit-level outputs, with >> different cross-engine compatibility constraints. >> >> What I’d carry over from Acero is the file-organisation and dispatch >> pattern: cpp/src/parquet/bloom_filter_avx2.cc behind >> arrow::internal::CpuInfo::GetInstance()->IsSupported(CpuInfo::AVX2), >> matching how bloom_filter_avx2.cc is structured in Acero today. Helpful >> precedent for the xsimd-vs-raw-intrinsics question in the parent thread - >> Acero uses raw intrinsics; xsimd is the preference for newer code, and >> either is technically viable for the Parquet path. >> >> On “probe isn’t a bottleneck” for arrow-cpp’s scenario — agreed, and the >> case I’m making is mostly about the readers downstream of arrow-cpp rather >> than arrow-cpp’s own workloads. pyarrow / pandas use arrow-cpp via the >> Python bindings; ClickHouse vendors arrow-cpp as a contrib submodule and >> its non-native Parquet path goes through it. Velox’s findHash is a >> near-line-for-line port of arrow-cpp’s FindHash (same loop, same magic >> numbers, only naming differs); arrow-rs implements the same SBBF algorithm >> in Rust. Landing the AVX2 lowering here makes it the natural reference for >> the others to follow. >> >> Plan: PR against cpp/src/parquet/bloom_filter.cc along these lines, >> structured as bloom_filter_avx2.cc behind CpuInfo dispatch matching Acero’s >> pattern. Will incorporate Antoine’s three open questions (xsimd vs raw >> intrinsics, NEON scope, probe + insert) into the PR description once those >> settle in the parent thread. >> >> Regards, >> Dan >> >> El El dom, may 17, 2026 a la(s) 1:26 a.m., wish maple < >> [email protected]> escribió: >> >>> This is generally ok to me. I didn't write a SIMD probe because in our >>> scenario the probe is not the bottleneck. >>> But welcome to contribute to it! >>> >>> I also noticed that in arrow-cpp, it has a SIMD-implemented SBBF. I don't >>> know whether it's ok to share a >>> same implementation or extract it out. >>> >>> Best, >>> Xuwei Fu >>> >>> Dan Mattheiss <[email protected]> 于2026年5月17日周日 04:21写道: >>> >>> > Hi Antoine, >>> > >>> > Done — xsimd port is in branch parquet-xsimd-port of the repo, >>> diff-test >>> > bit-identical to scalar (167M pairs, 0 mismatches), builds clean under >>> > -Wall -Wextra. >>> > >>> > 20-rep medians on Sapphire Rapids @ 2.1 GHz (virtualised), post-hash >>> probe >>> > (XXH64 computed at setup, excluded from the timed loop): >>> > >>> > |Regime |scalar|xsimd single-key|xsimd 4-way bulk| >>> > |------------------|-----:|---------------:|---------------:| >>> > |Small (in-cache) |12.35 |2.48 (5.0×) |1.73 (7.1×) | >>> > |Medium (out-of-L3)|18.40 |7.41 (2.5×) |7.17 (2.6×) | >>> > |Large (deep DRAM) |31.05 |22.10 (1.4×) |18.67 (1.7×) | >>> > >>> > Same-machine A/B against raw _mm256_* intrinsics is within noise across >>> > every regime: -3% to +8%, with the +8% only on the single-key in-cache >>> > path. The xsimd form has one extra vpandn vs raw testc (which folds the >>> > andnot into vptest), which accounts for the ~1 cycle on the in-cache >>> hot >>> > path. A direct xsimd::testc(batch, batch) would close it. >>> > >>> > One finding worth recording for the PR description. The natural xsimd >>> > translation of _mm256_testc_si256 — xsimd::all((blk & mask) == mask) — >>> > lowers to 5 instructions through a vpcmpeqd + vpcmpeqd(allones) + >>> vptest >>> > chain rather than a single vptest. xsimd’s public testc(batch, batch) >>> > doesn’t exist, but xsimd::none(as_batch_bool(bitwise_andnot(mask, >>> blk))) >>> > lowers to vpandn + vptest — one more op than raw testc but otherwise >>> the >>> > same shape. That’s the form in the current branch. >>> > >>> > A small upstream xsimd::testc(batch, batch) PR would be a nice cleanup >>> > (folds the andnot into the vptest, recovers the ~1 cycle noted above) >>> but >>> > the current form is already within noise of the raw intrinsics version. >>> > >>> > Three things I’d like input on: >>> > 1. NEON in scope for the initial Arrow PR, or follow-up? The reduction >>> > translates fine via xsimd::none, but the 8-lane Parquet block on 4-lane >>> > NEON batches still needs an algorithm split, and I don’t have ARM >>> hardware >>> > to bench on. >>> > 2. Probe + insert in one PR, or probe first? >>> > 3. Anything specific you’d want to see in the PR description / commit >>> > message for the xsimd lowering note given it’s a non-obvious result? >>> > >>> > Repo: https://github.com/dmatth1/quickbloom (branch >>> parquet-xsimd-port; >>> > see >>> > investigations/parquet_port/XSIMD_PORT.md for the full A/B, codegen >>> > analysis, and earlier-attempts table) >>> > >>> > Regards, >>> > Dan >>> > >>> > El El sáb, may 16, 2026 a la(s) 1:36 a.m., Antoine Pitrou < >>> > [email protected]> escribió: >>> > >>> > > >>> > > Hi Dan, >>> > > >>> > > Thanks for proposing this. >>> > > >>> > > The #1 requirement I have is to use xsimd for this, like we do for >>> other >>> > > SIMD optimizations in Arrow. >>> > > https://xsimd.readthedocs.io/ >>> > > >>> > > Regards >>> > > >>> > > Antoine. >>> > > >>> > > >>> > > Le 16/05/2026 à 02:55, Dan Mattheiss a écrit : >>> > > > Hello everyone, >>> > > > >>> > > > *cc @mapleFU and @pitrou, who’ve been the most active reviewers on >>> > > > parquet-cpp bloom filter code recently.* >>> > > > >>> > > > arrow-go shipped AVX2/SSE4/NEON SBBF probes in 18.3.0 >>> > > > (apache/arrow-go#336). Impala and Kudu have had AVX2 SBBF probes >>> for >>> > > years. >>> > > > arrow-cpp still ships the scalar reference in >>> > > > BlockSplitBloomFilter::FindHash. I’d like to discuss porting the >>> > arrow-go >>> > > > approach to C++ before opening a PR. This is a read-path >>> implementation >>> > > > change with no on-disk format implications, so I’m raising it here >>> > rather >>> > > > than on parquet-dev. >>> > > > >>> > > > >>> > > > *Scope* >>> > > > On-disk format unchanged. SALT, XXH64, bucket index unchanged. >>> Only the >>> > > > probe implementation changes — an AVX2 variant selected at runtime >>> via >>> > > > arrow::internal::CpuInfo, with the existing scalar path retained >>> for >>> > > > non-AVX2 builds. No API change, no behavior change for any reader. >>> > > > >>> > > > *The change* >>> > > > About a dozen instructions, no branches: >>> > > > >>> > > > bool find_avx2(uint64_t hash) const { >>> > > > uint32_t bucket_index = uint32_t(((hash >> 32) * num_blocks_) >> >>> 32); >>> > > > uint32_t key = uint32_t(hash); >>> > > > >>> > > > __m256i hash_v = _mm256_set1_epi32(int32_t(key)); >>> > > > __m256i salt = _mm256_load_si256(reinterpret_cast<const >>> > __m256i*>(SALT)); >>> > > > __m256i prod = _mm256_mullo_epi32(hash_v, salt); >>> > > > __m256i shift = _mm256_srli_epi32(prod, 27); >>> > > > __m256i ones = _mm256_set1_epi32(1); >>> > > > __m256i mask = _mm256_sllv_epi32(ones, shift); >>> > > > >>> > > > const __m256i* blk = reinterpret_cast<const __m256i*>(data_ + >>> > > bucket_index >>> > > > * 32); >>> > > > return _mm256_testc_si256(_mm256_loadu_si256(blk), mask); >>> > > > } >>> > > > >>> > > > >>> > > > _mm256_mullo_epi32(hash_v, salt) produces eight 32-bit products >>> each >>> > > equal >>> > > > to scalar key * SALT[i]. _mm256_testc_si256 returns 1 iff (~block & >>> > mask) >>> > > > == 0 — the same condition as the scalar 8-iteration >>> > AND-with-early-exit. >>> > > > >>> > > > *Correctness* >>> > > > Bit-identical to the existing scalar reference: a diff test >>> against the >>> > > > vendored FindHash from cpp/src/parquet/bloom_filter.cc reports 0 >>> > > mismatches >>> > > > across 167M (query, filter) pairs. The same diff test would run in >>> CI >>> > to >>> > > > guard against drift between the two paths. >>> > > > >>> > > > *Numbers* >>> > > > Intel Xeon @ 2.8 GHz (Cascade Lake-class, 33 MiB L3, virtualised), >>> g++ >>> > > -O3 >>> > > > -mavx2 -mbmi2. Post-hash probe latency; XXH64 cost excluded from >>> the >>> > > timed >>> > > > loop. >>> > > > >>> > > > |Regime |Filter footprint|scalar |AVX2 single-key |AVX2 4-way bulk| >>> > > > >>> > > >>> > >>> |------------------|----------------|--------|----------------|---------------| >>> > > > |Small (in-cache) |0.5 MiB |12.73 ns|3.70 ns (3.4×) |2.36 ns >>> (5.4×) | >>> > > > |Medium (out-of-L3)|128 MiB |35.84 ns|29.18 ns (1.2×) |22.79 ns >>> (1.6×)| >>> > > > |Large (deep DRAM) |1 GiB |41.41 ns|35.90 ns (1.15×)|27.75 ns >>> (1.5×)| >>> > > > >>> > > > In-cache the win is ALU collapse: 8 dependent loads with branches >>> > becomes >>> > > > two SIMD ops after one cache-line load. Out-of-L3 both paths wait >>> for >>> > the >>> > > > same DRAM load and the per-probe gain narrows; the 4-way bulk path >>> > > overlaps >>> > > > four cache misses in flight and recovers most of it. Most useful >>> for >>> > > WHERE >>> > > > col IN (...) and large-table scans across many row-group filters. >>> > > > >>> > > > *Not in scope for this discussion* >>> > > > • AVX-512 evaluated and rejected — frequency throttle on Sapphire >>> > Rapids >>> > > > exceeds the width gain on this workload, and SBBF mask compute >>> doesn’t >>> > > > saturate ZMM. The variants-rejected table in the linked repo has >>> the >>> > > data. >>> > > > • NEON / SVE2 not measured. Intrinsics map cleanly on paper but I >>> don’t >>> > > > want to send hand-wave estimates. Happy to do this as a follow-up >>> if >>> > > > there’s interest. >>> > > > >>> > > > *What I’d like feedback on* >>> > > > 1. Preference on landing probe + insert together, or probe first >>> as a >>> > > > smaller PR? >>> > > > 2. Should NEON block the AVX2 PR, or is it acceptable as a >>> follow-up? >>> > > > 3. Any concerns with the diff-test-in-CI approach as the regression >>> > guard >>> > > > between scalar and AVX2 paths? >>> > > > >>> > > > *Links* >>> > > > • arrow-go prior art: https://github.com/apache/arrow-go/pull/336 >>> > > > • Bench, vendored references, diff test: >>> > > > https://github.com/dmatth1/quickbloom (under >>> > > investigations/parquet_port/) >>> > > > • Longer write-up: https://dmatth1.github.io/posts/parquet-bloom/ >>> > > > >>> > > > Thanks, >>> > > > Dan >>> > > > >>> > > >>> > > >>> > >>> >>
