PR is up: https://github.com/apache/arrow/pull/50030


As discussed: xsimd kernel for the AVX2 dispatch target, runtime dispatch
through CpuInfo (level_comparison_avx2 pattern), branchless autovec body
shared across SSE/NEON at the baseline. On-disk format unchanged. Insert
path will follow.


Numbers (probe latency, end-to-end through FindHash):

  - M1 NEON: 1.60x hit / 3.75x miss

  - x86 AVX2 (gcc 13.3): 2.00x hit / 3.53x miss


Bit-identical to scalar and cross-target diff test gates drift in CI.


- Dan


El El sáb, may 23, 2026 a la(s) 1:57 p.m., Dan Mattheiss <
[email protected]> escribió:

> Hi all,
>
> Quick update! For this kernel we may not need intrinsics or xsimd.
> Rewriting the scalar FindHash loop branchless - drop the early return
> false, accumulate miss |= (~blk[i] & mask), return miss == 0 - gets gcc,
> clang, and MSVC to autovectorize plain C++ into the same probe (clang fuses
> to the exact vptest the intrinsics used; gcc/MSVC do a vpor reduce). It’s
> bit-identical to the scalar reference, and on a same-machine A/B it’s
> within noise of the hand-written AVX2 on clang in every regime; gcc matches
> except ~0.79× in the out-of-L3 throughput regime. The same source also
> autovectorizes to NEON (the compiler does the 4-lane block split), so it
> drops straight into the CpuInfo-dispatched AVX2 TU with no per-arch kernel.
>
> Repo:
>
> https://github.com/dmatth1/quickbloom/tree/parquet-SBBF-autovec/investigations/autovectorize
>
> This addresses most of the concerns I had so I’ll probably go ahead and
> open a PR barring any disagreement!
>
> - Dan
>
> El El dom, may 17, 2026 a la(s) 12:35 p.m., Dan Mattheiss <
> [email protected]> escribió:
>
>> Hi Xuwei,
>>
>> Thanks for the response. I looked at bloom_filter.h, bloom_filter.cc, and
>> bloom_filter_avx2.cc under cpp/src/arrow/acero/. The two are different
>> algorithms so sharing the SIMD primitive doesn’t really work:
>>
>> | |Acero BlockedBloomFilter |Parquet SBBF (spec) |
>>
>> |--------------|----------------------------------------|----------------------------------------------|
>> |Block size |64-bit |256-bit (one `__m256i`) |
>> |K |~4 |8 (one per 32-bit lane) |
>> |Mask |precomputed mask table + rotation |deterministic `(key * SALT[i])
>> >> 27` per lane|
>> |SALT |none |spec-defined 8 × uint32 |
>> |Format |in-memory only |on-disk, locked by Parquet spec |
>> |AVX2 reduction|gather-based load + comparison reduction|`vptest` on the
>> loaded block |
>>
>> The Acero primitive can’t really be wired into
>> BlockSplitBloomFilter::FindHash without breaking the Parquet on-disk
>> format. They’re different algorithms, different bit-level outputs, with
>> different cross-engine compatibility constraints.
>>
>> What I’d carry over from Acero is the file-organisation and dispatch
>> pattern: cpp/src/parquet/bloom_filter_avx2.cc behind
>> arrow::internal::CpuInfo::GetInstance()->IsSupported(CpuInfo::AVX2),
>> matching how bloom_filter_avx2.cc is structured in Acero today. Helpful
>> precedent for the xsimd-vs-raw-intrinsics question in the parent thread -
>> Acero uses raw intrinsics; xsimd is the preference for newer code, and
>> either is technically viable for the Parquet path.
>>
>> On “probe isn’t a bottleneck” for arrow-cpp’s scenario — agreed, and the
>> case I’m making is mostly about the readers downstream of arrow-cpp rather
>> than arrow-cpp’s own workloads. pyarrow / pandas use arrow-cpp via the
>> Python bindings; ClickHouse vendors arrow-cpp as a contrib submodule and
>> its non-native Parquet path goes through it. Velox’s findHash is a
>> near-line-for-line port of arrow-cpp’s FindHash (same loop, same magic
>> numbers, only naming differs); arrow-rs implements the same SBBF algorithm
>> in Rust. Landing the AVX2 lowering here makes it the natural reference for
>> the others to follow.
>>
>> Plan: PR against cpp/src/parquet/bloom_filter.cc along these lines,
>> structured as bloom_filter_avx2.cc behind CpuInfo dispatch matching Acero’s
>> pattern. Will incorporate Antoine’s three open questions (xsimd vs raw
>> intrinsics, NEON scope, probe + insert) into the PR description once those
>> settle in the parent thread.
>>
>> Regards,
>> Dan
>>
>> El El dom, may 17, 2026 a la(s) 1:26 a.m., wish maple <
>> [email protected]> escribió:
>>
>>> This is generally ok to me. I didn't write a SIMD probe because in our
>>> scenario the probe is not the bottleneck.
>>> But welcome to contribute to it!
>>>
>>> I also noticed that in arrow-cpp, it has a SIMD-implemented SBBF. I don't
>>> know whether it's ok to share a
>>> same implementation or extract it out.
>>>
>>> Best,
>>> Xuwei Fu
>>>
>>> Dan Mattheiss <[email protected]> 于2026年5月17日周日 04:21写道:
>>>
>>> > Hi Antoine,
>>> >
>>> > Done — xsimd port is in branch parquet-xsimd-port of the repo,
>>> diff-test
>>> > bit-identical to scalar (167M pairs, 0 mismatches), builds clean under
>>> > -Wall -Wextra.
>>> >
>>> > 20-rep medians on Sapphire Rapids @ 2.1 GHz (virtualised), post-hash
>>> probe
>>> > (XXH64 computed at setup, excluded from the timed loop):
>>> >
>>> > |Regime |scalar|xsimd single-key|xsimd 4-way bulk|
>>> > |------------------|-----:|---------------:|---------------:|
>>> > |Small (in-cache) |12.35 |2.48 (5.0×) |1.73 (7.1×) |
>>> > |Medium (out-of-L3)|18.40 |7.41 (2.5×) |7.17 (2.6×) |
>>> > |Large (deep DRAM) |31.05 |22.10 (1.4×) |18.67 (1.7×) |
>>> >
>>> > Same-machine A/B against raw _mm256_* intrinsics is within noise across
>>> > every regime: -3% to +8%, with the +8% only on the single-key in-cache
>>> > path. The xsimd form has one extra vpandn vs raw testc (which folds the
>>> > andnot into vptest), which accounts for the ~1 cycle on the in-cache
>>> hot
>>> > path. A direct xsimd::testc(batch, batch) would close it.
>>> >
>>> > One finding worth recording for the PR description. The natural xsimd
>>> > translation of _mm256_testc_si256 — xsimd::all((blk & mask) == mask) —
>>> > lowers to 5 instructions through a vpcmpeqd + vpcmpeqd(allones) +
>>> vptest
>>> > chain rather than a single vptest. xsimd’s public testc(batch, batch)
>>> > doesn’t exist, but xsimd::none(as_batch_bool(bitwise_andnot(mask,
>>> blk)))
>>> > lowers to vpandn + vptest — one more op than raw testc but otherwise
>>> the
>>> > same shape. That’s the form in the current branch.
>>> >
>>> > A small upstream xsimd::testc(batch, batch) PR would be a nice cleanup
>>> > (folds the andnot into the vptest, recovers the ~1 cycle noted above)
>>> but
>>> > the current form is already within noise of the raw intrinsics version.
>>> >
>>> > Three things I’d like input on:
>>> > 1. NEON in scope for the initial Arrow PR, or follow-up? The reduction
>>> > translates fine via xsimd::none, but the 8-lane Parquet block on 4-lane
>>> > NEON batches still needs an algorithm split, and I don’t have ARM
>>> hardware
>>> > to bench on.
>>> > 2. Probe + insert in one PR, or probe first?
>>> > 3. Anything specific you’d want to see in the PR description / commit
>>> > message for the xsimd lowering note given it’s a non-obvious result?
>>> >
>>> > Repo: https://github.com/dmatth1/quickbloom (branch
>>> parquet-xsimd-port;
>>> > see
>>> > investigations/parquet_port/XSIMD_PORT.md for the full A/B, codegen
>>> > analysis, and earlier-attempts table)
>>> >
>>> > Regards,
>>> > Dan
>>> >
>>> > El El sáb, may 16, 2026 a la(s) 1:36 a.m., Antoine Pitrou <
>>> > [email protected]> escribió:
>>> >
>>> > >
>>> > > Hi Dan,
>>> > >
>>> > > Thanks for proposing this.
>>> > >
>>> > > The #1 requirement I have is to use xsimd for this, like we do for
>>> other
>>> > > SIMD optimizations in Arrow.
>>> > > https://xsimd.readthedocs.io/
>>> > >
>>> > > Regards
>>> > >
>>> > > Antoine.
>>> > >
>>> > >
>>> > > Le 16/05/2026 à 02:55, Dan Mattheiss a écrit :
>>> > > > Hello everyone,
>>> > > >
>>> > > > *cc @mapleFU and @pitrou, who’ve been the most active reviewers on
>>> > > > parquet-cpp bloom filter code recently.*
>>> > > >
>>> > > > arrow-go shipped AVX2/SSE4/NEON SBBF probes in 18.3.0
>>> > > > (apache/arrow-go#336). Impala and Kudu have had AVX2 SBBF probes
>>> for
>>> > > years.
>>> > > > arrow-cpp still ships the scalar reference in
>>> > > > BlockSplitBloomFilter::FindHash. I’d like to discuss porting the
>>> > arrow-go
>>> > > > approach to C++ before opening a PR. This is a read-path
>>> implementation
>>> > > > change with no on-disk format implications, so I’m raising it here
>>> > rather
>>> > > > than on parquet-dev.
>>> > > >
>>> > > >
>>> > > > *Scope*
>>> > > > On-disk format unchanged. SALT, XXH64, bucket index unchanged.
>>> Only the
>>> > > > probe implementation changes — an AVX2 variant selected at runtime
>>> via
>>> > > > arrow::internal::CpuInfo, with the existing scalar path retained
>>> for
>>> > > > non-AVX2 builds. No API change, no behavior change for any reader.
>>> > > >
>>> > > > *The change*
>>> > > > About a dozen instructions, no branches:
>>> > > >
>>> > > > bool find_avx2(uint64_t hash) const {
>>> > > > uint32_t bucket_index = uint32_t(((hash >> 32) * num_blocks_) >>
>>> 32);
>>> > > > uint32_t key = uint32_t(hash);
>>> > > >
>>> > > > __m256i hash_v = _mm256_set1_epi32(int32_t(key));
>>> > > > __m256i salt = _mm256_load_si256(reinterpret_cast<const
>>> > __m256i*>(SALT));
>>> > > > __m256i prod = _mm256_mullo_epi32(hash_v, salt);
>>> > > > __m256i shift = _mm256_srli_epi32(prod, 27);
>>> > > > __m256i ones = _mm256_set1_epi32(1);
>>> > > > __m256i mask = _mm256_sllv_epi32(ones, shift);
>>> > > >
>>> > > > const __m256i* blk = reinterpret_cast<const __m256i*>(data_ +
>>> > > bucket_index
>>> > > > * 32);
>>> > > > return _mm256_testc_si256(_mm256_loadu_si256(blk), mask);
>>> > > > }
>>> > > >
>>> > > >
>>> > > > _mm256_mullo_epi32(hash_v, salt) produces eight 32-bit products
>>> each
>>> > > equal
>>> > > > to scalar key * SALT[i]. _mm256_testc_si256 returns 1 iff (~block &
>>> > mask)
>>> > > > == 0 — the same condition as the scalar 8-iteration
>>> > AND-with-early-exit.
>>> > > >
>>> > > > *Correctness*
>>> > > > Bit-identical to the existing scalar reference: a diff test
>>> against the
>>> > > > vendored FindHash from cpp/src/parquet/bloom_filter.cc reports 0
>>> > > mismatches
>>> > > > across 167M (query, filter) pairs. The same diff test would run in
>>> CI
>>> > to
>>> > > > guard against drift between the two paths.
>>> > > >
>>> > > > *Numbers*
>>> > > > Intel Xeon @ 2.8 GHz (Cascade Lake-class, 33 MiB L3, virtualised),
>>> g++
>>> > > -O3
>>> > > > -mavx2 -mbmi2. Post-hash probe latency; XXH64 cost excluded from
>>> the
>>> > > timed
>>> > > > loop.
>>> > > >
>>> > > > |Regime |Filter footprint|scalar |AVX2 single-key |AVX2 4-way bulk|
>>> > > >
>>> > >
>>> >
>>> |------------------|----------------|--------|----------------|---------------|
>>> > > > |Small (in-cache) |0.5 MiB |12.73 ns|3.70 ns (3.4×) |2.36 ns
>>> (5.4×) |
>>> > > > |Medium (out-of-L3)|128 MiB |35.84 ns|29.18 ns (1.2×) |22.79 ns
>>> (1.6×)|
>>> > > > |Large (deep DRAM) |1 GiB |41.41 ns|35.90 ns (1.15×)|27.75 ns
>>> (1.5×)|
>>> > > >
>>> > > > In-cache the win is ALU collapse: 8 dependent loads with branches
>>> > becomes
>>> > > > two SIMD ops after one cache-line load. Out-of-L3 both paths wait
>>> for
>>> > the
>>> > > > same DRAM load and the per-probe gain narrows; the 4-way bulk path
>>> > > overlaps
>>> > > > four cache misses in flight and recovers most of it. Most useful
>>> for
>>> > > WHERE
>>> > > > col IN (...) and large-table scans across many row-group filters.
>>> > > >
>>> > > > *Not in scope for this discussion*
>>> > > > • AVX-512 evaluated and rejected — frequency throttle on Sapphire
>>> > Rapids
>>> > > > exceeds the width gain on this workload, and SBBF mask compute
>>> doesn’t
>>> > > > saturate ZMM. The variants-rejected table in the linked repo has
>>> the
>>> > > data.
>>> > > > • NEON / SVE2 not measured. Intrinsics map cleanly on paper but I
>>> don’t
>>> > > > want to send hand-wave estimates. Happy to do this as a follow-up
>>> if
>>> > > > there’s interest.
>>> > > >
>>> > > > *What I’d like feedback on*
>>> > > > 1. Preference on landing probe + insert together, or probe first
>>> as a
>>> > > > smaller PR?
>>> > > > 2. Should NEON block the AVX2 PR, or is it acceptable as a
>>> follow-up?
>>> > > > 3. Any concerns with the diff-test-in-CI approach as the regression
>>> > guard
>>> > > > between scalar and AVX2 paths?
>>> > > >
>>> > > > *Links*
>>> > > > • arrow-go prior art: https://github.com/apache/arrow-go/pull/336
>>> > > > • Bench, vendored references, diff test:
>>> > > > https://github.com/dmatth1/quickbloom (under
>>> > > investigations/parquet_port/)
>>> > > > • Longer write-up: https://dmatth1.github.io/posts/parquet-bloom/
>>> > > >
>>> > > > Thanks,
>>> > > > Dan
>>> > > >
>>> > >
>>> > >
>>> >
>>>
>>

Reply via email to