This is generally ok to me. I didn't write a SIMD probe because in our
scenario the probe is not the bottleneck.
But welcome to contribute to it!

I also noticed that in arrow-cpp, it has a SIMD-implemented SBBF. I don't
know whether it's ok to share a
same implementation or extract it out.

Best,
Xuwei Fu

Dan Mattheiss <[email protected]> 于2026年5月17日周日 04:21写道:

> Hi Antoine,
>
> Done — xsimd port is in branch parquet-xsimd-port of the repo, diff-test
> bit-identical to scalar (167M pairs, 0 mismatches), builds clean under
> -Wall -Wextra.
>
> 20-rep medians on Sapphire Rapids @ 2.1 GHz (virtualised), post-hash probe
> (XXH64 computed at setup, excluded from the timed loop):
>
> |Regime |scalar|xsimd single-key|xsimd 4-way bulk|
> |------------------|-----:|---------------:|---------------:|
> |Small (in-cache) |12.35 |2.48 (5.0×) |1.73 (7.1×) |
> |Medium (out-of-L3)|18.40 |7.41 (2.5×) |7.17 (2.6×) |
> |Large (deep DRAM) |31.05 |22.10 (1.4×) |18.67 (1.7×) |
>
> Same-machine A/B against raw _mm256_* intrinsics is within noise across
> every regime: -3% to +8%, with the +8% only on the single-key in-cache
> path. The xsimd form has one extra vpandn vs raw testc (which folds the
> andnot into vptest), which accounts for the ~1 cycle on the in-cache hot
> path. A direct xsimd::testc(batch, batch) would close it.
>
> One finding worth recording for the PR description. The natural xsimd
> translation of _mm256_testc_si256 — xsimd::all((blk & mask) == mask) —
> lowers to 5 instructions through a vpcmpeqd + vpcmpeqd(allones) + vptest
> chain rather than a single vptest. xsimd’s public testc(batch, batch)
> doesn’t exist, but xsimd::none(as_batch_bool(bitwise_andnot(mask, blk)))
> lowers to vpandn + vptest — one more op than raw testc but otherwise the
> same shape. That’s the form in the current branch.
>
> A small upstream xsimd::testc(batch, batch) PR would be a nice cleanup
> (folds the andnot into the vptest, recovers the ~1 cycle noted above) but
> the current form is already within noise of the raw intrinsics version.
>
> Three things I’d like input on:
> 1. NEON in scope for the initial Arrow PR, or follow-up? The reduction
> translates fine via xsimd::none, but the 8-lane Parquet block on 4-lane
> NEON batches still needs an algorithm split, and I don’t have ARM hardware
> to bench on.
> 2. Probe + insert in one PR, or probe first?
> 3. Anything specific you’d want to see in the PR description / commit
> message for the xsimd lowering note given it’s a non-obvious result?
>
> Repo: https://github.com/dmatth1/quickbloom (branch parquet-xsimd-port;
> see
> investigations/parquet_port/XSIMD_PORT.md for the full A/B, codegen
> analysis, and earlier-attempts table)
>
> Regards,
> Dan
>
> El El sáb, may 16, 2026 a la(s) 1:36 a.m., Antoine Pitrou <
> [email protected]> escribió:
>
> >
> > Hi Dan,
> >
> > Thanks for proposing this.
> >
> > The #1 requirement I have is to use xsimd for this, like we do for other
> > SIMD optimizations in Arrow.
> > https://xsimd.readthedocs.io/
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 16/05/2026 à 02:55, Dan Mattheiss a écrit :
> > > Hello everyone,
> > >
> > > *cc @mapleFU and @pitrou, who’ve been the most active reviewers on
> > > parquet-cpp bloom filter code recently.*
> > >
> > > arrow-go shipped AVX2/SSE4/NEON SBBF probes in 18.3.0
> > > (apache/arrow-go#336). Impala and Kudu have had AVX2 SBBF probes for
> > years.
> > > arrow-cpp still ships the scalar reference in
> > > BlockSplitBloomFilter::FindHash. I’d like to discuss porting the
> arrow-go
> > > approach to C++ before opening a PR. This is a read-path implementation
> > > change with no on-disk format implications, so I’m raising it here
> rather
> > > than on parquet-dev.
> > >
> > >
> > > *Scope*
> > > On-disk format unchanged. SALT, XXH64, bucket index unchanged. Only the
> > > probe implementation changes — an AVX2 variant selected at runtime via
> > > arrow::internal::CpuInfo, with the existing scalar path retained for
> > > non-AVX2 builds. No API change, no behavior change for any reader.
> > >
> > > *The change*
> > > About a dozen instructions, no branches:
> > >
> > > bool find_avx2(uint64_t hash) const {
> > > uint32_t bucket_index = uint32_t(((hash >> 32) * num_blocks_) >> 32);
> > > uint32_t key = uint32_t(hash);
> > >
> > > __m256i hash_v = _mm256_set1_epi32(int32_t(key));
> > > __m256i salt = _mm256_load_si256(reinterpret_cast<const
> __m256i*>(SALT));
> > > __m256i prod = _mm256_mullo_epi32(hash_v, salt);
> > > __m256i shift = _mm256_srli_epi32(prod, 27);
> > > __m256i ones = _mm256_set1_epi32(1);
> > > __m256i mask = _mm256_sllv_epi32(ones, shift);
> > >
> > > const __m256i* blk = reinterpret_cast<const __m256i*>(data_ +
> > bucket_index
> > > * 32);
> > > return _mm256_testc_si256(_mm256_loadu_si256(blk), mask);
> > > }
> > >
> > >
> > > _mm256_mullo_epi32(hash_v, salt) produces eight 32-bit products each
> > equal
> > > to scalar key * SALT[i]. _mm256_testc_si256 returns 1 iff (~block &
> mask)
> > > == 0 — the same condition as the scalar 8-iteration
> AND-with-early-exit.
> > >
> > > *Correctness*
> > > Bit-identical to the existing scalar reference: a diff test against the
> > > vendored FindHash from cpp/src/parquet/bloom_filter.cc reports 0
> > mismatches
> > > across 167M (query, filter) pairs. The same diff test would run in CI
> to
> > > guard against drift between the two paths.
> > >
> > > *Numbers*
> > > Intel Xeon @ 2.8 GHz (Cascade Lake-class, 33 MiB L3, virtualised), g++
> > -O3
> > > -mavx2 -mbmi2. Post-hash probe latency; XXH64 cost excluded from the
> > timed
> > > loop.
> > >
> > > |Regime |Filter footprint|scalar |AVX2 single-key |AVX2 4-way bulk|
> > >
> >
> |------------------|----------------|--------|----------------|---------------|
> > > |Small (in-cache) |0.5 MiB |12.73 ns|3.70 ns (3.4×) |2.36 ns (5.4×) |
> > > |Medium (out-of-L3)|128 MiB |35.84 ns|29.18 ns (1.2×) |22.79 ns (1.6×)|
> > > |Large (deep DRAM) |1 GiB |41.41 ns|35.90 ns (1.15×)|27.75 ns (1.5×)|
> > >
> > > In-cache the win is ALU collapse: 8 dependent loads with branches
> becomes
> > > two SIMD ops after one cache-line load. Out-of-L3 both paths wait for
> the
> > > same DRAM load and the per-probe gain narrows; the 4-way bulk path
> > overlaps
> > > four cache misses in flight and recovers most of it. Most useful for
> > WHERE
> > > col IN (...) and large-table scans across many row-group filters.
> > >
> > > *Not in scope for this discussion*
> > > • AVX-512 evaluated and rejected — frequency throttle on Sapphire
> Rapids
> > > exceeds the width gain on this workload, and SBBF mask compute doesn’t
> > > saturate ZMM. The variants-rejected table in the linked repo has the
> > data.
> > > • NEON / SVE2 not measured. Intrinsics map cleanly on paper but I don’t
> > > want to send hand-wave estimates. Happy to do this as a follow-up if
> > > there’s interest.
> > >
> > > *What I’d like feedback on*
> > > 1. Preference on landing probe + insert together, or probe first as a
> > > smaller PR?
> > > 2. Should NEON block the AVX2 PR, or is it acceptable as a follow-up?
> > > 3. Any concerns with the diff-test-in-CI approach as the regression
> guard
> > > between scalar and AVX2 paths?
> > >
> > > *Links*
> > > • arrow-go prior art: https://github.com/apache/arrow-go/pull/336
> > > • Bench, vendored references, diff test:
> > > https://github.com/dmatth1/quickbloom (under
> > investigations/parquet_port/)
> > > • Longer write-up: https://dmatth1.github.io/posts/parquet-bloom/
> > >
> > > Thanks,
> > > Dan
> > >
> >
> >
>

Reply via email to