dmatth1 commented on PR #50030:
URL: https://github.com/apache/arrow/pull/50030#issuecomment-4531360238
Branchless body alone (no xsimd kernel) on AVX2:
- On clang with `-mavx2`, it is within noise of the hand-written xsimd kernel in
every regime.
- On gcc, it matches xsimd except in the out-of-L3 regime, where it comes in at
~0.79× of xsimd.

That gap is why this PR ships a separate xsimd kernel for the AVX2 TU
rather than relying on autovectorization alone. On clang-only builds the xsimd
kernel is essentially a no-op, but on gcc/MSVC it pins the `vptest` lowering.
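
For readers without the diff open, here is a minimal sketch of what a branchless probe body can look like. It is not the PR's code: the 8 × 32-bit block layout and the names `Block` and `probe_branchless` are assumptions made for illustration. Only the idea of accumulating a `miss` value and testing it once at the end follows the `miss != 0` spelling mentioned in this comment.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical layout (assumption): one 256-bit block as eight 32-bit lanes,
// which is the width of a single AVX2 register.
using Block = std::array<std::uint32_t, 8>;

// Branchless probe: collect every bit that `mask` requires but `block` lacks.
// There is no early exit, so the loop has a fixed trip count and no
// data-dependent branch; the only test is the final comparison.
inline bool probe_branchless(const Block& block, const Block& mask) {
  std::uint32_t miss = 0;
  for (std::size_t i = 0; i < block.size(); ++i) {
    miss |= mask[i] & ~block[i];
  }
  return miss == 0;  // true when all required bits are present
}
```

Whether a compiler turns this loop into a single vector and-not followed by `vptest` depends on the compiler and flags, which is the variability the bullets above describe.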
Cache regime sweep: scalar vs xsimd, post-hash probe latency:
| Regime | scalar | xsimd | Speedup |
|---|---:|---:|---:|
| Small in-cache (0.5 MiB) | 12.35 ns | 2.48 ns | 5.0× |
| Medium out-of-L3 (128 MiB) | 18.40 ns | 7.41 ns | 2.5× |
| Large deep DRAM (1 GiB) | 31.05 ns | 22.10 ns | 1.4× |
These numbers use the `as_batch_bool` xsimd form, which is ~1 cycle faster
in-cache than the shipped `miss != 0` spelling and unchanged in the
out-of-cache regimes. They also cover the post-hash probe only (XXH64
excluded), so absolute values don't compare directly to the end-to-end table in
the commit body. The regime shape (biggest gain in-cache, smallest in DRAM)
holds for the shipped form.
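
For anyone re-deriving the table, the Speedup column is the scalar latency divided by the xsimd latency, rounded to one decimal. A small sketch of that arithmetic follows; the helper name `speedup_1dp` is made up for this example.

```cpp
#include <cmath>

// Ratio of scalar to SIMD latency, rounded to one decimal place.
// Both arguments must be in the same unit (ns in the table).
inline double speedup_1dp(double scalar_ns, double simd_ns) {
  return std::round((scalar_ns / simd_ns) * 10.0) / 10.0;
}
```

With the table's values, `speedup_1dp(12.35, 2.48)`, `speedup_1dp(18.40, 7.41)` and `speedup_1dp(31.05, 22.10)` reproduce the 5.0×, 2.5× and 1.4× entries.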
I can re-bench in-tree with the commit if you want directly comparable numbers.