[PR] perf(parquet/internal/encoding): vectorize amd64 bool unpack [arrow-go]

via GitHub Fri, 27 Mar 2026 13:27:45 -0700


zeroshade opened a new pull request, #735:
URL: https://github.com/apache/arrow-go/pull/735


   ### Rationale for this change
   The SSE4 and AVX2 implementations of `_bytes_to_bools` in parquet/internal/utils/ contain zero SIMD instructions. Auto-vectorization of the C loop failed completely, producing purely scalar code (movzx/shr/and/mov, one bit at a time). The SSE4 and AVX2 .s files are byte-for-byte identical: the same scalar code with different labels.
   
   This is the amd64 counterpart to #731, which fixed the same issue on ARM64 NEON.
   
   ### What changes are included in this PR?
   
   Rewrote both assembly implementations with actual SIMD vectorized code.
   
   SSE4 (unpack_bool_sse4_amd64.s) processes 2 input bytes → 16 output bools per iteration:
   
   1. MOVWLZX + MOVD — load 2 input bytes into XMM
   2. PSHUFB — broadcast byte 0 → lanes 0-7, byte 1 → lanes 8-15
   3. PAND + PCMPEQB — parallel bit-test against mask [1,2,4,8,16,32,64,128] × 2
   4. PAND — normalize 0xFF → 0x01 for valid Go bool values
   5. MOVOU — store 16 output bools at once
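
The five SSE4 steps above can be modelled lane by lane in plain Go. This is only an illustrative sketch, not the assembly from the PR: `unpack16` and `bitMask` are hypothetical names, and each loop pass stands in for one of the 16 byte lanes that the real instructions process in parallel.

```go
package main

// bitMask mirrors the constant described above: [1,2,4,...,128] repeated twice,
// one entry per byte lane of a 128-bit XMM register.
var bitMask = [16]byte{1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128}

// unpack16 models one SSE4 iteration: 2 input bytes become 16 bools, LSB first.
// Hypothetical helper for illustration only.
func unpack16(b0, b1 byte) [16]bool {
	src := [2]byte{b0, b1} // MOVWLZX + MOVD: the two loaded input bytes
	var out [16]bool
	for lane := 0; lane < 16; lane++ {
		// PSHUFB: lanes 0-7 receive byte 0, lanes 8-15 receive byte 1.
		v := src[lane/8]
		// PAND: keep only the bit this lane is responsible for.
		v &= bitMask[lane]
		// PCMPEQB: 0xFF where the bit was set, 0x00 otherwise.
		var eq byte
		if v == bitMask[lane] {
			eq = 0xFF
		}
		// PAND with 0x01: normalize 0xFF to 0x01 so the byte is a valid Go bool.
		out[lane] = eq&0x01 == 1
	}
	return out
}
```

For example, `unpack16(0x05, 0x80)` sets lanes 0, 2, and 15 and leaves the rest false.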
   
   AVX2 (unpack_bool_avx2_amd64.s) processes 4 input bytes → 32 output bools per iteration:
   
   1. MOVL + MOVD + VPBROADCASTD — load and broadcast 4 bytes across all 32 YMM 
lanes
   2. VPSHUFB — distribute each byte to its 8 corresponding lanes
   3. VPAND + VPCMPEQB + VPAND — parallel bit-test + normalize to 0/1
   4. VMOVDQU — store 32 output bools at once
   5. VZEROUPPER — avoid SSE-AVX transition penalties on return
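
The AVX2 iteration can be modelled the same way. The sketch below is an assumption-laden illustration rather than the PR's code: `unpack32` is a hypothetical name, and the shuffle indices are a guess at how the distribution could work given that VPSHUFB only shuffles within each 128-bit half of a YMM register (which is why the 4 bytes are broadcast to every dword first).

```go
package main

// unpack32 models one AVX2 iteration: 4 input bytes become 32 bools, LSB first.
// Hypothetical helper for illustration only.
func unpack32(in [4]byte) [32]bool {
	// MOVL + MOVD + VPBROADCASTD: repeat the 4 input bytes in all 8 dwords.
	var ymm [32]byte
	for d := 0; d < 8; d++ {
		copy(ymm[d*4:d*4+4], in[:])
	}

	// VPSHUFB: each output lane picks a byte from its own 128-bit half.
	// Assumed control: lanes 0-7 pick byte 0, 8-15 byte 1, 16-23 byte 2, 24-31 byte 3.
	var shuffled [32]byte
	for lane := 0; lane < 32; lane++ {
		half := lane / 16 * 16 // start of this lane's 128-bit half
		idx := lane / 8        // which input byte this lane needs (0..3)
		shuffled[lane] = ymm[half+idx]
	}

	var out [32]bool
	for lane := 0; lane < 32; lane++ {
		mask := byte(1) << (uint(lane) % 8)
		v := shuffled[lane] & mask // VPAND: isolate the lane's bit
		var eq byte
		if v == mask { // VPCMPEQB: 0xFF where the bit is set
			eq = 0xFF
		}
		out[lane] = eq&0x01 == 1 // VPAND: normalize to 0/1
	}
	return out
}
```

With input `[4]byte{0x01, 0x02, 0x04, 0x80}`, the true lanes are 0, 9, 18, and 31.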
   
   Both implementations fall back to a scalar tail loop when fewer output slots remain than one vector iteration produces (16 for SSE4, 32 for AVX2).
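
The overall control flow (full-width main loop, then a scalar tail) might look like the following in plain Go. This is a sketch under stated assumptions: `bytesToBoolsModel` is a hypothetical name, the chunk width of 32 corresponds to the AVX2 variant, and clamping the output to `len(in)*8` is an assumption about bounds handling, not something the PR states.

```go
package main

// bytesToBoolsModel is a hypothetical pure-Go model of the loop structure:
// whole 32-bool chunks first, then a one-bit-at-a-time tail.
func bytesToBoolsModel(in []byte, out []bool) {
	const lanes = 32 // bools produced per AVX2 iteration (4 input bytes)

	n := len(out)
	if limit := len(in) * 8; n > limit {
		n = limit // assumption: never read past the end of the input
	}

	i := 0
	// Main loop: runs only while a full chunk of output slots remains.
	// In the assembly this inner loop is a single vector iteration.
	for ; i+lanes <= n; i += lanes {
		for lane := 0; lane < lanes; lane++ {
			out[i+lane] = in[(i+lane)/8]>>(uint(lane)%8)&1 == 1
		}
	}

	// Scalar tail: fewer than 32 output slots remain.
	for ; i < n; i++ {
		out[i] = in[i/8]>>(uint(i)%8)&1 == 1
	}
}
```

An output of 37 bools, for instance, takes one pass through the main loop and five through the tail.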
   
   ### Are these changes tested?
   
   All existing tests continue to pass. New tests and a benchmark were added for further validation:
   
   - TestBytesToBoolsCorrectness — validates every bit position against the 
reference Go implementation for sizes 1–1024 bytes
   - TestBytesToBoolsOutlenSmaller — edge case where output is smaller than 8× 
input
   - BenchmarkBytesToBools — parametric benchmark at 64B, 256B, 1KB, 4KB, 16KB
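
A correctness check in the spirit of the tests listed above could be structured as follows. This is a hedged sketch, not the test code from the PR: `referenceBytesToBools` and `checkAgainstReference` are hypothetical names, the random input and the "3 slots short" output length are illustrative choices, and the implementation under test is passed in as a function value.

```go
package main

import (
	"fmt"
	"math/rand"
)

// referenceBytesToBools is a plain Go reference: bit i of the input
// (LSB first within each byte) becomes out[i].
func referenceBytesToBools(in []byte, out []bool) {
	for i := range out {
		if i/8 >= len(in) {
			return
		}
		out[i] = in[i/8]>>(uint(i)%8)&1 == 1
	}
}

// checkAgainstReference compares impl with the reference for every input
// size from 1 to maxBytes and returns the first mismatch, if any.
func checkAgainstReference(impl func(in []byte, out []bool), maxBytes int) error {
	rng := rand.New(rand.NewSource(1)) // fixed seed keeps failures reproducible
	for size := 1; size <= maxBytes; size++ {
		in := make([]byte, size)
		for i := range in {
			in[i] = byte(rng.Intn(256))
		}
		// Check a full-length output and one shorter than 8x the input.
		for _, outLen := range []int{size * 8, size*8 - 3} {
			want := make([]bool, outLen)
			got := make([]bool, outLen)
			referenceBytesToBools(in, want)
			impl(in, got)
			for i := range want {
				if got[i] != want[i] {
					return fmt.Errorf("size=%d outLen=%d: bit %d = %v, want %v",
						size, outLen, i, got[i], want[i])
				}
			}
		}
	}
	return nil
}
```

In a real test, the function under test would be passed as `impl` with `maxBytes` set to 1024, and a non-nil error would fail the test.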
   
   
   ### Are there any user-facing changes?
   No, this is purely a performance optimization:
   
   *Benchmark Results (AMD Ryzen 7 7800X3D, linux/amd64)*
   
   ```
                                  baseline (scalar)   optimized (AVX2)
                                      sec/op              sec/op       vs base
   BytesToBools/bytes=64-16           146.0n              15.60n     -89.32% (p=0.008)
   BytesToBools/bytes=256-16          562.3n              63.36n     -88.73% (p=0.008)
   BytesToBools/bytes=1K-16           2.247µ              253.9n     -88.70% (p=0.008)
   BytesToBools/bytes=4K-16           8.970µ              1.018µ     -88.65% (p=0.008)
   BytesToBools/bytes=16K-16         35.798µ              4.044µ     -88.70% (p=0.008)
   geomean                            2.262µ              252.8n     -88.82%
   ```
   
   Throughput: 432 MiB/s → 3,853 MiB/s (+795%)
   Zero allocations in both versions. All results statistically significant.
   

