daniel-adam-tfs opened a new pull request, #654:
URL: https://github.com/apache/arrow-go/pull/654
### Rationale for this change
The byte-stream-split encoding is commonly used in Parquet for
floating-point data. It improves compression ratios by storing the k-th byte
of every value contiguously, so that similar bytes end up next to each other.
However, the existing Go implementation decodes with a simple scalar loop,
which is inefficient for large datasets. By leveraging SIMD instructions
(AVX2 on x86 and NEON on ARM), we can significantly accelerate the decoding
process and improve overall Parquet read performance.
### What changes are included in this PR?
Optimized implementation of the byte-stream-split decoding algorithm:
- Added SIMD-accelerated implementations:
  - AVX2 implementation for the amd64 architecture using 256-bit vectors,
    processing 32 values per block
  - NEON implementation for the arm64 architecture using 128-bit vectors,
    processing 16 values per block
  - Both use a 2-stage byte-unpacking hierarchy and follow the same algorithm
    structure
- Implemented runtime CPU feature detection with automatic dispatch to the
  best available implementation (SIMD vs. scalar fallback)
- Added proper build tags and file suffixes for cross-platform compatibility
- Included an optimized V2 scalar implementation using unsafe pointer casting
  as a fallback
### Are these changes tested?
Yes. Various tests were added:
- Correctness tests covering various input sizes (1, 2, 7, 8, 31, 32, 33,
63, 64, 65, 127, 128, 129, 255, 256, 512, 1024) to validate all implementations
(Reference, V2, AVX2, NEON)
- Edge case tests including exact block boundaries, single values, all-zero
data, and all-ones data
- Benchmark suite with multiple data sizes (8, 64, 512, 4096, 32768, 262144
values) comparing all implementations
### Are there any user-facing changes?
No user-facing API changes. This is a performance optimization that
maintains full backward compatibility. Users will automatically benefit from
faster Parquet decoding when reading files with byte-stream-split encoded
floating-point columns, with no code changes required.