zhaohaidao opened a new issue, #9307:
URL: https://github.com/apache/arrow-rs/issues/9307
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
<!--
A clear and concise description of what the problem is. Ex. I'm always
frustrated when [...]
(This section helps Arrow developers understand the context and *why* for
this feature, in addition to the *what*)
-->
We hit this while using **fluss-rust column projection**: as the number of
projected fields grows, LZ4 decompression cost grows rapidly. In practice, with
**174/775 projected columns**, consumption capacity drops sharply and
decompression becomes the dominant bottleneck. This points to the LZ4 IPC
decode path (arrow-ipc) as the critical limiter for high-cardinality
projections.
Concrete observations (same workload):
- arrow-rs (frame decoder): `avg_decode_ms ≈ 26.4`, `decode_util ≈ 99.8%`
- arrow-java (decoder=arrow): `avg_decode_ms ≈ 0.64`, `decode_util ≈ 11.6%`
This strongly suggests the streaming FrameDecoder path (state machine + Read
trait + buffer resize/zero-init) adds significant overhead beyond core LZ4
block decompression.
**Describe the solution you'd like**
<!--
A clear and concise description of what you want to happen.
-->
Add a direct LZ4 frame parsing path in `arrow-ipc` that avoids the streaming
FrameDecoder overhead while keeping the same correctness guarantees:
1) **Header parsing & validation**
- magic / version / reserved bits / block size / block independence
- header checksum (XXH32 >> 8)
- content size (if present)
- dictionary id: return error (same behavior as current path)
2) **Per-block handling**
- read block size + incompressible flag
- block checksum verification if enabled
- compressed block → `lz4_flex::block::decompress_into`
- incompressible block → direct copy
3) **Content checksum**
- compute XXH32 on decompressed output
- verify at end if enabled
4) **Output**
- pre-allocate `decompressed_size`, write directly into output
- avoid `read_exact + vec_resize_and_get_mut` / buffer re-init
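Step 1 above can be sketched in dependency-free Rust. Everything here is illustrative (`FrameInfo` and `parse_frame_header` are hypothetical names, not arrow-ipc API), and header-checksum verification is stubbed out because it needs an XXH32 implementation:

```rust
// Hypothetical direct LZ4 frame-header parser (illustrative only).

const LZ4_MAGIC: u32 = 0x184D_2204;

#[derive(Debug)]
struct FrameInfo {
    block_independence: bool,
    block_checksums: bool,
    content_checksum: bool,
    content_size: Option<u64>,
    block_max_size: usize,
    header_len: usize, // bytes consumed, including the header-checksum byte
}

fn parse_frame_header(input: &[u8]) -> Result<FrameInfo, String> {
    if input.len() < 7 {
        return Err("input too short for LZ4 frame header".into());
    }
    let magic = u32::from_le_bytes(input[0..4].try_into().unwrap());
    if magic != LZ4_MAGIC {
        return Err(format!("bad magic: {magic:#x}"));
    }
    let flg = input[4];
    let bd = input[5];
    if flg >> 6 != 0b01 {
        return Err("unsupported frame version".into());
    }
    if flg & 0b10 != 0 || bd & 0b1000_1111 != 0 {
        return Err("reserved bits set".into());
    }
    if flg & 0b1 != 0 {
        // Match the current arrow-ipc behavior: dictionary id unsupported.
        return Err("dictionary id not supported".into());
    }
    let mut pos = 6;
    let content_size = if flg & 0b1000 != 0 {
        if input.len() < pos + 8 + 1 {
            return Err("truncated content size".into());
        }
        let sz = u64::from_le_bytes(input[pos..pos + 8].try_into().unwrap());
        pos += 8;
        Some(sz)
    } else {
        None
    };
    // Block max size code 4..=7 maps to 64 KiB..4 MiB.
    let bs_code = (bd >> 4) & 0b111;
    if !(4..=7).contains(&bs_code) {
        return Err("invalid block max size".into());
    }
    let block_max_size = 1usize << (8 + 2 * bs_code);
    // Header checksum byte: (XXH32(descriptor) >> 8) & 0xFF.
    // Verification omitted in this sketch; a real implementation would
    // compute XXH32 over the descriptor bytes and compare.
    let _header_checksum = input[pos];
    pos += 1;
    Ok(FrameInfo {
        block_independence: flg & 0b10_0000 != 0,
        block_checksums: flg & 0b1_0000 != 0,
        content_checksum: flg & 0b100 != 0,
        content_size,
        block_max_size,
        header_len: pos,
    })
}
```

Parsing the fixed-size descriptor up front like this replaces the incremental state transitions the streaming decoder performs for the same bytes.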
Compatibility goal: no API changes; internal implementation swap. Keep
current semantics (dict id unsupported, checksum and size checks preserved).
Expected outcome (observed in a local prototype):
- arrow-rs (optimized): `avg_decode_ms ≈ 0.47`, `decode_util ≈ 8.6%`
**Describe alternatives you've considered**
<!--
A clear and concise description of any alternative solutions or features
you've considered.
-->
**Additional context**
<!--
Add any other context or screenshots about the feature request here.
-->
Top bottlenecks & mitigations:
1) **LZ4 block core decompression**
- Dominant CPU cost; cannot be removed.
- Optimization removes surrounding overhead so this becomes the only hot
path.
2) **FrameDecoder state machine + Read trait path**
- Significant overhead due to streaming logic and state transitions.
- Direct frame parsing removes it entirely.
3) **Buffer management / extra copies**
- `read_exact + vec_resize_and_get_mut` causes resize/zero-init overhead.
- Pre-allocated output + direct block writes reduce allocations.
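The per-block loop behind mitigation 3 can be sketched as follows, writing into a single pre-allocated buffer. Only the incompressible (stored) path is implemented here so the sketch stays dependency-free; the compressed branch is where `lz4_flex::block::decompress_into` would slot in:

```rust
// Illustrative per-block loop over an LZ4 frame body (after the header).
// Each block is a u32 LE size word (high bit set = stored uncompressed,
// zero = EndMark), followed by the block data.

fn decode_blocks(
    mut body: &[u8],
    decompressed_size: usize,
    block_checksums: bool,
) -> Result<Vec<u8>, String> {
    // One up-front allocation; a real implementation could also avoid
    // the zero fill rather than paying for re-init per read.
    let mut out = vec![0u8; decompressed_size];
    let mut written = 0;
    loop {
        if body.len() < 4 {
            return Err("truncated block header".into());
        }
        let word = u32::from_le_bytes(body[0..4].try_into().unwrap());
        body = &body[4..];
        if word == 0 {
            break; // EndMark
        }
        let uncompressed = word & 0x8000_0000 != 0;
        let len = (word & 0x7FFF_FFFF) as usize;
        if body.len() < len {
            return Err("truncated block".into());
        }
        let (block, rest) = body.split_at(len);
        body = rest;
        if block_checksums {
            // 4-byte XXH32 of the block data follows; verification
            // omitted in this sketch.
            body = body.get(4..).ok_or("truncated block checksum")?;
        }
        if uncompressed {
            // Incompressible block: raw copy straight into the output.
            out.get_mut(written..written + len)
                .ok_or("output overflow")?
                .copy_from_slice(block);
            written += len;
        } else {
            // Compressed block: this is where the real implementation
            // would call lz4_flex::block::decompress_into(block,
            // &mut out[written..]) and advance `written` by its return.
            return Err("compressed blocks not implemented in this sketch".into());
        }
    }
    out.truncate(written);
    Ok(out)
}
```

Writing each block directly at its final offset is what eliminates the intermediate `read_exact` buffer and the resize/zero-init on every read.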