zhaohaidao opened a new issue, #9307:
URL: https://github.com/apache/arrow-rs/issues/9307

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   <!--
   A clear and concise description of what the problem is. Ex. I'm always 
frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for 
this feature, in addition to  the *what*)
   -->
   We hit this while using **fluss-rust column projection**: as more fields are projected, LZ4 decompression cost grows rapidly. In practice, with **174/775 projected columns**, consumption capacity drops sharply and decompression becomes the dominant bottleneck. This points to the LZ4 IPC decode path (arrow-ipc) as the critical limiter for wide, many-column projections.
   
   Concrete observations (same workload):
   - arrow-rs (frame decoder): `avg_decode_ms ≈ 26.4`, `decode_util ≈ 99.8%`
   - arrow-java (decoder=arrow): `avg_decode_ms ≈ 0.64`, `decode_util ≈ 11.6%`
   
   This strongly suggests the streaming FrameDecoder path (state machine + Read 
trait + buffer resize/zero-init) adds significant overhead beyond core LZ4 
block decompression.
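   
   For context, a simplified sketch of what the streaming path amounts to, assuming the decode funnels the in-memory IPC buffer through `lz4_flex::frame::FrameDecoder` via the `Read` trait (the function name is illustrative):
   
   ```rust
   use std::io::Read;
   
   /// Illustrative only: the compressed IPC buffer is already fully in memory,
   /// yet it is pulled through a stateful frame decoder one Read call at a time.
   fn decompress_streaming(input: &[u8], decompressed_size: usize) -> std::io::Result<Vec<u8>> {
       let mut output = Vec::with_capacity(decompressed_size);
       let mut decoder = lz4_flex::frame::FrameDecoder::new(input);
       // read_to_end drives the decoder's internal state machine through the
       // Read trait, resizing internal buffers as blocks are produced.
       decoder.read_to_end(&mut output)?;
       Ok(output)
   }
   ```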
   
   
   **Describe the solution you'd like**
   <!--
   A clear and concise description of what you want to happen.
   -->
   
   Add a direct LZ4 frame parsing path in `arrow-ipc` that avoids the streaming FrameDecoder overhead while keeping the same correctness guarantees (a sketch of this path follows the list below):
   
   1) **Header parsing & validation**
      - magic / version / reserved bits / block size / block independence
      - header checksum (XXH32 >> 8)
      - content size (if present)
      - dictionary id: return error (same behavior as current path)
   
   2) **Per-block handling**
      - read block size + incompressible flag
      - block checksum verification if enabled
      - compressed block → `lz4_flex::block::decompress_into`
      - incompressible block → direct copy
   
   3) **Content checksum**
      - compute XXH32 on decompressed output
      - verify at end if enabled
   
   4) **Output**
      - pre-allocate the output buffer at `decompressed_size` and write each block directly into it
      - avoid the `read_exact` + `vec_resize_and_get_mut` resize/re-init pattern
   
   Compatibility goal: no API changes, purely an internal implementation swap. Current semantics are kept (dict id unsupported; checksum and size checks preserved).
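   
   A rough sketch of the direct path under these assumptions (all names and the error type are illustrative, not proposed arrow-ipc API; checksum verification is left as comments for brevity):
   
   ```rust
   use lz4_flex::block::decompress_into;
   
   const LZ4F_MAGIC: u32 = 0x184D_2204;
   
   /// Hypothetical direct frame decoder. `decompressed_size` comes from the
   /// IPC buffer header, so the output is allocated exactly once up front.
   fn decompress_lz4_frame(input: &[u8], decompressed_size: usize) -> Result<Vec<u8>, String> {
       fn read_u32(buf: &[u8], at: usize) -> Result<u32, String> {
           buf.get(at..at + 4)
               .map(|b| u32::from_le_bytes(b.try_into().unwrap()))
               .ok_or_else(|| "truncated frame".to_string())
       }
   
       // 1) Header parsing & validation
       let mut pos = 0;
       if read_u32(input, pos)? != LZ4F_MAGIC {
           return Err("bad LZ4 frame magic".into());
       }
       pos += 4;
       if input.len() < pos + 3 {
           return Err("truncated frame header".into());
       }
       let flg = input[pos];
       pos += 2; // FLG + BD bytes
       if flg >> 6 != 0b01 {
           return Err("unsupported LZ4 frame version".into());
       }
       let block_checksum = flg & 0b0001_0000 != 0;
       let content_checksum = flg & 0b0000_0100 != 0;
       if flg & 0b0000_0001 != 0 {
           // Same semantics as the current path: dictionary id is unsupported.
           return Err("LZ4 dictionaries not supported".into());
       }
       if flg & 0b0000_1000 != 0 {
           pos += 8; // content size; real code cross-checks `decompressed_size`
       }
       pos += 1; // header checksum byte; real code verifies XXH32(descriptor) >> 8
   
       // 2) Per-block handling
       let mut output = vec![0u8; decompressed_size];
       let mut written = 0;
       loop {
           let word = read_u32(input, pos)?;
           pos += 4;
           if word == 0 {
               break; // EndMark
           }
           let len = (word & 0x7FFF_FFFF) as usize;
           let block = input
               .get(pos..pos + len)
               .ok_or_else(|| "truncated block".to_string())?;
           pos += len;
           if block_checksum {
               pos += 4; // real code verifies XXH32(block) here
           }
           if word & 0x8000_0000 != 0 {
               // Incompressible block: straight copy into the output.
               if written + len > output.len() {
                   return Err("block overflows output".into());
               }
               output[written..written + len].copy_from_slice(block);
               written += len;
           } else {
               // Compressed block: decode directly into the output slice.
               // (Assumes block independence, checked via FLG bit 5 in real code.)
               written += decompress_into(block, &mut output[written..])
                   .map_err(|e| e.to_string())?;
           }
       }
   
       // 3) Content checksum
       if content_checksum {
           let _stored = read_u32(input, pos)?; // real code verifies XXH32(output)
       }
   
       // 4) Output
       output.truncate(written);
       Ok(output)
   }
   ```
   
   The key design point is that `decompressed_size` from the IPC buffer header lets the output be allocated once, with each block decoded into a disjoint slice of it, so no intermediate buffers or Read-trait plumbing are involved.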
   
   Expected outcome (observed in local prototype):
   - arrow-rs (optimized): `avg_decode_ms ≈ 0.47`, `decode_util ≈ 8.6%`
   
   
   **Describe alternatives you've considered**
   <!--
   A clear and concise description of any alternative solutions or features 
you've considered.
   -->
   
   **Additional context**
   <!--
   Add any other context or screenshots about the feature request here.
   -->
   
   Top bottlenecks & mitigations:
   1) **LZ4 block core decompression**
      - Dominant CPU cost; cannot be removed.
      - The optimization removes the surrounding overhead, leaving block decompression as the only hot path.
   2) **FrameDecoder state machine + Read trait path**
      - Significant overhead due to streaming logic and state transitions.
      - Direct frame parsing removes it entirely.
   3) **Buffer management / extra copies**
      - `read_exact` + `vec_resize_and_get_mut` causes resize/zero-init overhead.
      - A pre-allocated output with direct block writes removes the repeated allocations (see the sketch below).
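   
   To make point 3 concrete, a toy contrast of the two buffer strategies (hypothetical helpers, not arrow-ipc code):
   
   ```rust
   use std::io::Read;
   
   /// Streaming style: repeatedly grow the buffer (zero-initializing each new
   /// tail) and read into it through the Read trait.
   fn grow_and_fill(mut reader: impl Read, chunk: usize, total: usize) -> std::io::Result<Vec<u8>> {
       let mut buf = Vec::new();
       while buf.len() < total {
           let old = buf.len();
           buf.resize(old + chunk.min(total - old), 0); // zero-init on every step
           reader.read_exact(&mut buf[old..])?;
       }
       Ok(buf)
   }
   
   /// Direct style: one allocation sized from the IPC header; each block is
   /// then decoded into a disjoint slice of this buffer, with no extra copies.
   fn preallocate_once(total: usize) -> Vec<u8> {
       vec![0u8; total]
   }
   ```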
   
   

