[PR] Reduce one copy in `SerializedPageReader` [arrow-rs]

via GitHub Thu, 30 Oct 2025 11:38:33 -0700


XiangpengHao opened a new pull request, #8745:
URL: https://github.com/apache/arrow-rs/pull/8745


   This was originally found by @MikeWalrus
   
   Basically the ChunkReader for the async reader is `ColumnChunkData`: 
https://github.com/apache/arrow-rs/blob/2eabb595d20e691cf0c9c3ccf6a5e1b67472b07b/parquet/src/arrow/in_memory_row_group.rs#L282-L292
   
   `ColumnChunkData` is itself backed by `Bytes`. The original implementation 
copied the data out of it, only to wrap that copy in a new `Bytes` later.
   This PR removes that copy.
   
   Normally this should mean performance improvements across the board, but 
there are some nuances:
   1. Zero-copy means we need to hold the underlying buffer longer
   2. The original implementation "accidentally" (I'm not sure) gc'ed the 
buffer: once the data was copied out, the original buffer could be freed
   3. To show a meaningful performance difference, we need to use a proper 
allocator, e.g., mimalloc
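
To make the trade-off in points 1 and 2 concrete, here is a minimal std-only sketch. `SharedBuf` is a hypothetical stand-in for a reference-counted buffer such as `Bytes`; it is not the arrow-rs code or the `bytes` crate API, and the method names `slice`, `copy_out`, `owners`, and `backing_len` are assumptions chosen for illustration.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for a ref-counted byte buffer (not `bytes::Bytes`).
#[derive(Clone)]
struct SharedBuf {
    data: Arc<Vec<u8>>,
    range: Range<usize>,
}

impl SharedBuf {
    fn new(data: Vec<u8>) -> Self {
        let range = 0..data.len();
        Self { data: Arc::new(data), range }
    }

    /// Zero-copy: the returned view shares the original allocation.
    fn slice(&self, range: Range<usize>) -> Self {
        let start = self.range.start + range.start;
        let end = self.range.start + range.end;
        assert!(start <= end && end <= self.range.end, "slice out of bounds");
        Self { data: Arc::clone(&self.data), range: start..end }
    }

    /// Copying: allocates a fresh buffer holding only the requested bytes.
    fn copy_out(&self, range: Range<usize>) -> Self {
        Self::new(self.slice(range).as_slice().to_vec())
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.range.clone()]
    }

    /// Number of handles keeping the backing allocation alive.
    fn owners(&self) -> usize {
        Arc::strong_count(&self.data)
    }

    /// Size of the allocation this handle keeps alive.
    fn backing_len(&self) -> usize {
        self.data.len()
    }
}

fn main() {
    // A 1 KiB "column chunk" from which we want one 8-byte "page".
    let chunk = SharedBuf::new((0..=255u8).cycle().take(1024).collect());

    let copied = chunk.copy_out(100..108); // extra allocation + memcpy
    let shared = chunk.slice(100..108); // no copy, shares the chunk

    assert_eq!(copied.as_slice(), shared.as_slice());
    println!("handles on chunk allocation: {}", chunk.owners());

    // After dropping the chunk handle, the copied page pins only its own
    // bytes, while the shared page still pins the whole chunk allocation.
    drop(chunk);
    println!("copied pins {} bytes", copied.backing_len());
    println!("shared pins {} bytes", shared.backing_len());
}
```

The copied page pays for an allocation and a memcpy but lets the large buffer be released as soon as its other handles are dropped; the shared page avoids the copy but keeps the whole buffer alive for as long as the page exists.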
   
   tldr: with mimalloc, this change always improves performance, or is at 
least as fast as the original implementation (tested locally with 
`arrow_reader_clickbench`)
   
   cc @tustvold and @alamb who might know this better


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
