junyan-ling commented on PR #689: URL: https://github.com/apache/arrow-go/pull/689#issuecomment-4034119509
> We just did a release last week, so ideally we'd wait a couple weeks at least before doing a new release. Depending on your own requirements, you could update your own go.mod to point at the commit hash directly (or use a replace directive) if you're okay with that. Is it urgent for you that there be a release with this change soon?

Thanks @zeroshade! Sounds good, I can build an internal version cherry-picking this commit.

Yet I need to point out that, after cherry-picking this change, the profiling files from our e2e benchmark k8s jobs show that CPU time is not dramatically improved. After further investigation, here are the reasons:

1. Pre-allocation only ensures the destination buffer is large enough; it does not eliminate the per-value copy. When we read a parquet page, the decompressed bytes live in the page buffer. Each value is then copied individually from the page buffer into the BinaryBuilder's data buffer via bufferBuilder.Append → copy() → memmove. This copy happens for every single value regardless of how large the destination buffer is. Pre-allocation prevents the destination from needing to resize mid-batch, but the fundamental copy from source to destination is unavoidable. This accounts for most of the memmove cost in the profiles.
2. Pre-allocation itself has a CPU cost: zeroing. Both Go's runtime and Arrow's memory allocator zero-fill all newly allocated memory. When ReserveData pre-allocates the data buffer, every byte gets zeroed via memset before any data is written. This zeroing is wasted work, since we're about to overwrite every byte with actual values.
