hamilton-earthscope opened a new pull request, #557: URL: https://github.com/apache/arrow-go/pull/557
### Rationale for this change

When writing data to partitioned Iceberg tables using `iceberg-go`'s new partitioned write capability, I found that Arrow's `compute.Take` operation made those writes significantly slower than unpartitioned writes. The cause: the `VarBinaryImpl` function here in `arrow-go` pre-allocates the data buffer with only `meanValueLen` (room for just one string/buffer), not the total size needed for all values. For my use case (writing 100k rows at a time), this caused roughly 15 buffer reallocations as the buffer incrementally scaled up. For my Iceberg table with 20+ string/binary columns of various sizes, this is significant overhead.

### What changes are included in this PR?

Pre-allocate up front with a better estimate of the necessary buffer size to eliminate repeated reallocations. I somewhat arbitrarily chose a cap of 16MB as a guess at what would be effective at reducing the number of allocations, while staying mindful that the library is used in many bespoke scenarios and a large upfront allocation could cause memory spikes. For my use case I never hit the 16MB threshold, and it could be smaller. I'd welcome input on whether there should be a cap at all, and if so, what a reasonable cap would be.

### Are these changes tested?

No unit tests. However, for my use case (writing 100k rows to a partitioned Iceberg table):
- before: `table.Append()` took 30s on average
- after: `table.Append()` takes 2.5s on average

### Are there any user-facing changes?

No; just more performant.
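To illustrate the idea, here is a minimal sketch of the reservation strategy described above: reserve `meanValueLen * numValues` up front, capped to bound the upfront allocation. The function name, signature, and the 16MiB constant are illustrative only, not the actual `arrow-go` code in this PR.

```go
package main

import "fmt"

// maxReserve caps the upfront reservation to avoid large memory spikes
// in bespoke scenarios; 16 MiB is the arbitrary cap discussed above.
const maxReserve = 16 * 1024 * 1024

// estimateReserve (hypothetical helper) returns the number of bytes to
// pre-allocate for the value buffer: the mean value length times the
// number of values being taken, clamped to maxReserve. Reserving this
// once replaces the repeated doublings that start from meanValueLen.
func estimateReserve(meanValueLen, numValues int) int {
	total := meanValueLen * numValues
	if total > maxReserve {
		return maxReserve
	}
	return total
}

func main() {
	// 100k rows of ~32-byte strings: ~3.2 MB reserved in one allocation,
	// rather than growing incrementally from a single value's worth.
	fmt.Println(estimateReserve(32, 100_000))
}
```

With very wide values (e.g. 1KiB strings across 100k rows), the estimate exceeds the cap and the buffer still grows past 16MiB via normal reallocation, so the cap only bounds the initial reservation, not the final size.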
