hamilton-earthscope opened a new pull request, #557: URL: https://github.com/apache/arrow-go/pull/557
### Rationale for this change

When writing data to partitioned Iceberg tables using `iceberg-go`'s new partitioned write capability, I found that Arrow's `compute.Take` operation made those writes significantly slower than unpartitioned writes. The cause: the `VarBinaryImpl` function here in `arrow-go` pre-allocates the data buffer with only `meanValueLen` (room for just one string/buffer), not the total size needed for all values. For my use case (writing 100k rows at a time), this caused roughly 15 buffer reallocations as the buffer incrementally scaled up. For my Iceberg table with 20+ string/binary columns of various sizes, this is significant overhead.

### What changes are included in this PR?

Pre-allocate up front with a better estimate of the necessary buffer size to eliminate repeated reallocations. I somewhat arbitrarily chose a cap of 16MB as a guess at what would be effective at reducing the number of allocations, while staying mindful that the library is used in many bespoke scenarios and a large upfront allocation could cause memory spikes. For my use case I never hit the 16MB threshold, and it could be smaller. I'd welcome input on whether there should be a cap at all, and if so, what a reasonable cap would be.

### Are these changes tested?

No unit tests. However, for my use case (writing 100k rows to a partitioned Iceberg table):
- before: `table.Append()` took 30s on average
- after: `table.Append()` takes 2.5s on average

### Are there any user-facing changes?

No; just more performant.
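To illustrate the idea, here is a minimal sketch of the reservation strategy described above: reserve `meanValueLen * numValues` up front, capped to bound the upfront allocation. The function name, signature, and the 16MiB constant are illustrative only, not the actual `arrow-go` code in this PR.

```go
package main

import "fmt"

// maxReserve caps the upfront reservation to avoid large memory spikes
// in bespoke scenarios; 16 MiB is the arbitrary cap discussed above.
const maxReserve = 16 * 1024 * 1024

// estimateReserve (hypothetical helper) returns the number of bytes to
// pre-allocate for the value buffer: the mean value length times the
// number of values being taken, clamped to maxReserve. Reserving this
// once replaces the repeated doublings that start from meanValueLen.
func estimateReserve(meanValueLen, numValues int) int {
	total := meanValueLen * numValues
	if total > maxReserve {
		return maxReserve
	}
	return total
}

func main() {
	// 100k rows of ~32-byte strings: ~3.2 MB reserved in one allocation,
	// rather than growing incrementally from a single value's worth.
	fmt.Println(estimateReserve(32, 100_000))
}
```

With very wide values (e.g. 1KiB strings across 100k rows), the estimate exceeds the cap and the buffer still grows past 16MiB via normal reallocation, so the cap only bounds the initial reservation, not the final size.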
