andishgar commented on PR #46730: URL: https://github.com/apache/arrow/pull/46730#issuecomment-3016821941
@mapleFU @pitrou I believe this pull request is related to several other PRs I've submitted. Here's a summary:

1- API and Handling of the Last Buffer

In [this pull request](https://github.com/apache/arrow/pull/46655), I demonstrated that it’s possible to [share buffers](https://github.com/apache/arrow/blob/a5dfadba3626c082235d9ea22db6f2cb22398d9a/cpp/src/arrow/array/builder_binary.cc#L90) without copying or finalizing the last buffer. This avoids [relocating the buffer](https://github.com/apache/arrow/blob/ed13cedd8bf7ddc06db152f97e68d86c2c37e949/cpp/src/arrow/array/builder_binary.h#L563) to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes.

2-

> Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.

In [this pull request](https://github.com/apache/arrow/pull/46229), I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in [this issue](https://github.com/apache/arrow/issues/45639), I think this metadata could help determine when CompactArray should be called.

Overall, my suggestion is to either modify this pull request or create a new API to support buffer sharing. It is possible to decide whether a created array should be compacted based on some metadata, in order to avoid memory bloat.
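
To make the metadata-based decision above concrete, here is a minimal sketch of one possible heuristic. It is not Arrow code: `ViewBufferStats`, `ShouldCompact`, and the 50% threshold are hypothetical names and values chosen for illustration, and the sketch assumes the two byte counts can be tracked cheaply while the array is built.

```cpp
#include <cstdint>

// Hypothetical bookkeeping for a string-view array whose data buffers may be
// shared with other arrays. This struct does not exist in Arrow.
struct ViewBufferStats {
  int64_t total_buffer_bytes;  // bytes kept alive by all data buffers of the array
  int64_t referenced_bytes;    // bytes actually pointed to by non-inline views
};

// Returns true when the fraction of buffer bytes that no view references
// exceeds max_waste_ratio (an assumed tuning knob, not an Arrow option).
bool ShouldCompact(const ViewBufferStats& stats, double max_waste_ratio = 0.5) {
  // If every string is stored inline (<= 12 bytes), there is nothing to reclaim.
  if (stats.total_buffer_bytes <= 0) return false;
  const int64_t wasted = stats.total_buffer_bytes - stats.referenced_bytes;
  return static_cast<double>(wasted) /
             static_cast<double>(stats.total_buffer_bytes) >
         max_waste_ratio;
}
```

With these assumptions, an array that references 100 bytes of a shared 1024-byte buffer would be compacted, while one that references 900 of those bytes would keep sharing the buffer and skip the copy.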
