alamb opened a new issue, #7765: URL: https://github.com/apache/arrow-rs/issues/7765
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Time spent in [`GenericInProgressArray`](https://github.com/apache/arrow-rs/blob/2b40d1dfc35862ff350a40dfbc66f8a14f4eea31/arrow-select/src/coalesce/generic.rs#L31) is almost entirely dominated by copying strings to new buffers.

It copies to new buffers to avoid accumulating a large number of buffers that each have only a small number of rows pointing at them. It has several optimizations that avoid this copy, or make it cheaper, when possible. The coalesce kernel also has special logic to recycle string view buffers when they are not used much (TODO link).

I have an as-yet-unproven hypothesis that we could speed up the coalesce kernel by special-casing consecutive inputs that share the same underlying buffer. The high-level idea is that when reading from Parquet, the same string buffer will be used for several batches, so if the coalesce kernel detected this, it might be able to avoid some copies.

I intend to use the `coalesce` kernel to make Parquet reading faster.

**Describe the solution you'd like**

Make the `coalesce` kernel faster in benchmarks.

**Describe alternatives you've considered**

The first thing I would do is check, using an actual Parquet benchmark, that the same `Buffer`s are used for multiple `RecordBatch`es that come out of the reader:

```shell
cargo bench --features=arrow,async --bench arrow_reader_clickbench
```

If that is the case, I would then write a benchmark that replicates the pattern (e.g. create a record batch with 32K rows, then slice it up and send it in 8K-row chunks).

Then I would try to optimize it. For example, check pointer equality and delay the string copies until the kernel sees a new buffer pointer.
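To make the proposed benchmark pattern and the pointer-equality check concrete, here is a minimal std-only sketch. It does not use arrow-rs: `Batch`, `make_batch`, `slice`, and `count_buffer_runs` are hypothetical stand-ins, with an `Arc<[u8]>` playing the role of an Arrow `Buffer`. It only illustrates the idea that zero-copy slices of one 32K-row batch share a buffer, and that a consumer could detect this by comparing pointers.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow `Buffer`: shared, immutable bytes.
type Buffer = Arc<[u8]>;

/// Hypothetical stand-in for a string-view batch:
/// (offset, len) views into one shared data buffer.
#[derive(Clone)]
struct Batch {
    buffer: Buffer,
    views: Vec<(usize, usize)>,
}

impl Batch {
    /// Zero-copy slice: shares the data buffer and copies only the views.
    fn slice(&self, start: usize, len: usize) -> Batch {
        Batch {
            buffer: Arc::clone(&self.buffer),
            views: self.views[start..start + len].to_vec(),
        }
    }

    /// Returns the string stored at row `i`.
    fn value(&self, i: usize) -> &str {
        let (offset, len) = self.views[i];
        std::str::from_utf8(&self.buffer[offset..offset + len]).unwrap()
    }
}

/// Builds a batch of `rows` strings ("row-0", "row-1", ...) in one buffer.
fn make_batch(rows: usize) -> Batch {
    let mut bytes = Vec::new();
    let mut views = Vec::with_capacity(rows);
    for i in 0..rows {
        let s = format!("row-{}", i);
        views.push((bytes.len(), s.len()));
        bytes.extend_from_slice(s.as_bytes());
    }
    Batch { buffer: Arc::from(bytes), views }
}

/// Counts runs of consecutive batches that share the same data buffer.
/// A coalescer could defer string copies until a run ends, i.e. until
/// it sees a new buffer pointer.
fn count_buffer_runs(batches: &[Batch]) -> usize {
    let mut runs = 0;
    let mut prev: Option<&Buffer> = None;
    for batch in batches {
        let same = prev.map_or(false, |p| Arc::ptr_eq(p, &batch.buffer));
        if !same {
            runs += 1;
        }
        prev = Some(&batch.buffer);
    }
    runs
}
```

Slicing `make_batch(32 * 1024)` into four 8192-row chunks and passing them to `count_buffer_runs` yields a single run, which is the case the proposed optimization targets. Whether batches produced by the Parquet reader actually share buffers this way is exactly what the first step above is meant to verify.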
