alamb opened a new issue, #7765:
URL: https://github.com/apache/arrow-rs/issues/7765

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
The runtime of [`GenericInProgressArray`](https://github.com/apache/arrow-rs/blob/2b40d1dfc35862ff350a40dfbc66f8a14f4eea31/arrow-select/src/coalesce/generic.rs#L31)
 is almost entirely dominated by copying strings to new buffers. It copies to 
new buffers to avoid accumulating a large number of buffers that are each 
referenced by only a small number of rows.
   
   It has several optimizations that avoid this copy, or make it cheaper, when 
possible.
   The coalesce kernel also has special logic to recycle string view buffers 
when they are not used much (TODO link).
   
   I have an as-yet unproven thesis that we could speed up the coalesce kernel 
by special-casing inputs whose underlying buffer is the same.
   
   The high-level idea is that when reading from Parquet, the same string 
buffer will be used for several batches, so if the coalesce kernel detected 
this, we might be able to avoid some copies. I intend to use the `coalesce` 
kernel to make Parquet reading faster.
   
   
   **Describe the solution you'd like**
   
   Make the `coalesce` kernel benchmarks faster
   
   
   **Describe alternatives you've considered**
   
   The first thing I would do is use an actual Parquet benchmark to check 
whether the same `Buffer`s are used for multiple `RecordBatch`es that come out 
of the reader:
   ```shell
   cargo bench --features=arrow,async --bench arrow_reader_clickbench
   ```
   
   If that is the case, I would then make a benchmark that replicates the 
pattern (e.g. create a record batch with 32K rows, and then slice it up and 
send it in 8K-row chunks).
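
   As a rough illustration of the input pattern such a benchmark would 
produce, here is a minimal std-only sketch. `Chunk`, `slice_into_chunks` and 
`all_share_one_buffer` are hypothetical names, and `Arc<Vec<u8>>` is only a 
stand-in for an Arrow `Buffer`; none of this is arrow-rs API.

   ```rust
   use std::sync::Arc;

   /// Hypothetical stand-in for a slice of a string array: a window of rows
   /// over a shared, reference-counted data buffer.
   #[derive(Clone, Debug)]
   struct Chunk {
       buffer: Arc<Vec<u8>>, // shared backing storage, not copied by slicing
       row_offset: usize,
       num_rows: usize,
   }

   /// Split `total_rows` rows backed by `buffer` into chunks of at most
   /// `chunk_rows` rows. Only the reference count of `buffer` changes.
   fn slice_into_chunks(
       buffer: &Arc<Vec<u8>>,
       total_rows: usize,
       chunk_rows: usize,
   ) -> Vec<Chunk> {
       assert!(chunk_rows > 0, "chunk_rows must be non-zero");
       (0..total_rows)
           .step_by(chunk_rows)
           .map(|row_offset| Chunk {
               buffer: Arc::clone(buffer),
               row_offset,
               num_rows: chunk_rows.min(total_rows - row_offset),
           })
           .collect()
   }

   /// True if every chunk points at the same underlying allocation.
   fn all_share_one_buffer(chunks: &[Chunk]) -> bool {
       chunks
           .windows(2)
           .all(|w| Arc::ptr_eq(&w[0].buffer, &w[1].buffer))
   }
   ```

   Slicing 32K rows into 8K-row chunks this way yields four chunks that all 
pass the pointer-equality check, which is the situation the coalesce kernel 
would need to detect.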
   
   Then I would try to optimize it. For example, the kernel could check 
pointer equality and delay the string copies until it sees a new buffer pointer.
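
   One possible reading of "delay the copies until a new buffer pointer is 
seen" is sketched below, again with std-only stand-ins. `DeferredCopier` is a 
hypothetical type, `Arc<Vec<u8>>` stands in for an Arrow `Buffer`, and merging 
adjacent byte ranges is my own assumption about where savings might come from, 
not something the existing kernel does.

   ```rust
   use std::sync::Arc;

   /// Hypothetical coalescer that defers copying while input keeps arriving
   /// from the same source buffer.
   struct DeferredCopier {
       current: Option<Arc<Vec<u8>>>, // buffer the pending ranges point into
       pending: Vec<(usize, usize)>,  // (start, end) byte ranges not yet copied
       output: Vec<u8>,               // coalesced bytes copied so far
       flushes: usize,                // number of deferred copy passes performed
   }

   impl DeferredCopier {
       fn new() -> Self {
           Self { current: None, pending: Vec::new(), output: Vec::new(), flushes: 0 }
       }

       /// Record bytes `start..end` of `buffer`. Nothing is copied unless
       /// `buffer` is a different allocation than the previous input.
       fn push(&mut self, buffer: &Arc<Vec<u8>>, start: usize, end: usize) {
           let same = matches!(&self.current, Some(cur) if Arc::ptr_eq(cur, buffer));
           if !same {
               self.flush();
               self.current = Some(Arc::clone(buffer));
           }
           // Adjacent ranges of one buffer collapse into one larger range.
           match self.pending.last_mut() {
               Some(last) if last.1 == start => last.1 = end,
               _ => self.pending.push((start, end)),
           }
       }

       /// Copy all pending ranges of the current buffer into `output`.
       fn flush(&mut self) {
           if let Some(buffer) = self.current.take() {
               for (start, end) in self.pending.drain(..) {
                   self.output.extend_from_slice(&buffer[start..end]);
               }
               self.flushes += 1;
           }
       }
   }
   ```

   With this shape, consecutive slices of the same buffer cost a single larger 
copy when the buffer changes (or at the final `flush`) instead of one copy per 
incoming slice. Whether that is actually faster would need to be measured.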
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
