[I] Benchmark for filter+concat and take+concat into even sized record batches [arrow-rs]

via GitHub Mon, 02 Jun 2025 04:39:02 -0700


alamb opened a new issue, #7589:
URL: https://github.com/apache/arrow-rs/issues/7589


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   It is a common pattern in DataFusion to take some subset of rows from an 
input stream of RecordBatches and produces evenly sized (e.g. 8192 bytes) 
output RecordBatches. The subset is either described by a BooleanArray 
(`filter`) or a set of indices (`take`) This is explained here:
   - https://github.com/apache/arrow-rs/issues/6692
   
   I have hit the need for the same low level primitive while working on 
improving parquet filtering performance
   - https://github.com/apache/arrow-rs/issues/7456
   
   The current way to accomplish this is to apply the `filter` kernel to make 
several smaller RecordBatches and then `concat` to put them all together. This 
mechanism has two disavantages:
   1. It copies the data twice (which is especially slow for Utf8/Binary and 
other variable length arrays) with multiple allocations
   2. It requiring *buffering* all the input batches until enough are ready to 
form the output (which is especially problematic for Utf8View and other view 
types where the result of filtering may still hold significant amounts of 
memory, and sometimes requires copying the views more than once (see 
[here](https://github.com/apache/datafusion/blob/9d2f04996604e709ee440b65f41e7b882f50b788/datafusion/physical-plan/src/coalesce/mod.rs#L222-L221))
   
   However, it turns out that the filter and concat kernels are very highly 
optimized so getting better performance is actually quite challenging as 
described on https://github.com/apache/arrow-rs/pull/7513
   
   **Describe the solution you'd like**
   I would like to optimize the filter+concat operation, especially for 
variable length data arrays
   
   To do so let's start with a benchmark
   
   **Describe alternatives you've considered**
   <!--
   A clear and concise description of any alternative solutions or features 
you've considered.
   -->
   
   **Additional context**
   <!--
   Add any other context or screenshots about the feature request here.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Benchmark for filter+concat and take+concat into even sized record batches [arrow-rs]

Reply via email to