alamb opened a new issue, #7589: URL: https://github.com/apache/arrow-rs/issues/7589
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** It is a common pattern in DataFusion to take some subset of rows from an input stream of RecordBatches and produces evenly sized (e.g. 8192 bytes) output RecordBatches. The subset is either described by a BooleanArray (`filter`) or a set of indices (`take`) This is explained here: - https://github.com/apache/arrow-rs/issues/6692 I have hit the need for the same low level primitive while working on improving parquet filtering performance - https://github.com/apache/arrow-rs/issues/7456 The current way to accomplish this is to apply the `filter` kernel to make several smaller RecordBatches and then `concat` to put them all together. This mechanism has two disavantages: 1. It copies the data twice (which is especially slow for Utf8/Binary and other variable length arrays) with multiple allocations 2. It requiring *buffering* all the input batches until enough are ready to form the output (which is especially problematic for Utf8View and other view types where the result of filtering may still hold significant amounts of memory, and sometimes requires copying the views more than once (see [here](https://github.com/apache/datafusion/blob/9d2f04996604e709ee440b65f41e7b882f50b788/datafusion/physical-plan/src/coalesce/mod.rs#L222-L221)) However, it turns out that the filter and concat kernels are very highly optimized so getting better performance is actually quite challenging as described on https://github.com/apache/arrow-rs/pull/7513 **Describe the solution you'd like** I would like to optimize the filter+concat operation, especially for variable length data arrays To do so let's start with a benchmark **Describe alternatives you've considered** <!-- A clear and concise description of any alternative solutions or features you've considered. --> **Additional context** <!-- Add any other context or screenshots about the feature request here. --> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org