pchintar opened a new issue, #9947:
URL: https://github.com/apache/arrow-rs/issues/9947
## Description
Currently, `BatchCoalescer::push_batch_with_filter` materializes a filtered
`RecordBatch` before coalescing it into output batches.
This introduces unnecessary intermediate array allocations and duplicate
value copies during filtered coalescing, especially for numeric and timestamp
columns.
---
## Root Cause
In `arrow-select/src/coalesce.rs`, filtered coalescing is structured as:
```text
RecordBatch
→ filter_record_batch()
→ temporary filtered RecordBatch
→ push_batch()
→ coalesced output
```
The current implementation is:
```rust
pub fn push_batch_with_filter(
    &mut self,
    batch: RecordBatch,
    filter: &BooleanArray,
) -> Result<(), ArrowError> {
    let filtered_batch = filter_record_batch(&batch, filter)?;
    self.push_batch(filtered_batch)
}
```
This means selected values can be copied twice:
```text
1. filter_record_batch() copies selected values into temporary filtered
arrays
2. push_batch() copies those values again into the coalescer output buffers
```
---
## Current Behavior
For filtered batches:
```text
1. Allocate temporary filtered arrays
2. Copy selected values into temporary arrays
3. Build temporary filtered RecordBatch from those arrays
4. Copy selected values again into coalescer buffers
5. Drop temporary arrays and RecordBatch
```
### Implications
* unnecessary temporary array allocations
* duplicate value copies
* extra null bitmap materialization
* additional allocator and memory overhead
* increased latency in filtered coalescing workloads
---
## Proposed Solution
Filtered batch coalescing should ideally avoid materializing temporary
filtered arrays for numeric and timestamp columns.
Instead, selected values could be appended directly into the coalescer
output buffers during filtered coalescing.
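
To illustrate the idea, here is a minimal, std-only sketch for a single non-null `i64` column. It does not use arrow-rs APIs: `InProgressColumn`, `push_filtered_two_step`, and `push_filtered_direct` are hypothetical names, the filter is a plain `&[bool]` rather than a `BooleanArray`, and null handling is left out. It only contrasts the two copy patterns and is not a proposed implementation.

```rust
/// Hypothetical stand-in for one primitive output column inside the coalescer.
/// This is NOT the actual arrow-rs in-progress array type.
struct InProgressColumn {
    values: Vec<i64>,
}

impl InProgressColumn {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    /// Mirrors the current structure: materialize a temporary filtered
    /// buffer, then copy it into the output buffer (two copies per value).
    fn push_filtered_two_step(&mut self, values: &[i64], filter: &[bool]) {
        assert_eq!(values.len(), filter.len());
        // Copy 1: selected values go into a temporary allocation.
        let temp: Vec<i64> = values
            .iter()
            .zip(filter)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v)
            .collect();
        // Copy 2: the temporary is copied into the output buffer.
        self.values.extend_from_slice(&temp);
    }

    /// Sketch of the proposed structure: append selected values straight
    /// into the output buffer (one copy per value, no temporary).
    fn push_filtered_direct(&mut self, values: &[i64], filter: &[bool]) {
        assert_eq!(values.len(), filter.len());
        // Count the selected rows first so the output grows at most once.
        let selected = filter.iter().filter(|keep| **keep).count();
        self.values.reserve(selected);
        for (v, keep) in values.iter().zip(filter) {
            if *keep {
                self.values.push(*v);
            }
        }
    }
}
```

Both methods leave the same values in the output buffer; the direct path just skips the intermediate allocation and the second copy. A real change would also need to handle validity bitmaps and respect the coalescer's target batch size, which this sketch ignores.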