zhuqi-lucas commented on code in PR #8103: URL: https://github.com/apache/arrow-rs/pull/8103#discussion_r2267160776
########## arrow-select/src/coalesce.rs: ##########

```diff
@@ -198,15 +198,139 @@ impl BatchCoalescer {
     /// let expected_batch = record_batch!(("a", Int32, [1, 3, 4, 6])).unwrap();
     /// assert_eq!(completed_batch, expected_batch);
     /// ```
+    /// Zero-copy implementation using `compute_filter_plan`.
     pub fn push_batch_with_filter(
         &mut self,
         batch: RecordBatch,
-        filter: &BooleanArray,
+        predicate: &BooleanArray,
     ) -> Result<(), ArrowError> {
-        // TODO: optimize this to avoid materializing (copying the results
-        // of filter to a new batch)
-        let filtered_batch = filter_record_batch(&batch, filter)?;
-        self.push_batch(filtered_batch)
+        // First compute the filter plan based on the predicate
+        // (calls FilterBuilder::optimize internally)
+        let plan = compute_filter_plan(predicate);
+
+        match plan {
+            FilterPlan::None => {
+                // No rows selected
+                Ok(())
+            }
+            FilterPlan::All => {
+                // All rows selected: directly call push_batch (consumes batch)
+                self.push_batch(batch)
+            }
+            FilterPlan::Slices(slices) => {
+                // Consume the batch and set sources on in_progress arrays
+                let (_schema, arrays, _nrows) = batch.into_parts();
+                assert_eq!(arrays.len(), self.in_progress_arrays.len());
+
+                self.in_progress_arrays
+                    .iter_mut()
+                    .zip(arrays)
+                    .for_each(|(in_progress, array)| {
+                        in_progress.set_source(Some(array));
+                    });
+
+                // For each contiguous slice, copy rows in chunks fitting target_batch_size
+                for (mut start, end) in slices {
```

Review Comment:

Thank you @alamb for the review and the good suggestion. Profiling shows the hot path is `copy_rows`, and in particular the null-handling code:

```rust
// add nulls if necessary
if let Some(nulls) = s.nulls().as_ref() {
    let nulls = nulls.slice(offset, len);
    self.nulls.append_buffer(&nulls);
} else {
    self.nulls.append_n_non_nulls(len);
};
```

This is likely because we call `copy_rows` many more times here, whereas the original logic just filters and concatenates into a single batch and then calls `push_batch`, which is SIMD friendly; working on the bigger batch also keeps `copy_rows` itself SIMD friendly. As a result the original path is noticeably faster in most cases. Maybe we should only use this new logic when selectivity is very low (for example < 0.005), but it is hard for me to decide where to set the threshold.
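To make the low-selectivity threshold idea a bit more concrete, here is a minimal sketch of how a caller could dispatch on predicate selectivity. This is not the PR's implementation: the `LOW_SELECTIVITY_THRESHOLD` constant, the 0.005 value, and the `push_batch_choosing_path` wrapper are placeholders, while `true_count`, `filter_record_batch`, `push_batch`, and `push_batch_with_filter` are existing arrow APIs (crate paths may differ if you use the `arrow` facade crate):

```rust
// Sketch only: dispatch between the two paths based on predicate selectivity.
// `LOW_SELECTIVITY_THRESHOLD` and `push_batch_choosing_path` are hypothetical names.
use arrow_array::{Array, BooleanArray, RecordBatch};
use arrow_schema::ArrowError;
use arrow_select::coalesce::BatchCoalescer;
use arrow_select::filter::filter_record_batch;

/// Hypothetical cutoff below which the slice-based copy is assumed to win.
const LOW_SELECTIVITY_THRESHOLD: f64 = 0.005;

fn push_batch_choosing_path(
    coalescer: &mut BatchCoalescer,
    batch: RecordBatch,
    predicate: &BooleanArray,
) -> Result<(), ArrowError> {
    // Fraction of rows the predicate keeps (guard against an empty predicate)
    let selectivity = predicate.true_count() as f64 / predicate.len().max(1) as f64;

    if selectivity < LOW_SELECTIVITY_THRESHOLD {
        // Very few rows survive: copy just the selected slices instead of
        // materializing a nearly-empty filtered batch.
        coalescer.push_batch_with_filter(batch, predicate)
    } else {
        // Most rows survive: materialize once with the vectorized filter
        // kernel, then push one large, SIMD-friendly batch.
        let filtered = filter_record_batch(&batch, predicate)?;
        coalescer.push_batch(filtered)
    }
}
```

Computing `true_count` is just a popcount over the predicate buffer, so the dispatch check itself should be cheap compared to the row copying; the hard part is still picking a cutoff that holds up across data types and batch sizes.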