zhuqi-lucas commented on code in PR #8103: URL: https://github.com/apache/arrow-rs/pull/8103#discussion_r2267160776
########## arrow-select/src/coalesce.rs: ##########

```diff
@@ -198,15 +198,139 @@ impl BatchCoalescer {
     /// let expected_batch = record_batch!(("a", Int32, [1, 3, 4, 6])).unwrap();
     /// assert_eq!(completed_batch, expected_batch);
     /// ```
+    /// Zero-copy implementation using `compute_filter_plan`.
     pub fn push_batch_with_filter(
         &mut self,
         batch: RecordBatch,
-        filter: &BooleanArray,
+        predicate: &BooleanArray,
     ) -> Result<(), ArrowError> {
-        // TODO: optimize this to avoid materializing (copying the results
-        // of filter to a new batch)
-        let filtered_batch = filter_record_batch(&batch, filter)?;
-        self.push_batch(filtered_batch)
+        // First compute the filter plan based on the predicate
+        // (calls FilterBuilder::optimize internally)
+        let plan = compute_filter_plan(predicate);
+
+        match plan {
+            FilterPlan::None => {
+                // No rows selected
+                Ok(())
+            }
+            FilterPlan::All => {
+                // All rows selected: directly call push_batch (consumes batch)
+                self.push_batch(batch)
+            }
+            FilterPlan::Slices(slices) => {
+                // Consume the batch and set sources on in_progress arrays
+                let (_schema, arrays, _nrows) = batch.into_parts();
+                assert_eq!(arrays.len(), self.in_progress_arrays.len());
+
+                self.in_progress_arrays
+                    .iter_mut()
+                    .zip(arrays)
+                    .for_each(|(in_progress, array)| {
+                        in_progress.set_source(Some(array));
+                    });
+
+                // For each contiguous slice, copy rows in chunks fitting target_batch_size
+                for (mut start, end) in slices {
```

Review Comment:

Thank you @alamb for the review and the good suggestion. Profiling shows the hot path is `copy_rows`, and in particular the null-handling code:

```rust
// add nulls if necessary
if let Some(nulls) = s.nulls().as_ref() {
    let nulls = nulls.slice(offset, len);
    self.nulls.append_buffer(&nulls);
} else {
    self.nulls.append_n_non_nulls(len);
};
```

This is likely because we call `copy_rows` many more times here, whereas the original logic just filters and concatenates into a single batch and then calls `push_batch`, which is SIMD friendly; working on the bigger batch also keeps `copy_rows` itself SIMD friendly. As a result the original path is noticeably faster in most cases. Maybe we should only use this new logic when selectivity is very low (for example < 0.005), but it is hard for me to decide where to set the threshold.
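To make the low-selectivity threshold idea a bit more concrete, here is a minimal sketch of how a caller could dispatch on predicate selectivity. This is not the PR's implementation: the `LOW_SELECTIVITY_THRESHOLD` constant, the 0.005 value, and the `push_batch_choosing_path` wrapper are placeholders, while `true_count`, `filter_record_batch`, `push_batch`, and `push_batch_with_filter` are existing arrow APIs (crate paths may differ if you use the `arrow` facade crate):

```rust
// Sketch only: dispatch between the two paths based on predicate selectivity.
// `LOW_SELECTIVITY_THRESHOLD` and `push_batch_choosing_path` are hypothetical names.
use arrow_array::{Array, BooleanArray, RecordBatch};
use arrow_schema::ArrowError;
use arrow_select::coalesce::BatchCoalescer;
use arrow_select::filter::filter_record_batch;

/// Hypothetical cutoff below which the slice-based copy is assumed to win.
const LOW_SELECTIVITY_THRESHOLD: f64 = 0.005;

fn push_batch_choosing_path(
    coalescer: &mut BatchCoalescer,
    batch: RecordBatch,
    predicate: &BooleanArray,
) -> Result<(), ArrowError> {
    // Fraction of rows the predicate keeps (guard against an empty predicate)
    let selectivity = predicate.true_count() as f64 / predicate.len().max(1) as f64;

    if selectivity < LOW_SELECTIVITY_THRESHOLD {
        // Very few rows survive: copy just the selected slices instead of
        // materializing a nearly-empty filtered batch.
        coalescer.push_batch_with_filter(batch, predicate)
    } else {
        // Most rows survive: materialize once with the vectorized filter
        // kernel, then push one large, SIMD-friendly batch.
        let filtered = filter_record_batch(&batch, predicate)?;
        coalescer.push_batch(filtered)
    }
}
```

Computing `true_count` is just a popcount over the predicate buffer, so the dispatch check itself should be cheap compared to the row copying; the hard part is still picking a cutoff that holds up across data types and batch sizes.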