alamb commented on code in PR #6304:
URL: https://github.com/apache/arrow-rs/pull/6304#discussion_r1734546193


##########
arrow-select/src/filter.rs:
##########
@@ -1871,4 +1937,75 @@ mod tests {
             }
         }
     }
+

Review Comment:
   (I double checked and they are immediate above this -- 
   
   ```rust
       fn test_filter_union_array_sparse() {
   ...
   ```
   👍 



##########
arrow-select/src/filter.rs:
##########
@@ -166,7 +166,24 @@ pub fn prep_null_mask_filter(filter: &BooleanArray) -> 
BooleanArray {
 /// assert_eq!(c, &Int32Array::from(vec![5, 8]));
 /// ```
 pub fn filter(values: &dyn Array, predicate: &BooleanArray) -> 
Result<ArrayRef, ArrowError> {
-    let predicate = FilterBuilder::new(predicate).build();
+    let mut filter_builder = FilterBuilder::new(predicate);
+
+    fn multiple_arrays(data_type: &DataType) -> bool {

Review Comment:
   I agree a top level function would be better



##########
arrow-select/src/filter.rs:
##########
@@ -166,7 +166,24 @@ pub fn prep_null_mask_filter(filter: &BooleanArray) -> 
BooleanArray {
 /// assert_eq!(c, &Int32Array::from(vec![5, 8]));
 /// ```
 pub fn filter(values: &dyn Array, predicate: &BooleanArray) -> 
Result<ArrayRef, ArrowError> {
-    let predicate = FilterBuilder::new(predicate).build();
+    let mut filter_builder = FilterBuilder::new(predicate);
+
+    fn multiple_arrays(data_type: &DataType) -> bool {
+        match data_type {
+            DataType::Struct(fields) => {
+                fields.len() > 1 || fields.len() == 1 && 
multiple_arrays(fields[0].data_type())
+            }
+            DataType::Union(fields, UnionMode::Sparse) => !fields.is_empty(),
+            _ => false,
+        }
+    }
+
+    if multiple_arrays(values.data_type()) {
+        filter_builder = filter_builder.optimize();

Review Comment:
   I think the rationale is that "optimizing" the filter requires looking at 
the BooleanArray which itself requires non trivial time, so for certain 
operations the overhead of figuring out a better filter strategy takes more 
time than actually running it
   
   This is basically the same algorithm used in `filter_record_batch`: 
https://docs.rs/arrow-select/52.2.0/src/arrow_select/filter.rs.html#179-183
   
   Users who want to always optimize can use a `FilterBuilder` and explicitly 
call `optimize`  
https://docs.rs/arrow/latest/arrow/compute/struct.FilterBuilder.html#method.optimize
   
   I made a PR to try and clarify this in the docs 
https://github.com/apache/arrow-rs/pull/6317 
   
   



##########
arrow-array/src/cast.rs:
##########
@@ -815,6 +815,14 @@ pub trait AsArray: private::Sealed {
         self.as_struct_opt().expect("struct array")
     }
 
+    /// Downcast this to a [`UnionArray`] returning `None` if not possible

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to