alamb commented on code in PR #6304:
URL: https://github.com/apache/arrow-rs/pull/6304#discussion_r1734546193
##########
arrow-select/src/filter.rs:
##########
@@ -1871,4 +1937,75 @@ mod tests {
}
}
}
+
Review Comment:
(I double checked and they are immediate above this --
```rust
fn test_filter_union_array_sparse() {
...
```
👍
##########
arrow-select/src/filter.rs:
##########
@@ -166,7 +166,24 @@ pub fn prep_null_mask_filter(filter: &BooleanArray) ->
BooleanArray {
/// assert_eq!(c, &Int32Array::from(vec![5, 8]));
/// ```
pub fn filter(values: &dyn Array, predicate: &BooleanArray) ->
Result<ArrayRef, ArrowError> {
- let predicate = FilterBuilder::new(predicate).build();
+ let mut filter_builder = FilterBuilder::new(predicate);
+
+ fn multiple_arrays(data_type: &DataType) -> bool {
Review Comment:
I agree a top level function would be better
##########
arrow-select/src/filter.rs:
##########
@@ -166,7 +166,24 @@ pub fn prep_null_mask_filter(filter: &BooleanArray) ->
BooleanArray {
/// assert_eq!(c, &Int32Array::from(vec![5, 8]));
/// ```
pub fn filter(values: &dyn Array, predicate: &BooleanArray) ->
Result<ArrayRef, ArrowError> {
- let predicate = FilterBuilder::new(predicate).build();
+ let mut filter_builder = FilterBuilder::new(predicate);
+
+ fn multiple_arrays(data_type: &DataType) -> bool {
+ match data_type {
+ DataType::Struct(fields) => {
+ fields.len() > 1 || fields.len() == 1 &&
multiple_arrays(fields[0].data_type())
+ }
+ DataType::Union(fields, UnionMode::Sparse) => !fields.is_empty(),
+ _ => false,
+ }
+ }
+
+ if multiple_arrays(values.data_type()) {
+ filter_builder = filter_builder.optimize();
Review Comment:
I think the rationale is that "optimizing" the filter requires looking at
the BooleanArray which itself requires non trivial time, so for certain
operations the overhead of figuring out a better filter strategy takes more
time than actually running it
This is basically the same algorithm used in `filter_record_batch`:
https://docs.rs/arrow-select/52.2.0/src/arrow_select/filter.rs.html#179-183
Users who want to always optimize can use a `FilterBuilder` and explicitly
call `optimize`
https://docs.rs/arrow/latest/arrow/compute/struct.FilterBuilder.html#method.optimize
I made a PR to try and clarify this in the docs
https://github.com/apache/arrow-rs/pull/6317
##########
arrow-array/src/cast.rs:
##########
@@ -815,6 +815,14 @@ pub trait AsArray: private::Sealed {
self.as_struct_opt().expect("struct array")
}
+ /// Downcast this to a [`UnionArray`] returning `None` if not possible
Review Comment:
👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]