Re: [PR] Fix count(null) and count(distinct null) [arrow-datafusion]

via GitHub Tue, 12 Dec 2023 14:32:03 -0800


alamb commented on code in PR #8511:
URL: https://github.com/apache/arrow-datafusion/pull/8511#discussion_r1424643988



##########
datafusion/physical-expr/src/aggregate/count.rs:
##########
@@ -198,16 +198,18 @@ fn null_count_for_multiple_cols(values: &[ArrayRef]) -> 
usize {
     if values.len() > 1 {
         let result_bool_buf: Option<BooleanBuffer> = values
             .iter()
-            .map(|a| a.nulls())
+            .map(|a| a.logical_nulls())

Review Comment:
   I am slightly worried about the need to allocate a new null buffer each 
time, even for arrays when we could just use the exising one
   
   This is particularly concerning given this is on the critical path in 
aggregates
   
   I reviewed the `logical_nulls` method -- 
   
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls 
and I see the issue is that it returns an owned Option
   
   What would you think about implemeting a method in `DataFusion` that avoids 
the copy if it is not necessary, like
   ```rust
   fn logical_nulls(arr: &ArrayRef) -> Cow<'_, Option<BooleanBuffer>> {
     
   }
   ```
   
   That only creates the nulll buffer for `NullArrays`? 
   
   Then we can propose upstreaming that back to arrow-rs to avoid the potential 
performance issue
   
   I know the Cow thing is not always the easiest to make happen -- if you need 
help I can try and find time to help code it up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Fix count(null) and count(distinct null) [arrow-datafusion]

Reply via email to