Re: [PR] feat: simplify count distinct logical plan [datafusion]

via GitHub Sat, 26 Apr 2025 22:49:41 -0700


jayzhan211 commented on PR #15867:
URL: https://github.com/apache/datafusion/pull/15867#issuecomment-2833182081


   ```rust
       fn state(&mut self) -> Result<Vec<ScalarValue>> {
           let scalars = self.values.iter().cloned().collect::<Vec<_>>();
           let arr =
               ScalarValue::new_list_nullable(scalars.as_slice(), &self.state_data_type);
           Ok(vec![ScalarValue::List(arr)])
       }
   ```
   
   We clone the hash set built during partial aggregation and convert it to a
   List for the final aggregation. In the high-cardinality case, where most of
   the values are distinct, we effectively aggregate twice and pay for an
   additional clone.
   
   I can think of two possible solutions.
   
   1. Use a single aggregation, but somehow run it in parallel.
   We can try converting it to a single aggregation and see whether it is
   faster than the current version.
   
   2. Find a way to avoid cloning the hash set into a list array, and instead
   initialize the accumulator with the hash set in the final aggregation state.
   Ensuring zero copy all the way down to the final aggregation is probably not
   trivial.
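
To make the cost concrete, here is a minimal sketch of the two hand-off styles using only the standard library. `DistinctAccumulator` and all of its methods are hypothetical stand-ins, not DataFusion's accumulator or API. The sketch also assumes the partial state can be passed to the final stage in memory, which the quoted `state()` signature (returning `Vec<ScalarValue>`) does not allow as written.

```rust
use std::collections::HashSet;
use std::mem;

/// Toy stand-in for a distinct-count accumulator (hypothetical type).
#[derive(Default)]
struct DistinctAccumulator {
    values: HashSet<i64>,
}

impl DistinctAccumulator {
    /// Partial phase: deduplicate raw input values.
    fn update_batch(&mut self, batch: &[i64]) {
        self.values.extend(batch.iter().copied());
    }

    /// Mirrors the quoted `state()`: every value is copied into a new list.
    fn state_cloned(&self) -> Vec<i64> {
        self.values.iter().cloned().collect()
    }

    /// Hypothetical alternative: hand the set over and leave an empty one behind.
    fn take_state(&mut self) -> HashSet<i64> {
        mem::take(&mut self.values)
    }

    /// Final phase over a list: every element is hashed a second time.
    fn merge_list(&mut self, state: &[i64]) {
        self.values.extend(state.iter().copied());
    }

    /// Final phase over a set: adopt the first set as-is, rehash only later ones.
    fn merge_set(&mut self, state: HashSet<i64>) {
        if self.values.is_empty() {
            self.values = state;
        } else {
            self.values.extend(state);
        }
    }

    /// Number of distinct values seen so far.
    fn evaluate(&self) -> usize {
        self.values.len()
    }
}
```

With `state_cloned` plus `merge_list`, a high-cardinality input is hashed once per value in the partial phase and again in the final phase, on top of the copy. With `take_state` plus `merge_set`, the first partial set is adopted without rehashing, but any further sets still have to be merged element by element, so this only sketches the direction of option 2 rather than a full zero-copy path.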


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

