alamb commented on issue #9586:
URL: https://github.com/apache/arrow-datafusion/issues/9586#issuecomment-1999356266

   I have a theory about what is wrong 
   
   I think this change from https://github.com/apache/arrow-datafusion/pull/9234
   
   
https://github.com/apache/arrow-datafusion/pull/9234/files#diff-a292ca8deeaaff7ba19bad4adf609c476ff383db56249f75cb6aeab77e887744R245-R248
   
   effectively skips all but the first intermediate result when combining the
intermediate results together.
   
   Here is the specific code on main that I think should look at all elements
in the array:
   
   
https://github.com/apache/arrow-datafusion/blob/3c26e597aeacde0a5e6a51f30394d3d31c6acd96/datafusion/physical-expr/src/aggregate/count_distinct/mod.rs#L256
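   To make the theory concrete, here is a minimal std-only sketch of the
suspected behavior. It is not the DataFusion code: `PartialState`,
`merge_first_only` and `merge_all` are hypothetical names, and each
intermediate result is modeled as a plain `HashSet<i64>`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for one intermediate result: the distinct values
/// seen by one partial aggregation.
type PartialState = HashSet<i64>;

/// Suspected behavior: only the first intermediate result is folded in,
/// the remaining ones are silently ignored.
fn merge_first_only(acc: &mut HashSet<i64>, states: &[PartialState]) {
    if let Some(first) = states.first() {
        acc.extend(first.iter().copied());
    }
}

/// Expected behavior: every intermediate result is folded in.
fn merge_all(acc: &mut HashSet<i64>, states: &[PartialState]) {
    for state in states {
        acc.extend(state.iter().copied());
    }
}
```

   If the theory holds, the distinct count is too low whenever a later
intermediate result holds a value that the first one lacks.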
   
   So a reproducer would need:
   1. A `GROUP BY`
   2. More than one target partition (to trigger a repartitioned group by)
   3. Input partitions that do not each contain all the distinct values
   
   I think the best idea here is to write a fuzz test that triggers the issue
by feeding in random inputs. We can potentially follow the model here:
   
   
https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs
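   As a rough sketch of the shape such a fuzz test could take, the std-only
model below generates random (group, value) rows, splits them across several
partitions, and compares a single-pass distinct count per group against a
two-phase (partial, then merge) count. Everything here is an assumption made
for illustration: it does not use the DataFusion APIs or the harness in
`aggregate_fuzz.rs`, and `XorShift`, `random_partitions`, `single_pass` and
`two_phase` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Random (group, value) rows, dealt round-robin into `partitions` inputs so
/// that no single partition is guaranteed to see every distinct value.
fn random_partitions(seed: u64, rows: usize, partitions: usize) -> Vec<Vec<(u8, i64)>> {
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    let mut out = vec![Vec::new(); partitions];
    for i in 0..rows {
        let group = (rng.next_u64() % 3) as u8;
        let value = (rng.next_u64() % 10) as i64;
        out[i % partitions].push((group, value));
    }
    out
}

/// Reference answer: distinct values per group over all rows at once.
fn single_pass(parts: &[Vec<(u8, i64)>]) -> HashMap<u8, usize> {
    let mut sets: HashMap<u8, HashSet<i64>> = HashMap::new();
    for &(g, v) in parts.iter().flatten() {
        sets.entry(g).or_default().insert(v);
    }
    sets.into_iter().map(|(g, s)| (g, s.len())).collect()
}

/// Two-phase answer: one partial set per group per partition, then a merge
/// that folds in every partial set.
fn two_phase(parts: &[Vec<(u8, i64)>]) -> HashMap<u8, usize> {
    let mut merged: HashMap<u8, HashSet<i64>> = HashMap::new();
    for part in parts {
        let mut partial: HashMap<u8, HashSet<i64>> = HashMap::new();
        for &(g, v) in part {
            partial.entry(g).or_default().insert(v);
        }
        for (g, s) in partial {
            merged.entry(g).or_default().extend(s);
        }
    }
    merged.into_iter().map(|(g, s)| (g, s.len())).collect()
}
```

   A real fuzz test would presumably run the same `COUNT(DISTINCT ...)` query
with one target partition and with several, and assert that the results match
for many random seeds, in the same way `single_pass` and `two_phase` are
compared here.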
   
   I will try to do so later today if no one beats me to it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

