alamb commented on issue #9586: URL: https://github.com/apache/arrow-datafusion/issues/9586#issuecomment-1999356266
I have a theory about what is wrong.

I think this change from https://github.com/apache/arrow-datafusion/pull/9234 (https://github.com/apache/arrow-datafusion/pull/9234/files#diff-a292ca8deeaaff7ba19bad4adf609c476ff383db56249f75cb6aeab77e887744R245-R248) effectively skips all but the first intermediate result when combining data together.

Here is the code on main, specifically, that I think should look at all elements in the array: https://github.com/apache/arrow-datafusion/blob/3c26e597aeacde0a5e6a51f30394d3d31c6acd96/datafusion/physical-expr/src/aggregate/count_distinct/mod.rs#L256

So to find a reproducer it would need:

1. A `GROUP BY`
2. More than 1 target partition (to trigger repartitioned group by)
3. Input partitions that don't each have all the distinct values

I think the best idea here is to make a fuzz test that triggers the issue (randomly send inputs). We can potentially follow the model here: https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs

I will try to do so later today if no one beats me to it.
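
To make the theory concrete, here is a minimal sketch of the suspected failure mode. It is not the DataFusion code: `PartialState`, `merge_first_only`, and `merge_all` are hypothetical names, and a plain `Vec<i64>` stands in for the per-partition intermediate state of one group. It only shows why looking at the first intermediate result undercounts when the partitions hold different distinct values.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the partial COUNT(DISTINCT) state that one
/// input partition produces for a single group (assumption, not the real type).
type PartialState = Vec<i64>;

/// Suspected buggy behavior: only the first intermediate state is merged,
/// so distinct values that appear only in later states are lost.
fn merge_first_only(states: &[PartialState]) -> usize {
    let mut seen: HashSet<i64> = HashSet::new();
    if let Some(first) = states.first() {
        seen.extend(first.iter().copied());
    }
    seen.len()
}

/// Expected behavior: every intermediate state contributes its values.
fn merge_all(states: &[PartialState]) -> usize {
    let mut seen: HashSet<i64> = HashSet::new();
    for state in states {
        seen.extend(state.iter().copied());
    }
    seen.len()
}
```

With three partial states `[1, 2]`, `[2, 3]`, and `[4]`, `merge_all` counts 4 distinct values while `merge_first_only` counts 2. When every partition already holds all the distinct values the two agree, which is why the reproducer needs partitions that do not.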
