alamb opened a new issue, #12114: URL: https://github.com/apache/datafusion/issues/12114
### Is your feature request related to a problem or challenge?

While reviewing https://github.com/apache/datafusion/pull/11943 from @Rachelint it is becoming clear to me that the hash aggregate code is now pretty sophisticated and I am not sure our testing has kept up.

In fact I couldn't come up with a great way to systematically test the new code added in https://github.com/apache/datafusion/pull/11943

Also, the code in https://github.com/apache/datafusion/pull/11627 from @korowa for skipping partial aggregates has a similar problem, as it is not invoked.

There is also code for streaming and partial streaming group by.

All this code has unit tests, but I am not confident that all the combinations are checked. For example, the code paths are affected by:

1. Sort order of the input
2. Partitioning of the input
3. The type of the group keys
4. The number of groups
5. The number of rows in each group
6. The type of the aggregate
7. The number of aggregates
8. If the aggregate supports group aggregation
9. If the groups aggregator supports partial aggregation skipping

### Describe the solution you'd like

I would like a more systematic way to test this code, to ensure our current code is correct but also to ensure that future changes do not introduce subtle, hard to debug regressions / wrong results.

### Describe alternatives you've considered

What I think would be good would be to:

1. Describe an input data set (e.g. RecordBatches)
2. Run the same query on the same input data set with different configurations (e.g. block size, input sort order, distribution of input blocks, etc)
3. Compare the results and ensure they are the same in all cases

Parameters to vary:

1. Sort order of the input
2. Target block size
3. Number of input partitions
4. Memory limit (to force spilling)
5. Shuffled input row distribution across blocks
6. Whether skipping partial aggregation is enabled or not

Test cases:

1. Types of the group keys
2. Single/multiple column groups
3. Number of groups (low/high cardinality)
4. Different aggregates

### Additional context

_No response_
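To make the "same query, same data, different configurations" idea concrete, here is a minimal sketch in plain Rust. It does not use any DataFusion API: `Row`, `Config`, `run_query`, `all_configs`, `make_input` and `check_all_configs` are all hypothetical names, and the toy two-phase `SUM ... GROUP BY` merely stands in for the real hash aggregate. Memory limits and spilling are left out.

```rust
use std::collections::{BTreeMap, HashMap};

/// One input row: (group key, value). A stand-in for a row of a RecordBatch.
type Row = (u32, i64);

/// Hypothetical knobs mirroring some of the parameters listed above.
/// `batch_size` and `partitions` must be greater than zero.
#[derive(Debug, Clone, Copy)]
struct Config {
    batch_size: usize,
    partitions: usize,
    sorted_input: bool,
    skip_partial: bool,
}

/// Toy two-phase `SELECT key, SUM(value) ... GROUP BY key`.
fn run_query(input: &[Row], cfg: Config) -> BTreeMap<u32, i64> {
    let mut rows = input.to_vec();
    if cfg.sorted_input {
        rows.sort();
    }
    // Distribute rows round-robin across the input partitions.
    let mut parts: Vec<Vec<Row>> = vec![Vec::new(); cfg.partitions];
    for (i, row) in rows.into_iter().enumerate() {
        parts[i % cfg.partitions].push(row);
    }
    // Partial phase: aggregate each batch on its own, or pass the rows
    // through untouched when partial aggregation is skipped.
    let mut partials: Vec<Row> = Vec::new();
    for part in &parts {
        for batch in part.chunks(cfg.batch_size) {
            if cfg.skip_partial {
                partials.extend_from_slice(batch);
            } else {
                let mut acc: HashMap<u32, i64> = HashMap::new();
                for &(key, value) in batch {
                    *acc.entry(key).or_insert(0) += value;
                }
                partials.extend(acc);
            }
        }
    }
    // Final phase: merge the partial results. A BTreeMap gives an ordered,
    // directly comparable result.
    let mut out = BTreeMap::new();
    for (key, value) in partials {
        *out.entry(key).or_insert(0) += value;
    }
    out
}

/// Cross product of all the knobs.
fn all_configs() -> Vec<Config> {
    let mut configs = Vec::new();
    for &batch_size in &[1, 3, 8192] {
        for &partitions in &[1, 4] {
            for &sorted_input in &[false, true] {
                for &skip_partial in &[false, true] {
                    configs.push(Config { batch_size, partitions, sorted_input, skip_partial });
                }
            }
        }
    }
    configs
}

/// Deterministic input with `num_groups` distinct keys in scrambled order.
fn make_input(num_rows: u32, num_groups: u32) -> Vec<Row> {
    (0..num_rows)
        .map(|i| ((i * 7919) % num_groups, i as i64 - 50))
        .collect()
}

/// Runs the query under every configuration and reports the first one
/// whose result differs from the baseline (the first configuration).
fn check_all_configs(input: &[Row]) -> Result<(), String> {
    let configs = all_configs();
    let expected = run_query(input, configs[0]);
    for cfg in &configs[1..] {
        if run_query(input, *cfg) != expected {
            return Err(format!("results differ for {:?}", cfg));
        }
    }
    Ok(())
}
```

Calling `check_all_configs(&make_input(1000, 10))` from a test exercises all 24 configurations against the same data. A real harness would presumably build RecordBatches, run the query through DataFusion with different session settings, and compare the sorted output batches in the same way.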