[I] Aggregation fuzz testing [datafusion]

via GitHub Thu, 22 Aug 2024 08:59:52 -0700


alamb opened a new issue, #12114:
URL: https://github.com/apache/datafusion/issues/12114


   ### Is your feature request related to a problem or challenge?
   
   While reviewing  https://github.com/apache/datafusion/pull/11943 from 
@Rachelint it is becoming clear to me that the hash aggregate code is now 
pretty sophisticated and I am not sure our testing has kept up. In fact I 
couldn't come up with a great way to systematically test the new code added in 
https://github.com/apache/datafusion/pull/11943
   
   Also, the code in https://github.com/apache/datafusion/pull/11627 from 
@korowa for skipping partial aggregates has a similar problem as it is not 
invoked. There is also code for streaming and partial streaming group by.
   
   All this code has unit tests, but I am not confident that all the 
combinations are checked. For example the code paths are affected by:
   
   1. Sort order of the input
   2. Partitioning of the input
   3. The type of the group keys
   4. The number of groups
   5. The number of rows in each group
   6. The type of the aggregate
   7. The number of aggregates
   8. If the aggregate supports group aggregation
   9. If the groups aggregator supports partial aggregation skipping
   
   
   
   
   
   ### Describe the solution you'd like
   
   I would like a more systematic way to test this code to ensure our current 
code is correct but also to ensure that future changes do not introduce subtle, 
hard-to-debug regressions or wrong results.
   
   ### Describe alternatives you've considered
   
   
   What I think would be good would be to:
   1. Describe an input data set (e.g. RecordBatches)
   2. Run the same query on the same input data set with different 
configurations (e.g. block size, input sort order, distribution of input 
blocks, etc)
   3. Compare the results and ensure they are the same in all cases
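
The three steps above can be sketched on a toy model that uses only the Rust standard library and no DataFusion APIs. Here the rows are `(key, value)` pairs, the "query" is `SUM(value) GROUP BY key` evaluated as partial aggregation per batch followed by a final merge, and a configuration is a batch size, a partition count, and an input order. `RunConfig`, `grouped_sum`, and `find_mismatches` are hypothetical names chosen for illustration only.

```rust
use std::collections::BTreeMap;

/// Hypothetical knobs for one run; stand-ins for settings such as
/// block size, number of input partitions, and input order.
#[derive(Debug, Clone, Copy)]
struct RunConfig {
    batch_size: usize,
    partitions: usize,
    reverse_input: bool,
}

/// Toy "query": SUM(value) GROUP BY key, computed as a partial
/// aggregation per batch followed by a final merge.
fn grouped_sum(rows: &[(u32, i64)], cfg: RunConfig) -> BTreeMap<u32, i64> {
    let mut input: Vec<(u32, i64)> = rows.to_vec();
    if cfg.reverse_input {
        input.reverse();
    }

    // Distribute rows round-robin across the partitions.
    let mut parts: Vec<Vec<(u32, i64)>> = vec![Vec::new(); cfg.partitions.max(1)];
    let n = parts.len();
    for (i, row) in input.into_iter().enumerate() {
        parts[i % n].push(row);
    }

    // Partial aggregation: one intermediate map per batch.
    let mut partials: Vec<BTreeMap<u32, i64>> = Vec::new();
    for part in &parts {
        for batch in part.chunks(cfg.batch_size.max(1)) {
            let mut partial = BTreeMap::new();
            for &(key, value) in batch {
                *partial.entry(key).or_insert(0) += value;
            }
            partials.push(partial);
        }
    }

    // Final aggregation: merge all partial states.
    let mut result: BTreeMap<u32, i64> = BTreeMap::new();
    for partial in partials {
        for (key, sum) in partial {
            *result.entry(key).or_insert(0) += sum;
        }
    }
    result
}

/// Runs the query under every configuration and returns the ones whose
/// output differs from the first (baseline) configuration.
fn find_mismatches(rows: &[(u32, i64)], configs: &[RunConfig]) -> Vec<RunConfig> {
    let (first, rest) = match configs.split_first() {
        Some(split) => split,
        None => return Vec::new(),
    };
    let baseline = grouped_sum(rows, *first);
    rest.iter()
        .copied()
        .filter(|cfg| grouped_sum(rows, *cfg) != baseline)
        .collect()
}
```

A test would assert that `find_mismatches` returns an empty list. In a real harness the toy `grouped_sum` would be replaced by running the actual query through the engine under each configuration.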
   
   Parameters to vary:
   1. Sort order of the input
   2. Target block size
   3. Number of input partitions
   4. Memory limit (to force spilling)
   5. Shuffled input row distribution across blocks
   6. Whether skipping partial aggregation is enabled
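
One way to cover these parameters systematically is to enumerate their cartesian product and run the query once per combination. The sketch below is a minimal illustration of that idea. `FuzzParams` and its field names are hypothetical and do not correspond to actual DataFusion configuration keys.

```rust
/// Hypothetical description of one fuzz run. Each field mirrors one
/// item of the parameter list; none is a real DataFusion setting name.
#[derive(Debug, Clone, PartialEq)]
struct FuzzParams {
    sorted_input: bool,
    target_batch_size: usize,
    input_partitions: usize,
    /// `None` means unlimited; a small limit is meant to force spilling.
    memory_limit_bytes: Option<usize>,
    shuffled_row_distribution: bool,
    skip_partial_aggregation: bool,
}

/// Builds every combination of the supplied values and the boolean flags.
fn all_params(
    batch_sizes: &[usize],
    partitions: &[usize],
    memory_limits: &[Option<usize>],
) -> Vec<FuzzParams> {
    let flags = [false, true];
    let mut out = Vec::new();
    for &sorted_input in &flags {
        for &target_batch_size in batch_sizes {
            for &input_partitions in partitions {
                for &memory_limit_bytes in memory_limits {
                    for &shuffled_row_distribution in &flags {
                        for &skip_partial_aggregation in &flags {
                            out.push(FuzzParams {
                                sorted_input,
                                target_batch_size,
                                input_partitions,
                                memory_limit_bytes,
                                shuffled_row_distribution,
                                skip_partial_aggregation,
                            });
                        }
                    }
                }
            }
        }
    }
    out
}
```

The number of combinations grows multiplicatively, so a fuzz harness may prefer to sample from this space at random rather than enumerate it exhaustively.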
   
   Test cases:
   1. Types of the group keys
   2. Single/multiple column groups
   3. Number of groups (low/high cardinality)
   4. Different aggregates
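
As a rough sketch of how such test inputs might be produced, the generator below creates rows with a configurable number of key columns and an upper bound on the number of distinct groups, from a seed so that failures can be reproduced. It uses a small linear congruential generator instead of an external crate. `Lcg`, `generate_rows`, and `distinct_keys` are hypothetical names, and integer keys stand in for the wider range of key types a real test would need.

```rust
use std::collections::HashSet;

/// Minimal deterministic pseudo-random generator (an LCG), used here
/// only so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // The high bits of an LCG are of better quality than the low bits.
        self.0 >> 33
    }
}

/// Generates `num_rows` rows of (group key, value). The key has
/// `key_columns` columns and takes at most `num_groups` distinct values.
fn generate_rows(
    seed: u64,
    num_rows: usize,
    num_groups: u64,
    key_columns: usize,
) -> Vec<(Vec<u64>, i64)> {
    let mut rng = Lcg(seed);
    let groups = num_groups.max(1);
    let columns = key_columns.max(1) as u64;
    (0..num_rows)
        .map(|_| {
            let group = rng.next_u64() % groups;
            // Derive every key column from the group id, so distinct
            // group ids always map to distinct keys.
            let key: Vec<u64> = (0..columns)
                .map(|col| group.wrapping_mul(31).wrapping_add(col))
                .collect();
            let value = (rng.next_u64() % 1000) as i64 - 500;
            (key, value)
        })
        .collect()
}

/// Counts the distinct group keys in the generated rows.
fn distinct_keys(rows: &[(Vec<u64>, i64)]) -> usize {
    rows.iter().map(|(key, _)| key).collect::<HashSet<_>>().len()
}
```

Calling `generate_rows` with a small `num_groups` gives a low-cardinality input and a large `num_groups` gives a high-cardinality one, while the same seed always yields the same rows.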
   
   ### Additional context
   
   _No response_

