sqlwindspeaker opened a new pull request #30836: URL: https://github.com/apache/spark/pull/30836
### What changes were proposed in this pull request?

As described in SPARK-33791, this PR adds an option that lets users select the legacy Hive-compatible `grouping__id` algorithm.

### Why are the changes needed?

See https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup

Hive's `grouping__id` algorithm changed between versions < 2.3 and >= 2.3. Spark currently behaves the same as Hive >= 2.3, so users coming from legacy Hive (mainly Hive 1.x) may get incorrect data when they migrate their queries directly to Spark.

### Does this PR introduce _any_ user-facing change?

Yes. For SQL like:

`select col1, col2, col3, GROUPING__ID, count(*) from ( VALUES ('aaa', '123', 'kkk'), ('aaa', '234', 'kkk'), ('aaa', '234', 'kkk'), ('aaa', '123', 'kkk') ) AS t (col1, col2, col3) group by col1, col2, col3 with cube order by col1, col2, col3`

With the Spark default, the result is:

| col1 | col2 | col3 | `GROUPING__ID` | `count(*)` |
|------|------|------|----------------|------------|
| NULL | NULL | NULL | 7 | 4 |
| NULL | NULL | kkk  | 6 | 4 |
| NULL | 123  | NULL | 5 | 2 |
| NULL | 123  | kkk  | 4 | 2 |
| NULL | 234  | NULL | 5 | 2 |
| NULL | 234  | kkk  | 4 | 2 |
| aaa  | NULL | NULL | 3 | 4 |
| aaa  | NULL | kkk  | 2 | 4 |
| aaa  | 123  | NULL | 1 | 2 |
| aaa  | 123  | kkk  | 0 | 2 |
| aaa  | 234  | NULL | 1 | 2 |
| aaa  | 234  | kkk  | 0 | 2 |

When Hive legacy mode is enabled, the result is:

| col1 | col2 | col3 | `GROUPING__ID` | `count(*)` |
|------|------|------|----------------|------------|
| NULL | NULL | NULL | 0 | 4 |
| NULL | NULL | kkk  | 4 | 4 |
| NULL | 123  | NULL | 2 | 2 |
| NULL | 123  | kkk  | 6 | 2 |
| NULL | 234  | NULL | 2 | 2 |
| NULL | 234  | kkk  | 6 | 2 |
| aaa  | NULL | NULL | 1 | 4 |
| aaa  | NULL | kkk  | 5 | 4 |
| aaa  | 123  | NULL | 3 | 2 |
| aaa  | 123  | kkk  | 7 | 2 |
| aaa  | 234  | NULL | 3 | 2 |
| aaa  | 234  | kkk  | 7 | 2 |

### How was this patch tested?

Test cases are added.
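
For reviewers who want to check the two result sets by hand, here is a minimal Python sketch of the two bit layouts. It is inferred from the example output in this description, not taken from the patch, and the function names are hypothetical. It assumes the default mode sets a bit for each column that is aggregated away, with the first `GROUP BY` column as the most significant bit, while the legacy mode sets a bit for each column that is kept, with the first column as the least significant bit.

```python
def spark_grouping_id(present):
    """Default-style id (assumed): a bit is 1 when the column is NOT in the
    grouping set; the first GROUP BY column is the most significant bit."""
    n = len(present)
    gid = 0
    for i, is_present in enumerate(present):
        if not is_present:
            gid |= 1 << (n - 1 - i)
    return gid


def hive_legacy_grouping_id(present):
    """Legacy-style id (assumed): a bit is 1 when the column IS in the
    grouping set; the first GROUP BY column is the least significant bit."""
    gid = 0
    for i, is_present in enumerate(present):
        if is_present:
            gid |= 1 << i
    return gid


if __name__ == "__main__":
    # (col1, col2, col3): True means the column is part of the grouping set.
    for present in [(False, False, False), (True, False, False), (True, True, True)]:
        print(present, spark_grouping_id(present), hive_legacy_grouping_id(present))
```

For the row `aaa NULL NULL` (only `col1` kept), the sketch gives 3 in the default layout and 1 in the legacy layout, matching the example results.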
