[GitHub] [spark] sqlwindspeaker opened a new pull request #30836: [SPARK-33791] Support hive legacy grouping id algorithm

GitBox Thu, 17 Dec 2020 23:07:32 -0800


sqlwindspeaker opened a new pull request #30836:
URL: https://github.com/apache/spark/pull/30836



   ### What changes were proposed in this pull request?
   
   As described in SPARK-33791, this PR adds an option that lets users opt in 
to the legacy Hive-compatible `GROUPING__ID` algorithm.
   
   
   ### Why are the changes needed?
   See 
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup
   
   Hive changed its `GROUPING__ID` algorithm in version 2.3. Spark currently 
behaves the same as Hive >= 2.3, so users coming from legacy Hive (mainly 
Hive 1.x) can get different `GROUPING__ID` values, and therefore incorrect 
data, when they migrate their queries directly to Spark.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. For a query such as:
   
   `select col1, col2, col3, GROUPING__ID, count(*) 
   from (
   VALUES ('aaa', '123', 'kkk'), ('aaa', '234', 'kkk'), ('aaa', '234', 'kkk'), 
('aaa', '123', 'kkk') 
   ) AS t (col1, col2, col3) 
   group by col1, col2, col3 
   with cube 
   order by col1, col2, col3`
   
   With Spark's default behavior, the result is (columns: col1, col2, col3, GROUPING__ID, count):
   
   > NULL    NULL    NULL    7       4
   NULL    NULL    kkk     6       4
   NULL    123     NULL    5       2
   NULL    123     kkk     4       2
   NULL    234     NULL    5       2
   NULL    234     kkk     4       2
   aaa     NULL    NULL    3       4
   aaa     NULL    kkk     2       4
   aaa     123     NULL    1       2
   aaa     123     kkk     0       2
   aaa     234     NULL    1       2
   aaa     234     kkk     0       2
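
The default values above are consistent with a bitmask in which each grouping column contributes one bit, the first column is the most significant bit, and a bit is 1 when that column is aggregated away (shown as NULL). The following is a minimal sketch inferred only from this result table, not taken from Spark's source; `grouping_id_default` is a hypothetical helper name, and it assumes NULL always means "aggregated away", which holds for this data because the input has no real NULLs.

```python
def grouping_id_default(row):
    """Reproduce the default GROUPING__ID values in the table above.

    Assumption: None marks a column that was aggregated away.
    The first column maps to the most significant bit.
    """
    gid = 0
    for value in row:
        # Shift previous bits left, then set the new low bit if aggregated.
        gid = (gid << 1) | (1 if value is None else 0)
    return gid


if __name__ == "__main__":
    for row in [(None, None, None), (None, None, "kkk"), ("aaa", "123", "kkk")]:
        print(row, grouping_id_default(row))
```

For example, the row `(NULL, NULL, kkk)` has col1 and col2 aggregated, giving binary `110`, which is 6.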
   
   When Hive legacy mode is enabled, the result is:
   
   > NULL    NULL    NULL    0       4
   NULL    NULL    kkk     4       4
   NULL    123     NULL    2       2
   NULL    123     kkk     6       2
   NULL    234     NULL    2       2
   NULL    234     kkk     6       2
   aaa     NULL    NULL    1       4
   aaa     NULL    kkk     5       4
   aaa     123     NULL    3       2
   aaa     123     kkk     7       2
   aaa     234     NULL    3       2
   aaa     234     kkk     7       2
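
The legacy values above are consistent with the opposite convention: a bit is 1 when the column is present in the grouping (non-NULL in the output), and the first column is the least significant bit. The sketch below is inferred only from this result table and is not the PR's implementation; `grouping_id_legacy` is a hypothetical helper name, and it assumes NULL always means "aggregated away", which holds for this data.

```python
def grouping_id_legacy(row):
    """Reproduce the legacy-mode GROUPING__ID values in the table above.

    Assumption: None marks a column that was aggregated away.
    The first column maps to the least significant bit, and a bit is
    set when the column is kept in the grouping.
    """
    gid = 0
    for position, value in enumerate(row):
        if value is not None:
            gid |= 1 << position
    return gid


if __name__ == "__main__":
    for row in [(None, None, None), (None, None, "kkk"), ("aaa", "123", "kkk")]:
        print(row, grouping_id_legacy(row))
```

For example, the row `(aaa, NULL, kkk)` keeps col1 (bit 0) and col3 (bit 2), giving 1 + 4 = 5. Because both the bit meaning and the bit order differ from the default, a query that filters or joins on a literal `GROUPING__ID` value selects different rows under the two modes.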
   
   
   ### How was this patch tested?
   New test cases are added.


