[ https://issues.apache.org/jira/browse/HIVE-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321842#comment-16321842 ]
Prasanth Jayachandran commented on HIVE-18359:
----------------------------------------------

[~kgyrtkirk] The assumption around handling the empty grouping set (introduced in HIVE-17617) seems to be incorrect. I can see some assumptions to handle the following cases:
1) Empty table
2) Non-empty table

For case 1), we should emit the summary row from only ONE reducer. For picking a single reducer, we can use the task ID = 0 logic (although I would prefer the non-Tez logic, Utilities.getTaskId()).

For case 2), processOp already handles the summary row, and the summary row can end up in any reducer. So closeOp() should never emit a summary row when at least 1 row is emitted by the mapper. Currently this is broken because it relies on the hasOutput flag.

Say we have 3 reducers R0, R1 and R2, and a table with 1 column and 1 row with value 'a'. If we do a rollup on this column we expect {(a), (null)}, with (null) being the summary row. Let's say that, based on (a)'s hashcode, it ends up in R1 and, based on (null)'s hashcode, it ends up in R2. Reducer R0 will have its hasOutput flag set to false because it did not receive any rows, so closeOp of R0 will emit a summary row (it happens that the task ID is also 0 in this case). So we will end up with a duplicate summary row in this case (null emitted by R0 and null emitted by R2).

This is manifesting only now because hashcodes have changed for the key (we are using LongWritable for groupingId now). I am guessing this bug will also show up, even without this patch, when there are a sufficient number of rows and reducers in the testcase.

Does that make sense? Feel free to correct me if my understanding of HIVE-17617 is wrong.

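
To make the R0/R1/R2 scenario concrete, here is a minimal standalone sketch. It is not the Hive operator code: the class and method names are hypothetical, rows are modelled as plain strings, and the "did any mapper emit a row" check is computed globally, which is an assumption made only so the two behaviours can be compared side by side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the scenario above; not the actual Hive GroupByOperator.
public class SummaryRowSketch {
    public static final String SUMMARY = "(null)";

    // Behaviour as described in the comment: each reducer looks only at its own
    // hasOutput flag, and task 0 emits a summary row in closeOp if it saw no rows.
    public static List<String> closeWithLocalFlag(List<List<String>> rowsPerReducer) {
        List<String> emitted = new ArrayList<>();
        for (int taskId = 0; taskId < rowsPerReducer.size(); taskId++) {
            List<String> rows = rowsPerReducer.get(taskId);
            emitted.addAll(rows);                 // processOp: forward rows received
            boolean hasOutput = !rows.isEmpty();
            if (!hasOutput && taskId == 0) {
                emitted.add(SUMMARY);             // closeOp: extra summary row
            }
        }
        return emitted;
    }

    // Suggested behaviour: closeOp emits the summary row only when the mappers
    // emitted no rows at all (empty table), and only from task 0.
    // ASSUMPTION: the global check is visible to the reducer; how a real reducer
    // would learn this is not specified in the comment.
    public static List<String> closeWithGlobalCheck(List<List<String>> rowsPerReducer) {
        boolean mapperEmittedRows = false;
        for (List<String> rows : rowsPerReducer) {
            mapperEmittedRows |= !rows.isEmpty();
        }
        List<String> emitted = new ArrayList<>();
        for (int taskId = 0; taskId < rowsPerReducer.size(); taskId++) {
            emitted.addAll(rowsPerReducer.get(taskId));
            if (!mapperEmittedRows && taskId == 0) {
                emitted.add(SUMMARY);
            }
        }
        return emitted;
    }

    public static int countSummaries(List<String> emitted) {
        return Collections.frequency(emitted, SUMMARY);
    }

    public static void main(String[] args) {
        // R0 receives nothing, R1 receives (a), R2 receives the summary row.
        List<List<String>> oneRowTable = Arrays.asList(
                Collections.<String>emptyList(),
                Arrays.asList("(a)"),
                Arrays.asList(SUMMARY));
        System.out.println("local flag:   " + closeWithLocalFlag(oneRowTable));
        System.out.println("global check: " + closeWithGlobalCheck(oneRowTable));
    }
}
```

With the one-row table, the local-flag variant yields two summary rows (one from R0's closeOp and one forwarded by R2), while the global-check variant yields one. For an empty table both variants emit a single summary row from task 0.
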
> Extend grouping set limits from int to long
> -------------------------------------------
>
>                 Key: HIVE-18359
>                 URL: https://issues.apache.org/jira/browse/HIVE-18359
>             Project: Hive
>          Issue Type: Bug
>   Affects Versions: 3.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-18359.1.patch, HIVE-18359.2.patch, HIVE-18359.3.patch, HIVE-18359.4.patch, HIVE-18359.5.patch
>
>
> Grouping sets are broken for >32 columns because of the usage of Int for the bitmap
> (also the GROUPING__ID virtual column). This assumption breaks grouping
> sets/rollups/cube when the number of participating aggregation columns is >32.
> The easier fix would be to extend it to Long for now. The correct fix would be
> to use BitSets everywhere, but that would require changing the GROUPING__ID column type to
> binary, which will make predicates on GROUPING__ID difficult to deal with.
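
As a rough illustration of why an int bitmap stops working past 32 columns, the sketch below sets one bit per column position. The class and method names are hypothetical and this is not how Hive necessarily computes GROUPING__ID; it only shows the Java shift semantics involved: an int shift uses the low 5 bits of the shift count, so position 32 aliases position 0, while a long shift distinguishes up to 64 positions.

```java
// Hypothetical helper; illustrates int vs. long bit positions only.
public class GroupingBitmapSketch {
    // Bit for a column position in an int bitmap (shift count is taken mod 32).
    public static int intBit(int column) {
        return 1 << column;
    }

    // Bit for a column position in a long bitmap (shift count is taken mod 64).
    public static long longBit(int column) {
        return 1L << column;
    }

    public static void main(String[] args) {
        // Column 32 collides with column 0 in the int bitmap.
        System.out.println("int  collision: " + (intBit(32) == intBit(0)));
        // The long bitmap keeps them distinct.
        System.out.println("long collision: " + (longBit(32) == longBit(0)));
    }
}
```
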