[ 
https://issues.apache.org/jira/browse/HIVE-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321842#comment-16321842
 ] 

Prasanth Jayachandran commented on HIVE-18359:
----------------------------------------------

[~kgyrtkirk] The assumption around handling the empty grouping set (introduced 
in HIVE-17617) seems to be incorrect. As I understand it, the code has to 
handle the following two cases:
1) Empty table
2) Non-empty table

For case 1), we should emit the summary row from only ONE reducer. To pick a 
single reducer, we can use the task ID = 0 logic (although I would prefer the 
non-Tez logic, Utilities.getTaskId()).

For case 2), processOp already handles the summary row, and the summary row 
can end up in any reducer. So closeOp() should never emit a summary row when 
at least 1 row is emitted by the mapper. Currently this is broken because it 
relies on the hasOutput flag. Say we have 3 reducers R0, R1 and R2, and a 
table with 1 column and 1 row with value 'a'. If we do a rollup on this column 
we expect {(a), (null)}, with (null) being the summary row. Let's say that, 
based on (a)'s hashcode, it ends up in R1 and, based on (null)'s hashcode, it 
ends up in R2. Reducer R0 will then have its hasOutput flag set to false, as 
it did not receive any rows, so closeOp of R0 will emit a summary row (it 
happens that the task ID is also 0 in this case). So we will end up with a 
duplicate summary row in this case (null emitted by R0 and null emitted by R2).
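
The R0/R1/R2 scenario above can be reproduced with a small standalone sketch. 
It is not Hive code: the Reducer class, the closeOpLocalFlag/closeOpGlobal 
methods and the mapperEmittedAnyRow parameter are hypothetical names made up 
for illustration, and the routing of rows to reducers is fixed by hand rather 
than computed from real hashcodes. How a reducer would actually learn whether 
the mapper emitted any row is left open here.

```java
import java.util.ArrayList;
import java.util.List;

public class SummaryRowSketch {
    // Hypothetical stand-in for one reducer's group-by operator (not Hive's class).
    static final class Reducer {
        final int taskId;
        boolean hasOutput = false;
        final List<String> emitted = new ArrayList<>();

        Reducer(int taskId) { this.taskId = taskId; }

        // Every row that reaches this reducer is forwarded and marks it as non-empty.
        void processOp(String row) {
            hasOutput = true;
            emitted.add(row);
        }

        // Rule described in the comment: decide from this reducer's local flag only.
        void closeOpLocalFlag() {
            if (!hasOutput && taskId == 0) {
                emitted.add("(null)");
            }
        }

        // Alternative (assumption): emit only when the mapper produced no rows at all.
        void closeOpGlobal(boolean mapperEmittedAnyRow) {
            if (!mapperEmittedAnyRow && taskId == 0) {
                emitted.add("(null)");
            }
        }
    }

    // Runs the 3-reducer example and returns everything emitted, in reducer order.
    public static List<String> run(boolean useLocalFlag) {
        Reducer[] reducers = { new Reducer(0), new Reducer(1), new Reducer(2) };
        // Routing mirrors the example: (a) goes to R1, the summary row (null) to R2.
        reducers[1].processOp("(a)");
        reducers[2].processOp("(null)");
        boolean mapperEmittedAnyRow = true; // the table has one row

        List<String> all = new ArrayList<>();
        for (Reducer r : reducers) {
            if (useLocalFlag) {
                r.closeOpLocalFlag();
            } else {
                r.closeOpGlobal(mapperEmittedAnyRow);
            }
            all.addAll(r.emitted);
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println("local hasOutput flag: " + run(true));
        System.out.println("global emptiness:     " + run(false));
    }
}
```

With the local flag, R0 adds a second (null) row even though R2 already 
carries the summary; with the global check, only {(a), (null)} is produced.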

This is manifesting only now because the hashcodes have changed for the key 
(we are using LongWritable for groupingId now). I am guessing this bug will 
also show up, even without this patch, when a test case has a sufficient 
number of rows and a sufficient number of reducers.

Does that make sense? Feel free to correct me if my understanding of HIVE-17617 
is wrong. 


> Extend grouping set limits from int to long
> -------------------------------------------
>
>                 Key: HIVE-18359
>                 URL: https://issues.apache.org/jira/browse/HIVE-18359
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-18359.1.patch, HIVE-18359.2.patch, 
> HIVE-18359.3.patch, HIVE-18359.4.patch, HIVE-18359.5.patch
>
>
> Grouping sets are broken for >32 columns because an Int is used for the 
> bitmap (and for the GROUPING__ID virtual column). This assumption breaks 
> grouping sets/rollups/cube when the number of participating aggregation 
> columns is >32. The easier fix would be to extend it to Long for now. The 
> correct fix would be to use BitSets everywhere, but that would require 
> changing the GROUPING__ID column type to binary, which would make predicates 
> on GROUPING__ID difficult to deal with. 
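
The 32-column limit in the description comes down to the width of the bitmap 
type. A minimal Java sketch, assuming one bit per grouping column (the class 
and method names are hypothetical and this is not Hive's actual bitmap code):

```java
public class GroupingBitmapSketch {
    // Bit for column `index` in an int bitmap. Java masks an int shift count
    // to its low 5 bits, so index 32 aliases index 0.
    public static int intBit(int index) {
        return 1 << index;
    }

    // Same bit in a long bitmap; the shift count is masked to 6 bits,
    // so indexes 0..63 are all distinct.
    public static long longBit(int index) {
        return 1L << index;
    }

    public static void main(String[] args) {
        System.out.println(intBit(32) == intBit(0));   // true: column 32 collides with column 0
        System.out.println(longBit(32) == longBit(0)); // false: the columns stay distinct
    }
}
```

A long only moves the limit from 32 to 64 columns, which is why the 
description calls BitSets the correct fix.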



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
