Qiang.Kang created HIVE-23356:
---------------------------------

             Summary: Hash aggregation is always disabled while processing 
querys with grouping sets expressions.
                 Key: HIVE-23356
                 URL: https://issues.apache.org/jira/browse/HIVE-23356
             Project: Hive
          Issue Type: Bug
          Components: Hive, Operators, Query Processor
    Affects Versions: All Versions
            Reporter: Qiang.Kang
            Assignee: Qiang.Kang


Hi, we find that many MapReduce map attempts disable hash aggregation while 
processing GroupBy Operator with grouping sets expressions. Logs are just like 
this:
`Disable Hash Aggr: #hash table = 104427 #total = 100000 reduction = 1.0 
minReduction = 0.5`, which means hashTable size is even larger than the number 
of input rows.

Dig deeper we can find that GroupBy operator disables hash aggregation just 
based on the number of input rows instead of rows actually processed. For 
GroupBy Operator with grouping sets (such as: roll up, cube, etc.), 
`numberRowsProcessed` equals `groupingSets.size()*numberRowsInput`, causing 
hashTableSize much larger than querys without grouping sets expressions.

Let's consider a example below:
- Query:
SELECT col1, col2, MAX(col0) FROM tbl GROUP BY col1, col2 GROUPING SETS 
(col1,col2);
- Input data:
col0, col1, col2
[0,-1,-1]
[1,-1,-1]
[2,-1,-1]
[3,3,3]
[4,4,4]

- Configuration:
hive.groupby.mapaggr.checkinterval = 1;
minReductionHashAggr = 0.5;

- Currently, there are 10 rows in partial aggregate result. (hash aggregation 
is disabled very early):
[-1, null, 0]
[null, -1, 0]
[-1, null, 1]
[null, -1, 1]
[-1, null, 2]
[null, -1, 2]
[3, null, 3]
[null, 3, 3]
[4, null, 4]
[null, 4, 4]

The patch below disables hash aggregation based on the number of rows actually 
processed, reducing partial result entries a lot and improving the performance 
of aggregation. There are only 6 rows in partial aggregate result:

[-1, null, 2]
[null, -1, 2]
[null, 3, 3]
[4, null, 4]
[null, 4, 4]
[3, null, 3]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to