mahesh kumar behera created HIVE-24580:
------------------------------------------

             Summary: Add support for combiner in hash mode group aggregation 
(Support for distinct)
                 Key: HIVE-24580
                 URL: https://issues.apache.org/jira/browse/HIVE-24580
             Project: Hive
          Issue Type: Bug
          Components: Hive
            Reporter: mahesh kumar behera
            Assignee: mahesh kumar behera


In map-side group aggregation, partial grouped aggregates are computed to 
reduce the data written to disk by the map task. In hash aggregation, where 
the input data is not sorted, a hash table is used (with sorting also being 
performed before flushing). If the hash table grows beyond a configurable 
limit, its data is flushed to disk and a new hash table is created. If 
the reduction achieved by the hash table is less than the minimum hash 
aggregation reduction calculated at compile time, map-side aggregation is 
converted to streaming mode. So if the first few batches of records do not 
yield a significant reduction, the mode is switched to streaming. This may 
hurt performance if the subsequent batches of records have fewer distinct 
values. 
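
The behaviour described above can be sketched roughly as follows. This is a
simplified illustration, not Hive's actual operator code: the class name
MapSideAggSketch, the parameters maxEntries, checkInterval and minReduction,
and the use of a sum as the aggregate are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; all names here are hypothetical, not Hive APIs.
final class MapSideAggSketch {
    private final int maxEntries;       // assumed hash table size limit
    private final long checkInterval;   // rows seen before the reduction check
    private final double minReduction;  // max allowed ratio of entries to rows
    // A TreeMap stands in for "hash table + sort before flush".
    private final Map<String, Long> table = new TreeMap<>();
    private final List<String> emitted = new ArrayList<>();
    private long rowsSeen = 0;
    private boolean streaming = false;

    MapSideAggSketch(int maxEntries, long checkInterval, double minReduction) {
        this.maxEntries = maxEntries;
        this.checkInterval = checkInterval;
        this.minReduction = minReduction;
    }

    void process(String key, long value) {
        if (streaming) {
            // Streaming mode: rows pass through without aggregation.
            emitted.add(key + "=" + value);
            return;
        }
        rowsSeen++;
        table.merge(key, value, Long::sum);
        if (rowsSeen == checkInterval
                && emitted.size() + table.size() > minReduction * rowsSeen) {
            // Too little reduction in the first batch: give up on hashing.
            flush();
            streaming = true;
        } else if (table.size() >= maxEntries) {
            // Table is full: write it out and start a new one.
            flush();
        }
    }

    void close() {
        flush();
    }

    boolean isStreaming() {
        return streaming;
    }

    List<String> emitted() {
        return emitted;
    }

    private void flush() {
        for (Map.Entry<String, Long> e : table.entrySet()) {
            emitted.add(e.getKey() + "=" + e.getValue());
        }
        table.clear();
    }
}
```

In this sketch, if the first checkInterval rows all carry different keys, the
aggregator switches to streaming, and any repeated keys that arrive later are
emitted one row at a time without being aggregated.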

To improve performance in both hash and streaming mode, a combiner can be added 
to the map task after the keys are sorted. This ensures that aggregation is 
done wherever possible and reduces the data written to disk.
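
A minimal sketch of the combiner idea, assuming the map output is a list of
(key, partial value) records and the aggregate is a sum. SortedCombinerSketch,
Rec and combine are hypothetical names for illustration; the actual change may
hook into the map task differently.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not Hive APIs.
final class SortedCombinerSketch {
    static final class Rec {
        final String key;
        final long value;

        Rec(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    // Sorts records by key, then merges runs of equal keys into one record.
    static List<Rec> combine(List<Rec> records) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> a.key.compareTo(b.key));

        List<Rec> out = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        for (Rec r : sorted) {
            if (currentKey != null && !currentKey.equals(r.key)) {
                // Key changed: emit the finished run.
                out.add(new Rec(currentKey, sum));
                sum = 0;
            }
            currentKey = r.key;
            sum += r.value;
        }
        if (currentKey != null) {
            out.add(new Rec(currentKey, sum));
        }
        return out;
    }
}
```

Because equal keys are adjacent after sorting, a single pass is enough, and
the same step applies whether the records came from flushed hash tables or
from streaming mode.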



--
This message was sent by Atlassian Jira
(v8.3.4#803005)