[ 
https://issues.apache.org/jira/browse/HIVE-20108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540599#comment-16540599
 ] 

Sahil Takiar commented on HIVE-20108:
-------------------------------------

Another limitation of {{groupByKey}} is that it can't push any aggregation 
into the shuffle logic. What we probably want here is something like 
{{combineByKey}} or {{reduceByKey}}. These functions allow specifying an 
{{Aggregator}}, which is called in {{ExternalAppendOnlyMap}} to push aggregation 
into the shuffle reader (e.g. {{BlockStoreShuffleReader}}). This allows 
aggregating data in memory, which avoids having to spill as much data to disk. 
{{groupByKey}} doesn't have this functionality: all data has to be stored in 
the {{ExternalAppendOnlyMap}} before any aggregation by Hive can start, which 
can result in a lot more spilled data.
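
To make the difference concrete, here is a rough plain-Java model of the two 
strategies. It is only a sketch: it uses no Spark or Hive classes, the class and 
method names are made up for illustration, and a simple sum stands in for 
whatever aggregation Hive would run. It counts how many items the reader-side 
map has to hold in each case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only: no Spark or Hive classes are used, and all names are hypothetical.
final class ShuffleAggregationSketch {

    // groupByKey-style: buffer every value per key, aggregate only afterwards.
    // Returns how many values were held in the map before summing could start.
    static int valuesBufferedWhenGrouping(List<Map.Entry<String, Long>> records) {
        Map<String, List<Long>> groups = new HashMap<>();
        int buffered = 0;
        for (Map.Entry<String, Long> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
            buffered++;
        }
        return buffered;
    }

    // reduceByKey-style: fold each value into a running sum as it arrives.
    static Map<String, Long> sumByCombining(List<Map.Entry<String, Long>> records) {
        Map<String, Long> sums = new HashMap<>();
        for (Map.Entry<String, Long> r : records) {
            sums.merge(r.getKey(), r.getValue(), Long::sum);
        }
        return sums;
    }

    // Returns how many entries the combining map holds: one running sum per distinct key.
    static int valuesBufferedWhenCombining(List<Map.Entry<String, Long>> records) {
        return sumByCombining(records).size();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = List.of(
                Map.entry("a", 1L), Map.entry("a", 2L),
                Map.entry("a", 3L), Map.entry("b", 10L));
        System.out.println("grouping holds " + valuesBufferedWhenGrouping(records) + " values");
        System.out.println("combining holds " + valuesBufferedWhenCombining(records) + " sums");
    }
}
```

In this model the grouping path holds one entry per input record, while the 
combining path holds one entry per distinct key, which is the effect the 
{{Aggregator}} path is expected to have on memory use and spills.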

Getting {{combineByKey}} / {{reduceByKey}} to work with HoS looks tricky: we 
basically have to wrap the {{GroupByOperator}} into a function that Spark can 
call. If that works, it should significantly decrease the chance of the OOMs 
that have been seen with {{groupByKey}}, since we would be pushing the 
aggregation as far down as possible, which results in less data being stored 
in memory and spilled to disk.
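
For reference, {{combineByKey}} takes three functions: one that creates a 
combiner from the first value of a key, one that merges a further value into an 
existing combiner, and one that merges two combiners. The sketch below mimics 
that contract in plain Java for an AVG-style aggregation. Everything in it is an 
assumption for illustration: {{AvgBuffer}} is a hypothetical stand-in for a Hive 
aggregation buffer, no Spark or Hive classes are used, and how 
{{GroupByOperator}} would really be wrapped is exactly the open question.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Plain-Java model of the three-function combiner contract; all names are hypothetical.
final class CombinerContractSketch {

    // Hypothetical partial-aggregation state for AVG: running sum and count.
    static final class AvgBuffer {
        final long sum;
        final long count;

        AvgBuffer(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        double result() {
            return (double) sum / count;
        }
    }

    // createCombiner: the first value seen for a key starts a new buffer.
    static final Function<Long, AvgBuffer> CREATE = v -> new AvgBuffer(v, 1);

    // mergeValue: fold one more value into an existing buffer.
    static final BiFunction<AvgBuffer, Long, AvgBuffer> MERGE_VALUE =
            (b, v) -> new AvgBuffer(b.sum + v, b.count + 1);

    // mergeCombiners: merge two partial buffers for the same key.
    static final BiFunction<AvgBuffer, AvgBuffer, AvgBuffer> MERGE_COMBINERS =
            (a, b) -> new AvgBuffer(a.sum + b.sum, a.count + b.count);

    // Partially aggregates one partition's records, key by key.
    static Map<String, AvgBuffer> combinePartition(List<Map.Entry<String, Long>> partition) {
        Map<String, AvgBuffer> out = new HashMap<>();
        for (Map.Entry<String, Long> r : partition) {
            AvgBuffer existing = out.get(r.getKey());
            out.put(r.getKey(), existing == null
                    ? CREATE.apply(r.getValue())
                    : MERGE_VALUE.apply(existing, r.getValue()));
        }
        return out;
    }

    // Merges the partial results of two partitions into final buffers.
    static Map<String, AvgBuffer> mergePartitions(Map<String, AvgBuffer> left,
                                                  Map<String, AvgBuffer> right) {
        Map<String, AvgBuffer> out = new HashMap<>(left);
        right.forEach((k, v) -> out.merge(k, v, MERGE_COMBINERS));
        return out;
    }

    public static void main(String[] args) {
        Map<String, AvgBuffer> p1 = combinePartition(
                List.of(Map.entry("a", 2L), Map.entry("a", 4L), Map.entry("b", 10L)));
        Map<String, AvgBuffer> p2 = combinePartition(List.of(Map.entry("a", 6L)));
        mergePartitions(p1, p2).forEach((k, v) -> System.out.println(k + " avg=" + v.result()));
    }
}
```

The point of the sketch is that only the small partial buffers cross the 
partition boundary, not the individual values, which is why an aggregation has 
to be expressible as mergeable partial state before it can be pushed down this 
way.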

> Investigate alternatives to groupByKey
> --------------------------------------
>
>                 Key: HIVE-20108
>                 URL: https://issues.apache.org/jira/browse/HIVE-20108
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> We use {{groupByKey}} for aggregations (or if 
> {{hive.spark.use.groupby.shuffle}} is false we use 
> {{repartitionAndSortWithinPartitions}}).
> {{groupByKey}} has its drawbacks because it can't spill records within a 
> single key group. It also seems to be doing some unnecessary work in Spark's 
> {{Aggregator}} (not positive about this part).
> {{repartitionAndSortWithinPartitions}} is better, but the sorting within 
> partitions isn't necessary for aggregations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
