[
https://issues.apache.org/jira/browse/HIVE-20108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540599#comment-16540599
]
Sahil Takiar commented on HIVE-20108:
-------------------------------------
Another limitation of {{groupByKey}} is that it can't push any aggregation into
the shuffle logic. What we probably want here is something like
{{combineByKey}} or {{reduceByKey}}. These functions allow specifying an
{{Aggregator}} which is called in {{ExternalAppendOnlyMap}} to push aggregation
into the shuffle reader (e.g. {{BlockStoreShuffleReader}}). This allows
aggregating data in memory, which avoids having to spill as much data to disk.
{{groupByKey}} doesn't have this functionality: all data has to be stored in
the {{ExternalAppendOnlyMap}} before any aggregation by Hive can start, which
can result in a lot more spilled data.
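To make the difference concrete, here is a rough sketch in plain Python (no Spark involved). It only models the idea: {{group_then_aggregate}} and {{combine_by_key}} are hypothetical helper names, not Spark or Hive APIs, and a simple sum stands in for whatever aggregation Hive would actually run. The first function buffers every value per key before aggregating, the second folds each value into a per-key accumulator as it arrives.

```python
from collections import defaultdict


def group_then_aggregate(records):
    """groupByKey-style: buffer every value per key, aggregate afterwards."""
    buffered = defaultdict(list)
    for key, value in records:
        buffered[key].append(value)
    # Every input value is held in the buffers before aggregation starts.
    values_held = sum(len(values) for values in buffered.values())
    totals = {key: sum(values) for key, values in buffered.items()}
    return totals, values_held


def combine_by_key(records, create_combiner, merge_value):
    """combineByKey-style: fold each value into a per-key accumulator."""
    combiners = {}
    for key, value in records:
        if key in combiners:
            combiners[key] = merge_value(combiners[key], value)
        else:
            combiners[key] = create_combiner(value)
    # Only one accumulator per key is ever held.
    return combiners, len(combiners)


records = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

grouped_totals, grouped_held = group_then_aggregate(records)
combined_totals, combined_held = combine_by_key(
    records,
    create_combiner=lambda value: value,
    merge_value=lambda acc, value: acc + value,
)
```

Both paths produce the same totals, but the grouped path holds one entry per input record while the combined path holds one entry per distinct key. That gap is what would translate into less memory pressure and less spilling.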
Getting {{combineByKey}} / {{reduceByKey}} to work with HoS (Hive on Spark)
looks tricky: we basically have to wrap the {{GroupByOperator}} into a function
that Spark can call. If that works, it should significantly decrease the chance
of the OOMs that have been seen with {{groupByKey}}, since we would be pushing
the aggregation as far down as possible, which results in less data being
stored in memory and spilled to disk.
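As a rough illustration of what "a function that Spark can call" might look like, the sketch below splits an AVG aggregation into the three callbacks that {{combineByKey}} expects (create a combiner, merge a value, merge two combiners). It is plain Python and makes several assumptions: {{AvgBuffer}} is a hypothetical stand-in for a per-key aggregation buffer, not a Hive class, and {{combine_partition}} / {{merge_partitions}} only simulate the two sides of a shuffle. It says nothing about how {{GroupByOperator}} would really be wrapped.

```python
class AvgBuffer:
    """Hypothetical per-key aggregation buffer for AVG (not a Hive class)."""

    def __init__(self, total=0.0, count=0):
        self.total = total
        self.count = count

    def result(self):
        return self.total / self.count


def create_combiner(value):
    # Called for the first value seen for a key.
    return AvgBuffer(value, 1)


def merge_value(buf, value):
    # Called for each further value of a key.
    buf.total += value
    buf.count += 1
    return buf


def merge_combiners(left, right):
    # Called to merge partial results for the same key.
    return AvgBuffer(left.total + right.total, left.count + right.count)


def combine_partition(rows):
    """Simulates aggregating one partition's rows with the callbacks."""
    combiners = {}
    for key, value in rows:
        if key in combiners:
            combiners[key] = merge_value(combiners[key], value)
        else:
            combiners[key] = create_combiner(value)
    return combiners


def merge_partitions(partials):
    """Simulates merging the partial results from several partitions."""
    merged = {}
    for partial in partials:
        for key, buf in partial.items():
            merged[key] = merge_combiners(merged[key], buf) if key in merged else buf
    return merged


partitions = [
    [("x", 2.0), ("y", 10.0), ("x", 4.0)],
    [("x", 6.0), ("y", 20.0)],
]
merged = merge_partitions(combine_partition(rows) for rows in partitions)
averages = {key: buf.result() for key, buf in merged.items()}
```

The point of the split is that each callback only ever sees one buffer and one value (or two buffers), so no key group has to be fully materialized before aggregation starts.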
> Investigate alternatives to groupByKey
> --------------------------------------
>
> Key: HIVE-20108
> URL: https://issues.apache.org/jira/browse/HIVE-20108
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
>
> We use {{groupByKey}} for aggregations (or if
> {{hive.spark.use.groupby.shuffle}} is false we use
> {{repartitionAndSortWithinPartitions}}).
> {{groupByKey}} has its drawbacks because it can't spill records within a
> single key group. It also seems to be doing some unnecessary work in Spark's
> {{Aggregator}} (not positive about this part).
> {{repartitionAndSortWithinPartitions}} is better, but the sorting within
> partitions isn't necessary for aggregations.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)