[
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828155#comment-15828155
]
Xuefu Zhang edited comment on HIVE-15580 at 1/18/17 2:24 PM:
-------------------------------------------------------------
Hi [~lirui], your understanding is correct.
And yes, groupByKey uses unbounded memory. While Spark can spill to disk for
groupBy, the spilling has to happen at a key/group boundary. In other words,
one has to have enough memory to hold any given key group. Thus, for a big key
group, Spark can still run out of memory.
Ref:
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-td11427.html
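To illustrate the point (a plain-Python sketch, not the Spark API): grouping must materialize the entire value list of a key before it can be emitted, so per-key memory grows with the group size, whereas a combiner-style aggregation (like Spark's reduceByKey/aggregateByKey) keeps only one running value per key, which stays bounded no matter how big the group is.

```python
def group_by_key(pairs):
    # Unbounded: the whole value list for each key is held in memory at once,
    # so a single huge key group can exhaust memory.
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return groups

def reduce_by_key(pairs, fn):
    # Bounded: per-key state is a single accumulated value, merged incrementally.
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
print(group_by_key(pairs))                        # {'a': [1, 3, 4], 'b': [2]}
print(reduce_by_key(pairs, lambda x, y: x + y))   # {'a': 8, 'b': 2}
```

The takeaway is that an aggregation expressible as an incremental merge avoids holding any key group in memory, which is the kind of bounded-memory replacement this issue asks for.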
> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch,
> HIVE-15580.2.patch, HIVE-15580.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)