[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828155#comment-15828155
 ] 

Xuefu Zhang edited comment on HIVE-15580 at 1/18/17 2:24 PM:
-------------------------------------------------------------

Hi [~lirui], your understanding is correct.

And yes, groupByKey uses unbounded memory. While Spark can spill to disk for 
groupBy, the spilling has to happen at the key/group boundary. In other words, 
one has to have enough memory to hold any given key group. Thus, for a big key 
group, Spark can still run out of memory.

Ref: 
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-td11427.html
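To illustrate the distinction (a plain-Python sketch, not the Hive/Spark code itself): collecting all values per key the way groupByKey does needs memory proportional to the largest key group, while a sort-based approach can stream through each group and keep only a running aggregate resident.

```python
from itertools import groupby
from operator import itemgetter

records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Unbounded: materialize every value for a key at once (groupByKey-style).
# A single huge key group would exhaust memory here.
grouped = {}
for k, v in records:
    grouped.setdefault(k, []).append(v)

# Bounded: sort by key, then stream each group lazily; only one value plus
# the running aggregate is resident at a time, regardless of group size.
sums = {}
for k, group in groupby(sorted(records, key=itemgetter(0)), key=itemgetter(0)):
    total = 0
    for _, v in group:  # iterate lazily; no per-key list is ever built
        total += v
    sums[k] = total
```

This mirrors the general idea of replacing groupByKey with a sort-based shuffle plus streaming iteration over sorted keys.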



> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
