[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828143#comment-15828143 ]

Rui Li commented on HIVE-15580:
-------------------------------

Hi [~xuefuz], I'd like to check my understanding too. Before the patch, we have 
three kinds of shuffle: groupByKey, sortByKey and 
repartitionAndSortWithinPartitions. For the last two, we do the grouping 
ourselves (because the reducer expects <Key, Iterator<Value>>). This grouping 
uses unbounded memory, which is the root cause of HIVE-15527.
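
To make the unbounded part concrete, here is a rough sketch of what "doing the 
grouping ourselves" over sorted shuffle output looks like. This is not the 
actual Hive code -- GroupSortedIterator and the field names are just for 
illustration -- but it shows the problem: every value of the current key is 
buffered before the reducer sees <Key, Iterator<Value>>, and that buffer can 
grow without bound.

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import scala.Tuple2;

// Illustrative only: groups records that are already sorted by key within a
// partition, buffering all values of the current key in memory.
class GroupSortedIterator<K, V> {
  private final Iterator<Tuple2<K, V>> sorted; // sorted by key within the partition
  private Tuple2<K, V> pending;                // first record of the next group, if any

  GroupSortedIterator(Iterator<Tuple2<K, V>> sorted) {
    this.sorted = sorted;
  }

  /** Returns the next <key, all values of that key>, or null when exhausted. */
  Tuple2<K, List<V>> next() {
    if (pending == null && !sorted.hasNext()) {
      return null;
    }
    Tuple2<K, V> first = pending != null ? pending : sorted.next();
    pending = null;
    K key = first._1();
    List<V> values = new ArrayList<>(); // grows without bound for a large key group
    values.add(first._2());
    while (sorted.hasNext()) {
      Tuple2<K, V> t = sorted.next();
      if (!t._1().equals(key)) {
        pending = t; // start of the next group
        break;
      }
      values.add(t._2());
    }
    return new Tuple2<>(key, values);
  }
}
{code}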

With the patch, we'll replace groupByKey with 
repartitionAndSortWithinPartitions, and we won't have to do the grouping 
ourselves because the GBY operator will do that for us. Is this correct?
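
At the RDD level, my understanding of the two paths is roughly the sketch 
below (assuming Spark's Java API; the class and variable names are 
illustrative, not taken from the patch):

{code:java}
import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ShuffleSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "shuffle-sketch");
    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));

    // Old path: Spark materializes every value of a key into one Iterable.
    JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey();

    // New path: records are only partitioned and sorted by key; nothing is
    // grouped. A downstream consumer (the GBY operator in Hive's case) can
    // stream through the sorted records and start a new group whenever the
    // key changes.
    JavaPairRDD<String, Integer> sorted =
        pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2));

    System.out.println(grouped.collect());
    System.out.println(sorted.collect());
    sc.stop();
  }
}
{code}

With the second path nothing is pre-grouped, so per-key memory stays bounded 
as long as the consumer processes each group as a stream.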
BTW, is there any doc indicating that Spark's groupByKey uses unbounded 
memory? I think Spark can spill the shuffled data to disk if it gets too large.

> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
