[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828143#comment-15828143 ]
Rui Li commented on HIVE-15580:
-------------------------------

Hi [~xuefuz], I'd like to check my understanding too. Before the patch, we have three kinds of shuffle: groupByKey, sortByKey, and repartitionAndSortWithinPartitions. For the last two, we do the grouping ourselves, because the reducer expects <Key, Iterator<Value>>. This grouping uses unbounded memory, which is the root cause of HIVE-15527. With the patch, we'll replace groupByKey with repartitionAndSortWithinPartitions, and we no longer have to do the grouping ourselves because the GBY operator will do it for us. Is this correct?

BTW, is there any doc indicating that Spark's groupByKey uses unbounded memory? I think Spark can spill the shuffled data to disk if it's too large.

> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>     Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, HIVE-15580.2.patch, HIVE-15580.patch
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
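[Editor's note] The distinction the comment turns on — materializing every value for a key in memory versus streaming groups out of key-sorted shuffle output — can be sketched outside Spark. This is a minimal Python simulation, not the actual Hive/Spark code: the function names and the sample partition data are hypothetical, and a plain list stands in for a partition delivered by the shuffle.

```python
from itertools import groupby
from operator import itemgetter

def group_by_key(pairs):
    """Mimics groupByKey-style grouping: every value for a key is
    buffered in an in-memory list before the reducer sees it, so the
    per-key memory footprint is unbounded (hypothetical sketch)."""
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return groups

def grouped_after_sort(sorted_pairs):
    """Mimics grouping on top of repartitionAndSortWithinPartitions:
    the shuffle delivers pairs sorted by key, so pairs with the same
    key are consecutive and each group can be streamed out one at a
    time without buffering the whole partition (hypothetical sketch)."""
    for k, run in groupby(sorted_pairs, key=itemgetter(0)):
        yield k, [v for _, v in run]

# Hypothetical shuffle output for one partition, already sorted by key.
partition = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]

# Both approaches produce the same groups; only the buffering differs.
assert group_by_key(partition) == {"a": [1, 2], "b": [3], "c": [4, 5]}
assert dict(grouped_after_sort(partition)) == group_by_key(partition)
```

In the real Hive-on-Spark code the role of `grouped_after_sort` would be played by the GBY operator consuming the sorted stream, which is what makes the explicit grouping step (and its unbounded buffer) unnecessary after the patch.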