[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827399#comment-15827399 ]
Ferdinand Xu commented on HIVE-15580:
-------------------------------------

Hi [~xuefuz], the main change replaces *groupByKey* with *repartitionAndSortWithinPartitions*. Please help me check my understanding.

Before this patch, e.g. GroupByShuffle produces the following result:
K1 -> iterator of {V11,V12,V13...}
K2 -> iterator of {V21,V22,V23...}
...

With this patch:
K1 -> V11
K1 -> V12
K1 -> V13
...
K2 -> V21
...

We then process the records one by one, without fetching the values from an iterator. If so, does this change have any side effects?

> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, HIVE-15580.2.patch, HIVE-15580.patch
>
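
To make the two record shapes from the comment concrete, here is a minimal sketch in plain Python. It does not use Spark and is not taken from the patch. The sample records and the helper names (grouped_shape, sorted_flat_shape, count_per_key) are hypothetical and exist only for illustration. It assumes a single partition whose records are sorted by key, which is roughly what the consumer sees after repartitionAndSortWithinPartitions.

```python
from operator import itemgetter

# Hypothetical sample records; names mirror the K1/V11 notation in the comment.
records = [("K2", "V21"), ("K1", "V11"), ("K1", "V12"), ("K2", "V22"), ("K1", "V13")]


def grouped_shape(pairs):
    """Grouped shape: each key maps to a collection holding all of its values."""
    out = {}
    for key, value in pairs:
        out.setdefault(key, []).append(value)
    return out


def sorted_flat_shape(pairs):
    """Flat shape: individual (key, value) records, sorted by key.

    Python's sort is stable, so values keep their input order within a key.
    That ordering is a property of this sketch, not something assumed of Spark.
    """
    return sorted(pairs, key=itemgetter(0))


def count_per_key(sorted_pairs):
    """Consume flat, key-sorted records one at a time.

    A change in key marks a group boundary, so only the current key and a
    running counter are held in memory, never a whole group's values.
    """
    counts = []
    current_key, n = None, 0
    for key, _value in sorted_pairs:
        if key != current_key:
            if current_key is not None:
                counts.append((current_key, n))
            current_key, n = key, 0
        n += 1
    if current_key is not None:
        counts.append((current_key, n))
    return counts


if __name__ == "__main__":
    print(grouped_shape(records))
    print(sorted_flat_shape(records))
    print(count_per_key(sorted_flat_shape(records)))
```

In the flat shape the consumer has to detect key boundaries itself, which is the main behavioral difference a downstream operator would need to account for.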