[ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827470#comment-15827470 ]
Xuefu Zhang commented on HIVE-15580:
------------------------------------

[~Ferd], functionally I don't see anything bad, because groupByKey was used in Hive for aggregation, and with this patch Hive's group-by operator is able to process one row at a time. Performance-wise, I'm not sure whether this will improve or degrade; that depends on the performance difference between groupByKey() + value iterator and repartitionAndSortWithinPartitions() + dummy value iterator. It would be great if you guys can find out.

The obvious benefit of this change is that Hive on Spark overcomes the unbounded memory usage of groupByKey(). The patch also solves the problem in HIVE-15527.

Please note that this patch is WIP. We will improve it, for example by getting rid of the dummy value iterator created per row. I manually ran all Spark tests with this patch, and there was only one test failure, which needs investigation.

> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, HIVE-15580.2.patch, HIVE-15580.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
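To illustrate the memory trade-off being discussed, here is a minimal plain-Python sketch (not the actual Spark or Hive API): a groupByKey-style aggregation that materializes every value for a key in memory at once, versus a sort-then-stream approach in the spirit of repartitionAndSortWithinPartitions(), where the group-by operator consumes one row at a time and only keeps the current key's running aggregate. The function names and the sum aggregation are illustrative assumptions, not code from the patch.

```python
def group_by_key(pairs):
    """groupByKey-style: collects ALL values per key before aggregating.
    Memory for a hot key is unbounded -- this is the problem the patch avoids."""
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)  # whole value list held in memory
    return {k: sum(vs) for k, vs in groups.items()}

def sort_then_stream(pairs):
    """Sort-based style: rows arrive sorted by key (the sort stands in for
    Spark's shuffle sort), so we aggregate one row at a time, holding only
    the current key's running total -- bounded memory per key."""
    result = {}
    current_key, running = None, 0
    for k, v in sorted(pairs, key=lambda kv: kv[0]):
        if k != current_key:
            if current_key is not None:
                result[current_key] = running
            current_key, running = k, 0
        running += v  # one row at a time; no per-key value list is built
    if current_key is not None:
        result[current_key] = running
    return result

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
assert group_by_key(pairs) == sort_then_stream(pairs) == {"a": 9, "b": 6}
```

Both paths produce identical results; the difference is that the sort-based path never builds a per-key value list, which is why the change bounds memory regardless of key skew.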