[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827470#comment-15827470
 ] 

Xuefu Zhang commented on HIVE-15580:
------------------------------------

[~Ferd], functionally I don't see any problem, because groupByKey was used in 
Hive for aggregation, and with this patch Hive's GroupBy operator can process 
one row at a time. Performance-wise, I'm not sure whether this will improve 
or degrade things. That depends on how groupByKey() + value iterator compares 
with repartitionAndSortWithinPartitions() + dummy value iterator. It would be 
great if you could measure this.

The obvious benefit of this change is that Hive on Spark overcomes the 
unbounded memory usage of groupByKey(). The patch also solves the problem in 
HIVE-15527.
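
To make the memory difference concrete, here is a minimal sketch in plain 
Python. It is not Spark or Hive code: the function names are hypothetical, 
sum() stands in for an arbitrary aggregation, and sorted() stands in for the 
sort that repartitionAndSortWithinPartitions() performs during the shuffle. 
The first function mimics the groupByKey() pattern of collecting every value 
for a key before aggregating; the second consumes key-sorted rows one at a 
time and keeps only the current key and its running total.

```python
from operator import itemgetter


def aggregate_grouped(rows):
    """groupByKey-style: collect every value for a key, then aggregate."""
    groups = {}
    for key, value in rows:
        # The whole group is held in memory before any aggregation happens.
        groups.setdefault(key, []).append(value)
    return {key: sum(values) for key, values in groups.items()}


def aggregate_sorted_stream(sorted_rows):
    """Sort-based style: consume key-sorted rows one at a time.

    Only the current key and its running total are kept, so memory use
    does not grow with the number of values per key.
    """
    current_key = None
    total = 0
    has_key = False
    for key, value in sorted_rows:
        if has_key and key != current_key:
            # Key boundary reached: emit the finished group and reset.
            yield current_key, total
            total = 0
        current_key = key
        has_key = True
        total += value
    if has_key:
        yield current_key, total


# Hypothetical sample data, for illustration only.
rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]
grouped = aggregate_grouped(rows)
streamed = dict(aggregate_sorted_stream(sorted(rows, key=itemgetter(0))))
```

Both approaches produce the same totals per key. The sketch says nothing 
about relative speed, which still has to be measured on real workloads.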

Please note that this patch is WIP. We will improve it, for example by 
getting rid of the dummy value iterator created per row.

I manually ran all Spark tests with this patch, and there was only one test 
failure, which needs investigation.

> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
