[ 
https://issues.apache.org/jira/browse/HIVE-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858605#comment-15858605
 ] 

Chao Sun commented on HIVE-15683:
---------------------------------

+1

> Make what's done in HIVE-15580 for group by configurable
> --------------------------------------------------------
>
>                 Key: HIVE-15683
>                 URL: https://issues.apache.org/jira/browse/HIVE-15683
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.2.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15683.1.patch, HIVE-15683.2.patch, HIVE-15683.patch
>
>
> HIVE-15580 changed the way data is shuffled for group by: instead of 
> using Spark's groupByKey to shuffle data, Hive on Spark now uses 
> repartitionAndSortWithinPartitions(), which generates (key, value) pairs 
> instead of the original (key, value iterator) pairs. This might have some 
> performance implications, but it is needed to get rid of the unbounded 
> memory usage of {{groupByKey}}.
> Here we'd like to compare group by performance with and without HIVE-15580. 
> If the impact is significant, we can provide a configuration that allows 
> users to switch back to the original way of shuffling.
> This work should ideally be done after HIVE-15682, as the optimization there 
> should help the performance here as well.
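
For readers unfamiliar with the two shuffle shapes described above, here is a minimal sketch in plain Python. It is not Spark or Hive code: the function names and the use_group_by_key_shuffle flag are hypothetical and only illustrate the idea of a switch between the two strategies. The first shape holds all values of a key together, while the second is a flat, key-sorted stream of (key, value) pairs that can be aggregated with only a running total per key.

```python
from collections import defaultdict


def shuffle_group_by_key(pairs):
    """Mimic the shape produced by groupByKey: one (key, values) entry per key.

    All values for a key are held in memory together, which is the source of
    the unbounded memory usage mentioned in the issue.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def shuffle_sorted_pairs(pairs):
    """Mimic the shape of one partition after repartitionAndSortWithinPartitions:
    flat (key, value) pairs sorted by key (partitioning itself is not modeled).
    """
    return sorted(pairs, key=lambda kv: kv[0])


def sum_from_grouped(grouped):
    """Aggregate from (key, values) entries."""
    return [(key, sum(values)) for key, values in grouped]


def sum_from_sorted_pairs(sorted_pairs):
    """Aggregate from key-sorted pairs, keeping only a running total per key."""
    results = []
    current_key, total, seen = None, 0, False
    for key, value in sorted_pairs:
        if seen and key != current_key:
            # Key boundary: emit the finished group and start a new one.
            results.append((current_key, total))
            total = 0
        current_key, seen = key, True
        total += value
    if seen:
        results.append((current_key, total))
    return results


def group_by_sum(pairs, use_group_by_key_shuffle):
    """Hypothetical switch between the two strategies (assumed flag name)."""
    if use_group_by_key_shuffle:
        return sum_from_grouped(shuffle_group_by_key(pairs))
    return sum_from_sorted_pairs(shuffle_sorted_pairs(pairs))
```

Both paths return the same per-key sums. The difference is only in the intermediate shape, which is what the proposed performance comparison would measure in the real system.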



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
