[ https://issues.apache.org/jira/browse/HIVE-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858605#comment-15858605 ]
Chao Sun commented on HIVE-15683:
---------------------------------
+1
> Make what's done in HIVE-15580 for group by configurable
> --------------------------------------------------------
>
> Key: HIVE-15683
> URL: https://issues.apache.org/jira/browse/HIVE-15683
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Affects Versions: 2.2.0
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Attachments: HIVE-15683.1.patch, HIVE-15683.2.patch, HIVE-15683.patch
>
>
> HIVE-15580 changed the way data is shuffled for group by: instead of
> using Spark's groupByKey to shuffle data, Hive on Spark now uses
> repartitionAndSortWithinPartitions(), which generates (key, value) pairs
> instead of the original (key, value iterator) pairs. This might have some
> performance implications, but it is needed to get rid of the unbounded
> memory usage of {{groupByKey}}.
> Here we'd like to compare group by performance with and without HIVE-15580.
> If the impact is significant, we can provide a configuration that allows
> users to switch back to the original way of shuffling.
> This work should ideally be done after HIVE-15682, as the optimization there
> should help the performance here as well.
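
To make the two shuffle shapes in the description concrete, here is a minimal plain-Python sketch. It is not Hive or Spark code and calls no Spark API: `group_by_key` and `sum_sorted_pairs` are hypothetical helper names, and the running sum is an assumed stand-in for whatever aggregation the group by performs. It only contrasts consuming (key, value iterator) groups with consuming a key-sorted stream of (key, value) pairs.

```python
from operator import itemgetter


def group_by_key(pairs):
    """(key, value iterator) shape: collect all values of a key before use.

    The per-key list is held in memory, so memory grows with group size.
    """
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def sum_sorted_pairs(sorted_pairs):
    """(key, value) shape: stream key-sorted pairs and detect key boundaries.

    Only the current key and a running total are kept at any time.
    """
    current_key, total, started = None, 0, False
    for key, value in sorted_pairs:
        if started and key != current_key:
            # Key changed: the previous group is complete, emit it.
            yield current_key, total
            total = 0
        current_key, started = key, True
        total += value
    if started:
        yield current_key, total


pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]

# Grouped shape: aggregate each materialized value list.
via_groups = {k: sum(vs) for k, vs in group_by_key(pairs).items()}

# Sorted-pairs shape: sort by key (standing in for the shuffle), then stream.
via_stream = dict(sum_sorted_pairs(sorted(pairs, key=itemgetter(0))))
```

Both paths produce the same per-key sums. The difference is that the streaming version never holds a whole group at once, at the price of sorting and of per-pair boundary checks, which is the kind of trade-off the proposed performance comparison would measure.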