[ https://issues.apache.org/jira/browse/HIVE-20141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sahil Takiar reassigned HIVE-20141:
-----------------------------------
> Turn hive.spark.use.groupby.shuffle off by default
> --------------------------------------------------
>
> Key: HIVE-20141
> URL: https://issues.apache.org/jira/browse/HIVE-20141
> Project: Hive
> Issue Type: Task
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
>
> [~xuefuz] any thoughts on this? I think turning this config off would
> provide better out-of-the-box behavior for Hive-on-Spark (HoS) users,
> especially those migrating from Hive-on-MR (HoMR). What has your experience
> with this config been?
>
> I've done a bunch of performance profiling with this config turned on vs.
> off, and for TPC-DS queries it doesn't make a significant difference. The
> main difference I can see is that when a Spark stage has to spill to disk,
> {{repartitionAndSortWithinPartitions}} spills more data than {{groupByKey}}.
> My guess is that this happens because {{groupByKey}} stores everything in
> Spark's {{ExternalAppendOnlyMap}}, which keeps only a single copy of each key
> for potentially multiple values, whereas
> {{repartitionAndSortWithinPartitions}} uses Spark's {{ExternalSorter}}, which
> sorts all the (K, V) pairs and thus doesn't de-duplicate keys, so more data
> is spilled to disk.
>
> My understanding is that using {{repartitionAndSortWithinPartitions}} for
> Hive GROUP BYs is similar to what Hive-on-MR does, so disabling this config
> would provide a similar experience to HoMR. Furthermore, last I checked,
> {{groupByKey}} still can't spill within a row group.
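
The guess above about spill size can be illustrated with a toy model. The sketch below is plain Python, not Spark code, and it does not reproduce the internals of {{ExternalAppendOnlyMap}} or {{ExternalSorter}}. The function names and the "item count" metric are assumptions made for illustration only. It compares how many keys and values must be held when values are grouped under one copy of each key versus when every (K, V) pair is kept and sorted.

```python
from collections import defaultdict

# Toy model only: item counts stand in for spilled bytes. Real spill sizes
# depend on serialization, compression, and Spark internals not modeled here.


def grouped_footprint(pairs):
    """Items stored when values are grouped under a single copy of each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # One stored key per group, plus every value.
    return len(groups) + sum(len(values) for values in groups.values())


def sorted_pairs_footprint(pairs):
    """Items stored when every (key, value) pair is kept and sorted by key."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    # Each pair carries its own copy of the key, so keys are not de-duplicated.
    return 2 * len(ordered)


pairs = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5), ("c", 6)]
print(grouped_footprint(pairs))       # 3 keys + 6 values = 9
print(sorted_pairs_footprint(pairs))  # 6 keys + 6 values = 12
```

In this model the gap grows with the number of values per key, which is consistent with the observation that the sort-based path spills more data for GROUP BY workloads with many rows per key.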