Xuefu Zhang commented on HIVE-20141:

[~stakiar] Based on our benchmarking at Uber, groupByKey does offer better 
performance in certain cases, specifically, in aggregation without ordering. 
The difference is about 10%. I understand the limitation with group-by, which 
is why this configuration exists. I don't feel it's compelling enough to change 
the default behavior from either the perf or the backward-compatibility point 
of view. The configuration has existed for a few releases already, and most 
users don't have to bother with it anyway.

The best approach is to enhance groupByKey or provide a new shuffle mode that 
overcomes the memory limitation while maintaining the benefit of not enforcing 
ordering in keys. I saw you created a JIRA for that; looking forward to 
progress on it.

> Turn hive.spark.use.groupby.shuffle off by default
> --------------------------------------------------
>                 Key: HIVE-20141
>                 URL: https://issues.apache.org/jira/browse/HIVE-20141
>             Project: Hive
>          Issue Type: Task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
> [~xuefuz] any thoughts on this? I think it would provide better out of the 
> box behavior for Hive-on-Spark users, especially for users who are migrating 
> from Hive-on-MR to HoS. Wondering what your experience with this config has 
> been?
> I've done a bunch of performance profiling with this config turned on vs. 
> off, and for TPC-DS queries it doesn't make a significant difference. The 
> main difference I can see is that when a Spark stage has to spill to disk, 
> {{repartitionAndSortWithinPartitions}} spills more data to disk than 
> {{groupByKey}} - my guess is that this happens because {{groupByKey}} stores 
> everything in Spark's {{ExternalAppendOnlyMap}} (which only stores a single 
> copy of the key for potentially multiple values) whereas 
> {{repartitionAndSortWithinPartitions}} uses Spark's {{ExternalSorter}} which 
> sorts all the K, V pairs (and thus doesn't de-duplicate keys, which results 
> in more data being spilled to disk).
> My understanding is that using {{repartitionAndSortWithinPartitions}} for 
> Hive GROUP BYs is similar to what Hive-on-MR does. So disabling this config 
> would provide a similar experience to HoMR. Furthermore, last I checked, 
> {{groupByKey}} still can't spill within a row group.
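
The spill-size guess in the description above can be illustrated with a toy 
model in plain Python. This is only a sketch, not Spark code: the function 
names and the "key copies" counter are hypothetical, and the dict and sorted 
list merely stand in for the map-style and sort-style buffers mentioned above. 
It assumes that the number of key copies held in the buffer is a rough proxy 
for how much data would be spilled.

```python
from itertools import groupby
from operator import itemgetter


def group_with_map(pairs):
    """Map-style grouping: each distinct key is stored once (toy model)."""
    buckets = {}
    for key, value in pairs:
        buckets.setdefault(key, []).append(value)
    # One key copy per distinct key.
    return buckets, len(buckets)


def group_with_sort(pairs):
    """Sort-style grouping: every (key, value) record is buffered and sorted."""
    buffered = sorted(pairs, key=itemgetter(0))  # stable sort on the key only
    grouped = {
        key: [value for _, value in records]
        for key, records in groupby(buffered, key=itemgetter(0))
    }
    # One key copy per buffered record.
    return grouped, len(buffered)


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]
map_groups, map_key_copies = group_with_map(pairs)
sort_groups, sort_key_copies = group_with_sort(pairs)
```

Both functions produce the same groups for this input, but the map-style 
buffer holds 3 key copies while the sort-style buffer holds 5, one per record. 
The sort-style result also comes out in key order, which the map-style version 
does not guarantee in general.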
