Sahil Takiar commented on HIVE-20141:

Thanks for the input [~xuefuz]. Yes, I opened HIVE-20108 to investigate 
alternatives to {{groupByKey}}. Are there certain types of Hive queries that 
cause {{groupByKey}} to hit an OOM? From what I have seen, as long as 
{{hive.map.aggr}} is set to true (which it is by default), then map-side 
aggregation significantly reduces the amount of data that needs to be 
processed by the reducers.
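
To make the map-side aggregation point concrete, here is a minimal sketch in plain Python. It is a toy model and not Hive or Spark code: the function names and the sample rows are hypothetical, and it assumes a simple SUM ... GROUP BY. It only shows why partial aggregation before the shuffle shrinks what the reducers receive.

```python
from collections import defaultdict

def shuffle_without_map_aggr(rows):
    # Without map-side aggregation, every input row is shuffled unchanged.
    return list(rows)

def shuffle_with_map_aggr(rows):
    # With map-side aggregation, the mapper emits one partial sum per
    # distinct key, so at most one record per key is shuffled.
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return sorted(partial.items())

# Hypothetical mapper input: (group key, value to sum)
rows = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

raw = shuffle_without_map_aggr(rows)
combined = shuffle_with_map_aggr(rows)

print(len(raw), "records shuffled without map-side aggregation")
print(len(combined), "records shuffled with map-side aggregation")
```

The fewer distinct keys there are relative to the number of rows, the larger the reduction. With nearly unique keys the two record counts converge, and that is the kind of query where reducer memory pressure would be most likely.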

> Turn hive.spark.use.groupby.shuffle off by default
> --------------------------------------------------
>                 Key: HIVE-20141
>                 URL: https://issues.apache.org/jira/browse/HIVE-20141
>             Project: Hive
>          Issue Type: Task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> [~xuefuz] any thoughts on this? I think it would provide better 
> out-of-the-box behavior for Hive-on-Spark users, especially for users who 
> are migrating from Hive-on-MR to HoS. What has your experience with this 
> config been?
>
> I've done a bunch of performance profiling with this config turned on vs. 
> off, and for TPC-DS queries it doesn't make a significant difference. The 
> main difference I can see is that when a Spark stage has to spill to disk, 
> {{repartitionAndSortWithinPartitions}} spills more data to disk than 
> {{groupByKey}} - my guess is that this happens because {{groupByKey}} stores 
> everything in Spark's {{ExternalAppendOnlyMap}} (which only stores a single 
> copy of the key for potentially multiple values) whereas 
> {{repartitionAndSortWithinPartitions}} uses Spark's {{ExternalSorter}} which 
> sorts all the K, V pairs (and thus doesn't de-duplicate keys, which results 
> in more data being spilled to disk).
>
> My understanding is that using {{repartitionAndSortWithinPartitions}} for 
> Hive GROUP BYs is similar to what Hive-on-MR does. So disabling this config 
> would provide a similar experience to HoMR. Furthermore, last I checked, 
> {{groupByKey}} still can't spill within a row group.
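
The spill-size guess in the quoted description can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: it does not use or reproduce Spark's {{ExternalAppendOnlyMap}} or {{ExternalSorter}}, and the function names and sample pairs are hypothetical. It simply counts how many copies of each key a map-like layout holds compared with a layout that sorts every (K, V) record.

```python
def keys_stored_grouped(pairs):
    # Map-like layout: each distinct key is stored once and its values
    # are appended to a single list.
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return len(groups)

def keys_stored_sorted(pairs):
    # Sorter-like layout: every (key, value) record is kept and sorted,
    # so each record carries its own copy of the key.
    return len(sorted(pairs))

# Hypothetical shuffle input with a repeated key
pairs = [("k1", 10), ("k2", 20), ("k1", 30), ("k1", 40)]

grouped_keys = keys_stored_grouped(pairs)
sorted_keys = keys_stored_sorted(pairs)

print(grouped_keys, "key copies in the grouped layout")
print(sorted_keys, "key copies in the sorted layout")
```

If the explanation in the description is right, the gap between the two counts grows with the number of values per key, which would match the observation that the difference only shows up when a stage has to spill.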
