Re: Review Request 55776: Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Xuefu Zhang Fri, 20 Jan 2017 11:02:57 -0800


> On Jan. 20, 2017, 6:26 p.m., Chao Sun wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java, line 31
> > <https://reviews.apache.org/r/55776/diff/1/?file=1610799#file1610799line31>
> >
> >     Is it possible that `numPartitions` equals 0?


No. A partition count of zero would mean there are no partitions, in which
case we would not even get here. Nevertheless, if it's set to 0, we take 1
instead.
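
As a rough illustration of that fallback (a sketch only, not the code in the
patch; the class and method names are hypothetical, and how values other than
0 are handled is not stated in this thread, so they are passed through
unchanged here):

```java
// Hypothetical helper for illustration; not taken from the patch under review.
public final class PartitionCounts {
    private PartitionCounts() {}

    // A partition count of exactly 0 falls back to 1, as described in the reply.
    // Assumption: any other value is returned unchanged.
    public static int effective(int numPartitions) {
        return numPartitions == 0 ? 1 : numPartitions;
    }
}
```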


> On Jan. 20, 2017, 6:26 p.m., Chao Sun wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java, line 34
> > <https://reviews.apache.org/r/55776/diff/1/?file=1610799#file1610799line34>
> >
> >     I wonder whether this also has some extra cost compared to the 
> > original `groupByKey`, since it needs to sort all records by key in a 
> > single partition, right?

Well, we don't know which one performs better yet.
repartitionAndSortWithinPartitions() brings extra sorting, but it eliminates
the grouping done in groupByKey(). Also, groupByKey() has unbounded memory
usage, which is the problem we are trying to solve, as described in the JIRA
description. We will follow up with performance testing, and may provide an
option to use either groupBy(), which might perform better but has unbounded
memory usage, or the new way, where memory usage is bounded.
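
To make the memory trade-off concrete, here is a minimal plain-Java sketch
with no Spark dependency. It is not the code in the patch, and all names in it
are hypothetical. It assumes a per-key sum as the reduce step: groupAll()
mimics groupByKey()-style grouping by buffering every value of a key, while
sumSorted() assumes the records already arrive sorted by key (as they would
after sorting within a partition) and keeps only one running total.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; class and method names are not from the patch.
public final class SortedGroupingSketch {
    private SortedGroupingSketch() {}

    // groupByKey-like: all values of a key are buffered before any is consumed,
    // so memory grows with the size of the largest group.
    public static Map<String, List<Integer>> groupAll(
            Iterator<Map.Entry<String, Integer>> records) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        while (records.hasNext()) {
            Map.Entry<String, Integer> r = records.next();
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return groups;
    }

    // Sorted-input alternative: a change of key marks the end of a group, so
    // only the current key and its running sum are held while scanning.
    // (The returned map holds one result per key; it is output, not a buffer.)
    public static Map<String, Integer> sumSorted(
            Iterator<Map.Entry<String, Integer>> sortedRecords) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        String currentKey = null;
        int runningSum = 0;
        boolean groupOpen = false;
        while (sortedRecords.hasNext()) {
            Map.Entry<String, Integer> r = sortedRecords.next();
            if (groupOpen && !r.getKey().equals(currentKey)) {
                sums.put(currentKey, runningSum); // previous group is complete
                runningSum = 0;
            }
            currentKey = r.getKey();
            runningSum += r.getValue();
            groupOpen = true;
        }
        if (groupOpen) {
            sums.put(currentKey, runningSum); // flush the last group
        }
        return sums;
    }
}
```

The sorted variant pays for the sort up front, which is the extra cost raised
in the question; whether that is cheaper overall is exactly what the planned
performance testing would need to show.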


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55776/#review162449
-----------------------------------------------------------


On Jan. 20, 2017, 6:07 p.m., Xuefu Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55776/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2017, 6:07 p.m.)
> 
> 
> Review request for hive, Chao Sun and Rui Li.
> 
> 
> Bugs: HIVE-15580
>     https://issues.apache.org/jira/browse/HIVE-15580
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See JIRA description.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/GroupByShuffler.java e128dd2
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java eeb4443
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d57cac4
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 3d56876
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SortByShuffler.java 997ab7e
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 66ffe5d
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 0d31e5f
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkShuffler.java 40e251f
>   ql/src/test/queries/clientpositive/union_top_level.q d93fe38
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out b48ab83
>   ql/src/test/results/clientpositive/spark/lateral_view_explode2.q.out 65a6e3e
>   ql/src/test/results/clientpositive/spark/union_remove_25.q.out 9fec1d4
>   ql/src/test/results/clientpositive/spark/union_top_level.q.out c9cb5d3
>   ql/src/test/results/clientpositive/spark/vector_outer_join5.q.out 9e1742f
> 
> Diff: https://reviews.apache.org/r/55776/diff/
> 
> 
> Testing
> -------
> 
> All tests passed
> 
> 
> Thanks,
> 
> Xuefu Zhang
> 
>
