[
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027197#comment-16027197
]
liyunzhang_intel commented on HIVE-16600:
-----------------------------------------
[~lirui]:
I found that the following example does not suit your algorithm:
{code}
set hive.execution.engine=spark;
set hive.auto.convert.join.noconditionaltask.size=20;
set hive.spark.dynamic.partition.pruning=true;
set hive.exec.reducers.bytes.per.reducer=256;
set hive.optimize.sampling.orderby=true;
explain select * from srcpart join (select * from srcpart_date_hour order by
hour) srcpart_date_hour1 on (srcpart.ds = srcpart_date_hour1.ds and srcpart.hr
= srcpart_date_hour1.hr) where srcpart_date_hour1.`date` = '2008-04-08' and
srcpart_date_hour1.hour = 11;
{code}
the logical plan
{code}
TS[0]-FIL[15]-SEL[1]-RS[2]-SEL[3]-RS[7]-JOIN[8]-SEL[10]-FS[11]
-SEL[17]-GBY[18]-SPARKPRUNINGSINK[19]
-SEL[20]-GBY[21]-SPARKPRUNINGSINK[22]
TS[4]-RS[5]-JOIN[8]
{code}
{noformat}
RS[2] is an order-by ReduceSink, so parallel order by should be enabled for
it. But because SEL[3] has three children (RS[7], SEL[17], SEL[20]) and
isMultiInsert is false, parallel order by is not enabled for RS[2].
{noformat}
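To make the condition concrete, here is a minimal standalone sketch of the check described above. This is not Hive's actual SetSparkReducerParallelism code; the Operator class and the needSetParallelism signature are simplified stand-ins for illustration only:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a Hive operator tree node (illustration only).
final class Operator {
    final String name;
    final List<Operator> children = new ArrayList<>();
    Operator(String name) { this.name = name; }
    Operator child(Operator c) { children.add(c); return this; }
}

public class NeedSetParallelismSketch {
    // Mirrors the condition described in the comment: as soon as a node
    // below the sorting RS has more than one child and the plan is not
    // flagged as a multi-insert, parallel order by is rejected.
    static boolean needSetParallelism(Operator op, boolean isMultiInsert) {
        if (op.children.size() > 1 && !isMultiInsert) {
            return false; // the case hit by SEL[3] above
        }
        for (Operator c : op.children) {
            if (!needSetParallelism(c, isMultiInsert)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Rebuild the branching point from the plan above:
        // SEL[3] -> RS[7], SEL[17], SEL[20]
        Operator sel3 = new Operator("SEL[3]")
            .child(new Operator("RS[7]"))
            .child(new Operator("SEL[17]"))
            .child(new Operator("SEL[20]"));
        // Three children and isMultiInsert == false, so RS[2] is not
        // parallelized even though two branches are only pruning sinks.
        System.out.println(needSetParallelism(sel3, false));
    }
}
```

Here two of SEL[3]'s three branches are SPARKPRUNINGSINK branches rather than genuine inserts, which is why the bare child-count test is too coarse for this plan.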
> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel
> order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
> Issue Type: Sub-task
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch,
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch,
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch,
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code}
> the parallelism of the Sort is 1 even when we enable parallel order by
> ("hive.optimize.sampling.orderby" is set to "true"). This is not
> reasonable because the parallelism should be calculated by
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> This is because SetSparkReducerParallelism#needSetParallelism returns false
> when [children size of
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
> is greater than 1.
> In this case, {{RS[2]}} has two children.
> the logical plan of the case
> {code}
> TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}
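One possible relaxation of the check, sketched below purely as an illustration of the idea in this issue's title (this is not the code in the attached patches, and the Op class is a simplified stand-in): treat a branching point under the sorting RS as safe when every branch terminates in a FileSink, i.e. when the fan-out is a pure multi-insert:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a Hive operator tree node (illustration only).
final class Op {
    final String name;
    final List<Op> children = new ArrayList<>();
    Op(String name) { this.name = name; }
    Op add(Op c) { children.add(c); return this; }
}

public class MultiInsertCheckSketch {
    // Hypothetical relaxed condition: a fan-out below the sorting RS is
    // harmless for parallel order by when every leaf is a FileSink,
    // i.e. each branch is just writing one of the multi-insert targets.
    static boolean allLeavesAreFileSinks(Op op) {
        if (op.children.isEmpty()) {
            return op.name.startsWith("FS");
        }
        for (Op c : op.children) {
            if (!allLeavesAreFileSinks(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The branching point from the multi_insert plan above:
        // SEL[3] -> SEL[4]-FS[5] and SEL[6]-FS[7]
        Op sel3 = new Op("SEL[3]")
            .add(new Op("SEL[4]").add(new Op("FS[5]")))
            .add(new Op("SEL[6]").add(new Op("FS[7]")));
        // Both branches end in FileSinks, so under this relaxed check
        // RS[2] could keep parallel order by enabled.
        System.out.println(allLeavesAreFileSinks(sel3));
    }
}
```

Under this relaxed check the multi_insert plan above qualifies, because both of SEL[3]'s branches end in FS[5] and FS[7].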
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)