[
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liyunzhang_intel updated HIVE-16600:
------------------------------------
Attachment: HIVE-16600.6.patch
[~lirui]:
Please help review HIVE-16600.5.patch.
bq.Please add more tests. E.g. add limit to the sub-query, add order by to
inserts, etc.
I added the following test. Is this what you mean by "add order by to inserts"?
{code}
FROM (select key,value from src order by key) a
INSERT OVERWRITE TABLE e1
SELECT key, value
INSERT OVERWRITE TABLE e2
SELECT key order by key limit 10;
{code}
If so, the explain output below is interesting: the e1 branch appears to have
been removed from the plan. When you execute the query, e1 ends up with no
rows.
{code}
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-3 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
Edges:
Reducer 2 <- Map 1 (SORT, 1)
#### A masked pattern was here ####
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: src
Statistics: Num rows: 500 Data size: 5312 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: key (type: string), value (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 500 Data size: 5312 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Statistics: Num rows: 500 Data size: 5312 Basic stats:
COMPLETE Column stats: NONE
TopN Hash Memory Usage: 0.1
Reducer 2
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE
Column stats: NONE
Limit
Number of rows: 10
Statistics: Num rows: 10 Data size: 100 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 10 Data size: 100 Basic stats:
COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.e2
Stage: Stage-1
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.e2
Stage: Stage-3
Stats-Aggr Operator
PREHOOK: query: FROM (select key,value from src order by key) a
INSERT OVERWRITE TABLE e1
SELECT key, value
INSERT OVERWRITE TABLE e2
{code}
> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel
> order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
> Issue Type: Sub-task
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch,
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch,
> HIVE-16600.6.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code}
> The parallelism of the Sort is 1 even though parallel order by is enabled
> ("hive.optimize.sampling.orderby" is set to "true"). This is not
> reasonable, because the parallelism should be calculated by
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170].
> This happens because SetSparkReducerParallelism#needSetParallelism returns false
> when the [children size of
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
> is greater than 1.
> In this case, the children size of {{RS[2]}} is two.
> The logical plan of the case:
> {code}
> TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}
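
The children-count condition described above can be illustrated with a small, self-contained toy model. This is only a sketch, not Hive's code: the names Op, buildMultiInsertPlan, buildSingleInsertPlan and needSetParallelismSketch are hypothetical, and the method models just the single condition quoted in the description (more than one child under the RS), not whatever other checks the real needSetParallelism performs. It also assumes, following the description, that both insert branches hang directly off RS[2].

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every name here is hypothetical and none of it is Hive API.
class ParallelismSketch {

    /** Minimal operator node: a label plus its child operators. */
    static final class Op {
        final String name;
        final List<Op> children = new ArrayList<>();

        Op(String name) {
            this.name = name;
        }

        /** Hangs a chain of operators below this node and returns this node. */
        Op branch(String... names) {
            Op cur = this;
            for (String n : names) {
                Op next = new Op(n);
                cur.children.add(next);
                cur = next;
            }
            return this;
        }
    }

    /** Plan below RS[2] for the multi-insert case, assuming two branches off RS[2]. */
    static Op buildMultiInsertPlan() {
        Op rs = new Op("RS[2]");
        rs.branch("SEL[3]", "SEL[4]", "FS[5]");
        rs.branch("SEL[6]", "FS[7]");
        return rs;
    }

    /** A made-up single-insert plan for comparison: one chain below the RS. */
    static Op buildSingleInsertPlan() {
        return new Op("RS[2]").branch("SEL[3]", "FS[4]");
    }

    /** Stand-in for the described check: refuse when the RS has more than one child. */
    static boolean needSetParallelismSketch(Op reduceSink) {
        return reduceSink.children.size() <= 1;
    }

    public static void main(String[] args) {
        System.out.println("single insert: "
                + needSetParallelismSketch(buildSingleInsertPlan()));
        System.out.println("multi insert:  "
                + needSetParallelismSketch(buildMultiInsertPlan()));
    }
}
```

In this model the single-insert plan passes the check and the multi-insert plan does not, which matches the reported symptom: the sort keeps a parallelism of 1 because the reducer-count estimate is never applied to an RS with two children.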