[
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060336#comment-16060336
]
liyunzhang_intel commented on HIVE-11297:
-----------------------------------------
[~csun]: regarding the questions you raised on RB, the two queries produce
different explain plans.
Explain for query 1 (please use the attached hive-site.xml to verify; without
the configuration in hive-site.xml, I cannot reproduce the following explain):
{code}
set hive.execution.engine=spark;
set hive.spark.dynamic.partition.pruning=true;
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.strict.checks.cartesian.product=false;
explain select count(*) from srcpart join srcpart_date on (srcpart.ds =
srcpart_date.ds) join srcpart_hour on (srcpart.hr = srcpart_hour.hr)
where srcpart_date.`date` = '2008-04-08' and srcpart_hour.hour = 11 and
srcpart.hr = 11
{code}
Previous explain:
{code}
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:2
Vertices:
Map 7
Map Operator Tree:
TableScan
alias: srcpart_date
filterExpr: ((date = '2008-04-08') and ds is not null) (type:
boolean)
Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((date = '2008-04-08') and ds is not null)
(type: boolean)
Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Spark Partition Pruning Sink Operator
partition key expr: ds
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
target column name: ds
target work: Map 1
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 5 (PARTITION-LEVEL
SORT, 2)
Reducer 3 <- Map 6 (PARTITION-LEVEL SORT, 2), Reducer 2
(PARTITION-LEVEL SORT, 2)
Reducer 4 <- Reducer 3 (GROUP, 1)
DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:1
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: srcpart
Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL
Column stats: NONE
Select Operator
expressions: ds (type: string), hr (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 11624 Basic stats:
PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 11624 Basic stats:
PARTIAL Column stats: NONE
value expressions: _col1 (type: string)
Map 5
Map Operator Tree:
TableScan
alias: srcpart_date
filterExpr: ((date = '2008-04-08') and ds is not null) (type:
boolean)
Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((date = '2008-04-08') and ds is not null)
(type: boolean)
Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Map 6
Map Operator Tree:
TableScan
alias: srcpart_hour
filterExpr: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: hr (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Reducer 2
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: string)
1 _col0 (type: string)
outputColumnNames: _col1
Statistics: Num rows: 1 Data size: 12786 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
key expressions: _col1 (type: string)
sort order: +
Map-reduce partition columns: _col1 (type: string)
Statistics: Num rows: 1 Data size: 12786 Basic stats:
COMPLETE Column stats: NONE
Reducer 3
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col1 (type: string)
1 _col0 (type: string)
Statistics: Num rows: 1 Data size: 14064 Basic stats: COMPLETE
Column stats: NONE
Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
value expressions: _col0 (type: bigint)
Reducer 4
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
Current explain:
{code}
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:2
Vertices:
Map 7
Map Operator Tree:
TableScan
alias: srcpart_date
filterExpr: ((date = '2008-04-08') and ds is not null) (type:
boolean)
Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((date = '2008-04-08') and ds is not null)
(type: boolean)
Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Spark Partition Pruning Sink Operator
partition key expr: ds
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
target column name: ds
target work: Map 1
Map 8
Map Operator Tree:
TableScan
alias: srcpart_hour
filterExpr: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: hr (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Spark Partition Pruning Sink Operator
partition key expr: hr
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
target column name: hr
target work: Map 1
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 5 (PARTITION-LEVEL
SORT, 2)
Reducer 3 <- Map 6 (PARTITION-LEVEL SORT, 2), Reducer 2
(PARTITION-LEVEL SORT, 2)
Reducer 4 <- Reducer 3 (GROUP, 1)
DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:1
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: srcpart
Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL
Column stats: NONE
Select Operator
expressions: ds (type: string), hr (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 11624 Basic stats:
PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 11624 Basic stats:
PARTIAL Column stats: NONE
value expressions: _col1 (type: string)
Map 5
Map Operator Tree:
TableScan
alias: srcpart_date
filterExpr: ((date = '2008-04-08') and ds is not null) (type:
boolean)
Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((date = '2008-04-08') and ds is not null)
(type: boolean)
Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Map 6
Map Operator Tree:
TableScan
alias: srcpart_hour
filterExpr: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: hr (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Reducer 2
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: string)
1 _col0 (type: string)
outputColumnNames: _col1
Statistics: Num rows: 1 Data size: 12786 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
key expressions: _col1 (type: string)
sort order: +
Map-reduce partition columns: _col1 (type: string)
Statistics: Num rows: 1 Data size: 12786 Basic stats:
COMPLETE Column stats: NONE
Reducer 3
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col1 (type: string)
1 _col0 (type: string)
Statistics: Num rows: 1 Data size: 14064 Basic stats: COMPLETE
Column stats: NONE
Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
value expressions: _col0 (type: bigint)
Reducer 4
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
The difference between the previous and the current plan is the extra Map 8.
After querying the history of spark_dynamic_partition_pruning.q.out, I found
that in {{commit 42216997f7fcff1853524b03e3961ec5c21f3fd7}} Map 8 exists, and
after {{commit 677e5d20109e31203129ef5090c8989e9bb7c366}} Map 8 is removed.
The reason Map 8 is removed in {{commit
677e5d20109e31203129ef5090c8989e9bb7c366}} is that the
[filterDesc|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/DynamicPartitionPruningOptimization.java#L128
] changed: in {{commit 677e5d20109e31203129ef5090c8989e9bb7c366}} the
filterDesc is {noformat}((ds) IN (RS[10]) and true) (type: boolean){noformat}
while in the latest version its value is {noformat}Filter: ((ds) IN (RS[5]) and (hr) IN (RS[9]))
(type: boolean){noformat}. That is why there is no Map 8 in {{commit
677e5d20109e31203129ef5090c8989e9bb7c366}}. In fact, Map 8 should exist: we
should build a DPP operator tree for both {{srcpart_date}} and
{{srcpart_hour}}, since {{srcpart}} has two partition columns (srcpart.ds and
srcpart.hr). However, I will not investigate further why Map 8 is lost in
{{commit 677e5d20109e31203129ef5090c8989e9bb7c366}}.
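The combining this issue is about, merging operator trees that start at the same small-table scan but end in different Spark partition-pruning sinks, can be sketched with a simplified, hypothetical model. This is plain Python standing in for Hive operators, not Hive's actual optimizer API; the operator names and the {{srcpart_hour}} example below are illustrative only:

```python
# Hypothetical sketch of HIVE-11297's optimization: several op trees that
# begin at the same small-table TableScan but end in different Spark
# partition-pruning sinks are merged so the table is scanned only once.
# (Plain Python stand-ins; none of this is Hive's real operator API.)
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    children: list = field(default_factory=list)

def combine_op_trees(trees):
    """Group op trees by their root table scan and hang every tree's
    branches under a single shared scan per table."""
    by_scan = defaultdict(list)
    for root in trees:
        by_scan[root.name].extend(root.children)
    return [Op(scan, branches) for scan, branches in by_scan.items()]

# Two op trees, both scanning the same small table, each feeding its own
# pruning sink for a different target partition column.
t1 = Op("TS[srcpart_hour]", [Op("SEL->GBY->SparkPruningSink(ds)")])
t2 = Op("TS[srcpart_hour]", [Op("SEL->GBY->SparkPruningSink(hr)")])

combined = combine_op_trees([t1, t2])
print(len(combined))              # one shared table scan remains
print(len(combined[0].children))  # both pruning branches are kept
```

Before combining there are two trees, hence two scans of the small table; after combining there is one scan with two pruning branches, which is exactly the saving the issue summary describes.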
> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
> Key: HIVE-11297
> URL: https://issues.apache.org/jira/browse/HIVE-11297
> Project: Hive
> Issue Type: Bug
> Affects Versions: spark-branch
> Reporter: Chao Sun
> Assignee: liyunzhang_intel
> Attachments: HIVE-11297.1.patch, HIVE-11297.2.patch,
> HIVE-11297.3.patch, HIVE-11297.4.patch, HIVE-11297.5.patch,
> HIVE-11297.6.patch, HIVE-11297.7.patch
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates
> partition info for more than one partition column, multiple operator trees
> are created, which all start from the same table scan op but have different
> Spark partition pruning sinks.
> As an optimization, we can combine these op trees so we don't have to scan
> the table multiple times.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)