[
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002113#comment-16002113
]
liyunzhang_intel commented on HIVE-11297:
-----------------------------------------
The query for the "multiple columns single source" case in
spark_dynamic_partition_pruning.q is
{code}
-- multiple columns single source
EXPLAIN select count(*) from srcpart
join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr)
where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}
Its explain plan on Spark is
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1
        Map 6
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: hr
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Reducer 2 (GROUP, 1)
#### A masked pattern was here ####
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: string)
                      sort order: ++
                      Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                      Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE
        Map 4
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col2 (type: string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: string), _col2 (type: string)
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
        Reducer 2
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string), _col1 (type: string)
                  1 _col0 (type: string), _col2 (type: string)
                Statistics: Num rows: 2200 Data size: 23372 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    sort order:
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col0 (type: bigint)
        Reducer 3
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}
Comparing Map 5 and Map 6, they are identical except for the Spark Partition
Pruning Sink Operator (one prunes on ds, the other on hr). For the same query
on Tez, there is only one map (Map 4), and it contains both {{Dynamic
Partitioning Event Operator}}s:
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 2000 Data size: 757248 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 2000 Data size: 736000 Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: string)
                      sort order: ++
                      Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                      Statistics: Num rows: 2000 Data size: 736000 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: llap
            LLAP IO: no inputs
        Map 4
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col2 (type: string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: string), _col2 (type: string)
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Dynamic Partitioning Event Operator
                            Target column: ds (string)
                            Target Input: srcpart
                            Partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            Target Vertex: Map 1
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Dynamic Partitioning Event Operator
                            Target column: hr (string)
                            Target Input: srcpart
                            Partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            Target Vertex: Map 1
            Execution mode: llap
            LLAP IO: no inputs
        Reducer 2
            Execution mode: llap
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string), _col1 (type: string)
                  1 _col0 (type: string), _col2 (type: string)
                Statistics: Num rows: 2200 Data size: 809600 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    sort order:
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col0 (type: bigint)
        Reducer 3
            Execution mode: llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}
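The Tez plan shows the target shape: a single scan of srcpart_date_hour whose shared TableScan/Filter/Select prefix fans out into two pruning branches. One way to get there is to merge sibling operator branches that share an identical prefix. Below is a minimal sketch of that idea; the {{Op}} and {{TreeMerger}} classes are hypothetical, and real Hive operators would have to be compared by their full descriptors rather than just a name:
{code}
// Hypothetical sketch, NOT Hive's actual Operator API: collapse sibling
// branches that share an identical prefix so the table is scanned once.
import java.util.*;

class Op {
    final String name;                      // e.g. "TableScan srcpart_date_hour"
    final List<Op> children = new ArrayList<>();
    Op(String name) { this.name = name; }
    Op child(Op c) { children.add(c); return c; }
}

class TreeMerger {
    /** Recursively merge children with the same name, keeping all leaves. */
    static void merge(Op node) {
        Map<String, Op> byName = new LinkedHashMap<>();
        List<Op> merged = new ArrayList<>();
        for (Op c : node.children) {
            Op existing = byName.get(c.name);
            if (existing == null) {          // first branch with this prefix
                byName.put(c.name, c);
                merged.add(c);
            } else {                         // duplicate prefix: adopt its children
                existing.children.addAll(c.children);
            }
        }
        node.children.clear();
        node.children.addAll(merged);
        for (Op c : node.children) merge(c); // continue down the merged tree
    }
}
{code}
Applied to the Spark plan above, the two identical TableScan/Filter/Select prefixes of Map 5 and Map 6 would collapse into one, leaving the two Group By branches with their pruning sinks hanging off a single Select, which is what this issue proposes.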
> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
> Key: HIVE-11297
> URL: https://issues.apache.org/jira/browse/HIVE-11297
> Project: Hive
> Issue Type: Bug
> Affects Versions: spark-branch
> Reporter: Chao Sun
> Assignee: liyunzhang_intel
>
> Currently, for dynamic partition pruning in Spark, if a small table generates
> partition info for more than one partition column, multiple operator trees
> are created, which all start from the same table scan op but have different
> spark partition pruning sinks.
> As an optimization, we can combine these op trees so that the table scan is
> done only once.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)