[
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liyunzhang_intel updated HIVE-11297:
------------------------------------
Attachment: HIVE-11297.3.patch
[~csun]: I updated SplitOpTreeForDPP to split the trees the way you suggested last time. The explain plan for the following query changes after this jira:
{code}
set hive.execution.engine=spark;
set hive.auto.convert.join.noconditionaltask.size=20;
set hive.spark.dynamic.partition.pruning=true;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr)
where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}
Before:
{code}
STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1
        Map 6
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: hr
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Reducer 2 (GROUP, 1)
{code}
Now:
{code}
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: hr
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Reducer 2 (GROUP, 1)
{code}
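The structural change shown above (one shared TableScan whose Select Operator fans out into both pruning-sink branches, instead of two duplicate scans) can be sketched with a toy operator-tree model. The names here ({{Op}}, {{combine_trees}}) are illustrative only and are not Hive's actual Operator API:
{code}
# Toy model of the op-tree merge: before the patch, each partition column
# (ds, hr) got its own tree rooted at a duplicate TableScan of the small
# table; after, the branches are re-parented under a single shared scan.

class Op:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def combine_trees(roots):
    """Merge operator trees whose roots scan the same table."""
    assert len({r.name for r in roots}) == 1, "roots must scan the same table"
    merged = Op(roots[0].name)          # one shared TableScan
    for r in roots:
        merged.children.extend(r.children)  # re-parent each branch under it
    return merged

# Before: two scans of srcpart_date_hour, one per pruning sink.
ds_tree = Op("TS[srcpart_date_hour]", [Op("SEL->GBY->SPPS[ds]")])
hr_tree = Op("TS[srcpart_date_hour]", [Op("SEL->GBY->SPPS[hr]")])

# After: one scan feeding both branches, like the single Map 5 above.
merged = combine_trees([ds_tree, hr_tree])
print(len(merged.children))  # 2 branches, but only 1 table scan
{code}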
But when I use the following command to regenerate spark_dynamic_partition_pruning.q.out:
{code}
mvn clean test -Dtest=TestSparkCliDriver -Dtest.output.overwrite=true -Dqfile=spark_dynamic_partition_pruning.q
{code}
I found that it changed not only the explain plan above but also others. The changes include:
1. Comments like "SORT_QUERY_RESULTS" are deleted. How can I keep the original format?
{code}
-PREHOOK: query: -- SORT_QUERY_RESULTS
-
-select distinct ds from srcpart
+PREHOOK: query: select distinct ds from srcpart
+POSTHOOK: query: EXPLAIN select count(*) from srcpart join srcpart_date on (srcpart.ds = srcpart_date.ds) join srcpart_hour on (srcpart.hr = srcpart_hour.hr)
{code}
2. Some changes are not caused by HIVE-11297.3.patch, e.g. a Filter Operator is added to the explain plan:
{code}
POSTHOOK: type: QUERY
STAGE DEPENDENCIES:
  Stage-2 is a root stage
@@ -3168,16 +3141,19 @@ STAGE PLANS:
                 mode: mergepartial
                 outputColumnNames: _col0
                 Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
-                Group By Operator
-                  keys: _col0 (type: string)
-                  mode: hash
-                  outputColumnNames: _col0
-                  Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: NONE
-                  Reduce Output Operator
-                    key expressions: _col0 (type: string)
-                    sort order: +
-                    Map-reduce partition columns: _col0 (type: string)
+                Filter Operator
+                  predicate: _col0 is not null (type: boolean)
+                  Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
+                  Group By Operator
+                    keys: _col0 (type: string)
+                    mode: hash
+                    outputColumnNames: _col0
                     Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: NONE
+                  Reduce Output Operator
+                    key expressions: _col0 (type: string)
+                    sort order: +
+                    Map-reduce partition columns: _col0 (type: string)
+                    Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: NONE
{code}
How should we handle the changes in spark_dynamic_partition_pruning.q.out?
1. Just copy the changes caused by HIVE-11297.3.patch.
2. Use "-Dtest.output.overwrite=true" to generate a new *.q.out.
Which do you prefer?
> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
> Key: HIVE-11297
> URL: https://issues.apache.org/jira/browse/HIVE-11297
> Project: Hive
> Issue Type: Bug
> Affects Versions: spark-branch
> Reporter: Chao Sun
> Assignee: liyunzhang_intel
> Attachments: HIVE-11297.1.patch, HIVE-11297.2.patch,
> HIVE-11297.3.patch
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates
> partition info for more than one partition columns, multiple operator trees
> are created, which all start from the same table scan op, but have different
> spark partition pruning sinks.
> As an optimization, we can combine these op trees so that we don't have to
> scan the table multiple times.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)