[
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060336#comment-16060336
]
liyunzhang_intel commented on HIVE-11297:
-----------------------------------------
[~csun]: regarding the questions you raised on RB, the two queries produce
different explain plans.
Explain for query 1 (please use the attached hive-site.xml to verify; without
the configuration in hive-site.xml, I cannot reproduce the following explain):
{code}
set hive.execution.engine=spark;
set hive.spark.dynamic.partition.pruning=true;
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.strict.checks.cartesian.product=false;
explain select count(*) from srcpart join srcpart_date on (srcpart.ds =
srcpart_date.ds) join srcpart_hour on (srcpart.hr = srcpart_hour.hr)
where srcpart_date.`date` = '2008-04-08' and srcpart_hour.hour = 11 and
srcpart.hr = 11
{code}
Previous explain:
{code}
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:2
Vertices:
Map 7
Map Operator Tree:
TableScan
alias: srcpart_date
filterExpr: ((date = '2008-04-08') and ds is not null) (type:
boolean)
Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((date = '2008-04-08') and ds is not null)
(type: boolean)
Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Spark Partition Pruning Sink Operator
partition key expr: ds
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
target column name: ds
target work: Map 1
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 5 (PARTITION-LEVEL
SORT, 2)
Reducer 3 <- Map 6 (PARTITION-LEVEL SORT, 2), Reducer 2
(PARTITION-LEVEL SORT, 2)
Reducer 4 <- Reducer 3 (GROUP, 1)
DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:1
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: srcpart
Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL
Column stats: NONE
Select Operator
expressions: ds (type: string), hr (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 11624 Basic stats:
PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 11624 Basic stats:
PARTIAL Column stats: NONE
value expressions: _col1 (type: string)
Map 5
Map Operator Tree:
TableScan
alias: srcpart_date
filterExpr: ((date = '2008-04-08') and ds is not null) (type:
boolean)
Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((date = '2008-04-08') and ds is not null)
(type: boolean)
Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Map 6
Map Operator Tree:
TableScan
alias: srcpart_hour
filterExpr: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: hr (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Reducer 2
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: string)
1 _col0 (type: string)
outputColumnNames: _col1
Statistics: Num rows: 1 Data size: 12786 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
key expressions: _col1 (type: string)
sort order: +
Map-reduce partition columns: _col1 (type: string)
Statistics: Num rows: 1 Data size: 12786 Basic stats:
COMPLETE Column stats: NONE
Reducer 3
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col1 (type: string)
1 _col0 (type: string)
Statistics: Num rows: 1 Data size: 14064 Basic stats: COMPLETE
Column stats: NONE
Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
value expressions: _col0 (type: bigint)
Reducer 4
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
Current explain:
{code}
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:2
Vertices:
Map 7
Map Operator Tree:
TableScan
alias: srcpart_date
filterExpr: ((date = '2008-04-08') and ds is not null) (type:
boolean)
Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((date = '2008-04-08') and ds is not null)
(type: boolean)
Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Spark Partition Pruning Sink Operator
partition key expr: ds
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
target column name: ds
target work: Map 1
Map 8
Map Operator Tree:
TableScan
alias: srcpart_hour
filterExpr: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: hr (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Spark Partition Pruning Sink Operator
partition key expr: hr
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
target column name: hr
target work: Map 1
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 5 (PARTITION-LEVEL
SORT, 2)
Reducer 3 <- Map 6 (PARTITION-LEVEL SORT, 2), Reducer 2
(PARTITION-LEVEL SORT, 2)
Reducer 4 <- Reducer 3 (GROUP, 1)
DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:1
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: srcpart
Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL
Column stats: NONE
Select Operator
expressions: ds (type: string), hr (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 11624 Basic stats:
PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 11624 Basic stats:
PARTIAL Column stats: NONE
value expressions: _col1 (type: string)
Map 5
Map Operator Tree:
TableScan
alias: srcpart_date
filterExpr: ((date = '2008-04-08') and ds is not null) (type:
boolean)
Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((date = '2008-04-08') and ds is not null)
(type: boolean)
Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: ds (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 21 Basic stats:
COMPLETE Column stats: NONE
Map 6
Map Operator Tree:
TableScan
alias: srcpart_hour
filterExpr: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr)
= 11.0)) (type: boolean)
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: hr (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 5 Basic stats:
COMPLETE Column stats: NONE
Reducer 2
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: string)
1 _col0 (type: string)
outputColumnNames: _col1
Statistics: Num rows: 1 Data size: 12786 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
key expressions: _col1 (type: string)
sort order: +
Map-reduce partition columns: _col1 (type: string)
Statistics: Num rows: 1 Data size: 12786 Basic stats:
COMPLETE Column stats: NONE
Reducer 3
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col1 (type: string)
1 _col0 (type: string)
Statistics: Num rows: 1 Data size: 14064 Basic stats: COMPLETE
Column stats: NONE
Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
value expressions: _col0 (type: bigint)
Reducer 4
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
The difference between the previous and the current plan is the extra Map 8.
After querying the history of spark_dynamic_partition_pruning.q.out, I found
that in {{commit 42216997f7fcff1853524b03e3961ec5c21f3fd7}} Map 8 exists, and
after {{commit 677e5d20109e31203129ef5090c8989e9bb7c366}} Map 8 is removed.
The reason Map 8 is removed in {{commit
677e5d20109e31203129ef5090c8989e9bb7c366}} is that the
[filterDesc|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/DynamicPartitionPruningOptimization.java#L128
] changed: in {{commit 677e5d20109e31203129ef5090c8989e9bb7c366}} the
filterDesc is {noformat}((ds) IN (RS[10]) and true) (type: boolean){noformat}
while in the latest version its value is {noformat}Filter: ((ds) IN (RS[5]) and (hr) IN (RS[9]))
(type: boolean){noformat}. That is why there is no Map 8 in {{commit
677e5d20109e31203129ef5090c8989e9bb7c366}}. In fact, Map 8 should exist: we
should build a DPP operator tree for both {{srcpart_date}} and
{{srcpart_hour}}, since {{srcpart}} has two partition columns (srcpart.ds and
srcpart.hr). However, I will not investigate further why Map 8 is lost in
{{commit 677e5d20109e31203129ef5090c8989e9bb7c366}}.
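The combining this issue is about, merging operator trees that start at the same small-table scan but end in different Spark partition-pruning sinks, can be sketched with a simplified, hypothetical model. This is plain Python standing in for Hive operators, not Hive's actual optimizer API; the operator names and the {{srcpart_hour}} example below are illustrative only:

```python
# Hypothetical sketch of HIVE-11297's optimization: several op trees that
# begin at the same small-table TableScan but end in different Spark
# partition-pruning sinks are merged so the table is scanned only once.
# (Plain Python stand-ins; none of this is Hive's real operator API.)
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    children: list = field(default_factory=list)

def combine_op_trees(trees):
    """Group op trees by their root table scan and hang every tree's
    branches under a single shared scan per table."""
    by_scan = defaultdict(list)
    for root in trees:
        by_scan[root.name].extend(root.children)
    return [Op(scan, branches) for scan, branches in by_scan.items()]

# Two op trees, both scanning the same small table, each feeding its own
# pruning sink for a different target partition column.
t1 = Op("TS[srcpart_hour]", [Op("SEL->GBY->SparkPruningSink(ds)")])
t2 = Op("TS[srcpart_hour]", [Op("SEL->GBY->SparkPruningSink(hr)")])

combined = combine_op_trees([t1, t2])
print(len(combined))              # one shared table scan remains
print(len(combined[0].children))  # both pruning branches are kept
```

Before combining there are two trees, hence two scans of the small table; after combining there is one scan with two pruning branches, which is exactly the saving the issue summary describes.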
> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
> Key: HIVE-11297
> URL: https://issues.apache.org/jira/browse/HIVE-11297
> Project: Hive
> Issue Type: Bug
> Affects Versions: spark-branch
> Reporter: Chao Sun
> Assignee: liyunzhang_intel
> Attachments: HIVE-11297.1.patch, HIVE-11297.2.patch,
> HIVE-11297.3.patch, HIVE-11297.4.patch, HIVE-11297.5.patch,
> HIVE-11297.6.patch, HIVE-11297.7.patch
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates
> partition info for more than one partition column, multiple operator trees
> are created, which all start from the same table scan op but have different
> Spark partition pruning sinks.
> As an optimization, we can combine these op trees so we don't have to scan
> the table multiple times.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)