Sahil Takiar created HIVE-17178:
-----------------------------------

             Summary: Spark Partition Pruning Sink Operator can't target multiple Works
                 Key: HIVE-17178
                 URL: https://issues.apache.org/jira/browse/HIVE-17178
             Project: Hive
          Issue Type: Sub-task
          Components: Spark
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar
A Spark Partition Pruning Sink Operator cannot be used to target multiple Map Work objects: the entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated whenever a single source table needs to prune multiple Map Works. The following query shows the issue:

{code}
set hive.spark.dynamic.partition.pruning=true;
set hive.auto.convert.join=true;

create table part_table_1 (col int) partitioned by (part_col int);
create table part_table_2 (col int) partitioned by (part_col int);
create table regular_table (col int);

insert into table regular_table values (1);

alter table part_table_1 add partition (part_col=1);
insert into table part_table_1 partition (part_col=1) values (1), (2), (3), (4);

alter table part_table_1 add partition (part_col=2);
insert into table part_table_1 partition (part_col=2) values (1), (2), (3), (4);

alter table part_table_2 add partition (part_col=1);
insert into table part_table_2 partition (part_col=1) values (1), (2), (3), (4);

alter table part_table_2 add partition (part_col=2);
insert into table part_table_2 partition (part_col=2) values (1), (2), (3), (4);

explain
select * from regular_table, part_table_1, part_table_2
where regular_table.col = part_table_1.part_col
  and regular_table.col = part_table_2.part_col;
{code}

The explain plan is:

{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: regular_table
                  Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: col is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: col (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 _col0 (type: int)
                          1 _col1 (type: int)
                          2 _col1 (type: int)
                      Select Operator
                        expressions: _col0 (type: int)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: part_col
                            Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                            target column name: part_col
                            target work: Map 2
                      Select Operator
                        expressions: _col0 (type: int)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: part_col
                            Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                            target column name: part_col
                            target work: Map 3
            Local Work:
              Map Reduce Local Work
        Map 3
            Map Operator Tree:
                TableScan
                  alias: part_table_2
                  Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: col (type: int), part_col (type: int)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      keys:
                        0 _col0 (type: int)
                        1 _col1 (type: int)
                        2 _col1 (type: int)
                    Select Operator
                      expressions: _col1 (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        keys: _col0 (type: int)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                        Spark Partition Pruning Sink Operator
                          partition key expr: part_col
                          Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                          target column name: part_col
                          target work: Map 2
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: part_table_1
                  Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: col (type: int), part_col (type: int)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 0 to 2
                      keys:
                        0 _col0 (type: int)
                        1 _col1 (type: int)
                        2 _col1 (type: int)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4
                      input vertices:
                        0 Map 1
                        2 Map 3
                      Statistics: Num rows: 17 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 17 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

The two DPP subtrees on Map 1 are identical except for their target work (Map 2 vs. Map 3). We should be able to combine them into a single subtree whose pruning sink targets both Map Works, which avoids doing duplicate work.
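A minimal sketch of one possible direction, assuming a hypothetical descriptor type: let the pruning sink carry a list of (target work, target column) pairs instead of a single pair, so one SEL-GBY-SINK chain can feed several Map Works. The class and method names below are illustrative only and do not reflect Hive's actual SparkPartitionPruningSinkDesc API.

{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: a pruning sink descriptor that holds
// multiple targets instead of exactly one.
public class MultiTargetPruningSinkDesc {

  // One pruning target: the Map Work to prune and its partition column.
  public static class TargetInfo {
    final String targetWorkName;   // e.g. "Map 2"
    final String targetColumnName; // e.g. "part_col"

    TargetInfo(String targetWorkName, String targetColumnName) {
      this.targetWorkName = targetWorkName;
      this.targetColumnName = targetColumnName;
    }
  }

  private final List<TargetInfo> targets = new ArrayList<>();

  // Instead of cloning the SEL-GBY-SINK subtree for a second table,
  // the optimizer would register another target on the existing sink.
  public void addTarget(String targetWorkName, String targetColumnName) {
    targets.add(new TargetInfo(targetWorkName, targetColumnName));
  }

  public List<TargetInfo> getTargets() {
    return targets;
  }
}
{code}

With something along these lines, the optimizer could detect that two DPP subtrees hanging off the same operator are equivalent (same select expressions, same group-by key) and merge them by folding the second sink's target into the first, so the pruning output is computed once and consumed by both Map 2 and Map 3.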