[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

liyunzhang (JIRA) Fri, 27 Oct 2017 02:12:50 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221989#comment-16221989
 ]


liyunzhang commented on HIVE-17486:
-----------------------------------

mapjoin.q
{code}
set hive.mapred.mode=nonstrict;
set hive.explain.user=false;
set hive.auto.convert.join=true;
explain
select src1.key, src1.cnt1, src2.cnt1 from
(
  select key, count(*) as cnt1 from 
  (
    select a.key as key, a.value as val1, b.value as val2 from tbl1 a join tbl2 
b on a.key = b.key
  ) subq1 group by key
) src1
join
(
  select key, count(*) as cnt1 from 
  (
    select a.key as key, a.value as val1, b.value as val2 from tbl1 a join tbl2 
b on a.key = b.key
  ) subq2 group by key
) src2
on src1.key = src2.key
{code}

before shared work optimization, the physical plan in Tez
{code}
TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[11]-GBY[12]-MAPJOIN[47]-SEL[31]-FS[32]
TS[3]-FIL[42]-SEL[5]-RS[7]-MAPJOIN[45]
TS[14]-FIL[43]-SEL[16]-MAPJOIN[46]-GBY[24]-RS[25]-GBY[26]-RS[29]-MAPJOIN[47]
TS[17]-FIL[44]-SEL[19]-RS[21]-MAPJOIN[46]
{code}

after the optimization
{code}

TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[11]-GBY[12]-MAPJOIN[47]-SEL[31]-FS[32]
                                        -RS[25]-GBY[26]-RS[29]-MAPJOIN[47]
TS[3]-FIL[42]-SEL[5]-RS[7]-MAPJOIN[45]
{code}

so the tez explain 
before
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: root_20171027044107_4977290a-6856-41c5-b0e3-24b4ec32c59e:1
      Edges:
        Map 1 <- Map 3 (BROADCAST_EDGE)
        Map 4 <- Map 6 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 5 (BROADCAST_EDGE)
        Reducer 5 <- Map 4 (SIMPLE_EDGE)
      DagName: root_20171027044107_4977290a-6856-41c5-b0e3-24b4ec32c59e:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: a
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                        HybridGraceHashJoin: true
                        Group By Operator
                          aggregations: count()
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                            value expressions: _col1 (type: bigint)
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: a
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 6
                        Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                        HybridGraceHashJoin: true
                        Group By Operator
                          aggregations: count()
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                            value expressions: _col1 (type: bigint)
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
        Reducer 2 
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)
                  outputColumnNames: _col0, _col1, _col3
                  input vertices:
                    1 Reducer 5
                  Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                  HybridGraceHashJoin: true
                  Select Operator
                    expressions: _col0 (type: int), _col3 (type: bigint), _col1 
(type: bigint)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 5 Data size: 38 Basic stats: 
COMPLETE Column stats: NONE
                      table:
                          input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Reducer 5 
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                  value expressions: _col1 (type: bigint)

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{code}

After
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: root_20171027043345_8893c266-bde2-4f23-9914-0dfccc8be5ea:1
      Edges:
        Map 1 <- Map 4 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 3 (BROADCAST_EDGE)
        Reducer 3 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20171027043345_8893c266-bde2-4f23-9914-0dfccc8be5ea:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: a
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 4
                        Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                        HybridGraceHashJoin: true
                        Group By Operator
                          aggregations: count()
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                            value expressions: _col1 (type: bigint)
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                            value expressions: _col1 (type: bigint)
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
        Reducer 2 
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)
                  outputColumnNames: _col0, _col1, _col3
                  input vertices:
                    1 Reducer 3
                  Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                  HybridGraceHashJoin: true
                  Select Operator
                    expressions: _col0 (type: int), _col3 (type: bigint), _col1 
(type: bigint)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 5 Data size: 38 Basic stats: 
COMPLETE Column stats: NONE
                      table:
                          input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Reducer 3 
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                  value expressions: _col1 (type: bigint)

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{code}

We can see that there are only 2 Maps(Map1 and Map4) in the after explain (in 
before explain, there are 4 Maps).
When i tried to change the physical plan for HOS like what HOT does. I found 
change of the optimization  
before
{code}
STAGE DEPENDENCIES:
  Stage-3 is a root stage
  Stage-2 depends on stages: Stage-3
  Stage-4 depends on stages: Stage-2
  Stage-1 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-3
    Spark
      DagName: root_20171027045349_0dc8e597-ffad-4da7-bbfc-c2b7533d9b58:4
      Vertices:
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-2
    Spark
      Edges:
        Reducer 5 <- Map 4 (GROUP, 1)
      DagName: root_20171027045349_0dc8e597-ffad-4da7-bbfc-c2b7533d9b58:2
      Vertices:
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: a
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 6
                        Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          aggregations: count()
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                            value expressions: _col1 (type: bigint)
            Local Work:
              Map Reduce Local Work
        Reducer 5 
            Local Work:
              Map Reduce Local Work
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                Spark HashTable Sink Operator
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)

  Stage: Stage-4
    Spark
      DagName: root_20171027045349_0dc8e597-ffad-4da7-bbfc-c2b7533d9b58:3
      Vertices:
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: root_20171027045349_0dc8e597-ffad-4da7-bbfc-c2b7533d9b58:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: a
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          aggregations: count()
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                            value expressions: _col1 (type: bigint)
            Local Work:
              Map Reduce Local Work
        Reducer 2 
            Local Work:
              Map Reduce Local Work
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)
                  outputColumnNames: _col0, _col1, _col3
                  input vertices:
                    1 Reducer 5
                  Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: int), _col3 (type: bigint), _col1 
(type: bigint)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 5 Data size: 38 Basic stats: 
COMPLETE Column stats: NONE
                      table:
                          input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink


{code}

After
{code}
STAGE DEPENDENCIES:
  Stage-3 is a root stage
  Stage-2 depends on stages: Stage-3
  Stage-4 depends on stages: Stage-2
  Stage-1 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-3
    Spark
      DagName: root_20171027045324_b7f21ae7-9bcc-4077-8577-e6c6747dfcff:3
      Vertices:
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-2
    Spark
      Edges:
        Reducer 3 <- Map 6 (GROUP, 1)
      DagName: root_20171027045324_b7f21ae7-9bcc-4077-8577-e6c6747dfcff:2
      Vertices:
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: a
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 4
                        Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          aggregations: count()
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                            value expressions: _col1 (type: bigint)
            Local Work:
              Map Reduce Local Work
        Reducer 3 
            Local Work:
              Map Reduce Local Work
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                Spark HashTable Sink Operator
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)

  Stage: Stage-4
    Spark
      DagName: root_20171027045324_b7f21ae7-9bcc-4077-8577-e6c6747dfcff:4
      Vertices:
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 5 (GROUP, 1)
      DagName: root_20171027045324_b7f21ae7-9bcc-4077-8577-e6c6747dfcff:1
      Vertices:
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: a
                  Statistics: Num rows: 10 Data size: 70 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: key (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 10 Data size: 70 Basic stats: 
COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 4
                        Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          aggregations: count()
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 11 Data size: 77 Basic stats: 
COMPLETE Column stats: NONE
                            value expressions: _col1 (type: bigint)
            Local Work:
              Map Reduce Local Work
        Reducer 2 
            Local Work:
              Map Reduce Local Work
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5 Data size: 35 Basic stats: COMPLETE 
Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)
                  outputColumnNames: _col0, _col1, _col3
                  input vertices:
                    1 Reducer 3
                  Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: int), _col3 (type: bigint), _col1 
(type: bigint)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 5 Data size: 38 Basic stats: COMPLETE 
Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 5 Data size: 38 Basic stats: 
COMPLETE Column stats: NONE
                      table:
                          input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{code}
only the Map about table {{b}} are merged to 1 Map, there are still two Maps 
about table {{a}}(Map1,Map4).

The reason causes this is because 
[GenSparkWork|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L421]
 will split following physical operator tree once encounting RS
{code}
TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[11]-GBY[12]-MAPJOIN[47]-SEL[31]-FS[32]
                                        -RS[25]-GBY[26]-RS[29]-MAPJOIN[47]
TS[3]-FIL[42]-SEL[5]-RS[7]-MAPJOIN[45]
{code}  to 
{code}
Map1:TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[11]
Reduce2:GBY[12]-MAPJOIN[47]-SEL[31]-FS[32]
Map3:TS[0]-FIL[41]-SEL[2]-MAPJOIN[45]-GBY[10]-RS[25]
Reduce4:GBY[26]-RS[29]
Map5:TS[3]-FIL[42]-SEL[5]-RS[7]
{code}



> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. This result of table scan will be used by more 1 child spark 
> work. Thus we need not do the same computation because of cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

Reply via email to