[jira] [Commented] (HIVE-8701) Combine nested map joins into the parent map join if possible [Spark Branch]

Szehon Ho (JIRA) Wed, 19 Nov 2014 18:12:13 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218898#comment-14218898
 ]


Szehon Ho commented on HIVE-8701:
---------------------------------

Hi Suhas, sure the plan I see looks like this, for a modified plan of 
auto_join2 that forces mapjoin to be in the same operator:

{noformat}
STAGE DEPENDENCIES:
  Stage-3 is a root stage
  Stage-1 depends on stages: Stage-3
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-3
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: src2
                  Statistics: Num rows: 500 Data size: 5312 Basic stats: 
COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 250 Data size: 2656 Basic stats: 
COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      condition expressions:
                        0 {key}
                        1 
                      keys:
                        0 key (type: string)
                        1 key (type: string)
            Local Work:
              Map Reduce Local Work
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: smalltable
                  Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL 
Column stats: NONE
                  Filter Operator
                    predicate: UDFToDouble(key) is not null (type: boolean)
                    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
                    Spark HashTable Sink Operator
                      condition expressions:
                        0 {_col0} {_col5}
                        1 {key}
                      keys:
                        0 (_col0 + _col5) (type: double)
                        1 UDFToDouble(key) (type: double)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: src1
                  Statistics: Num rows: 500 Data size: 5312 Basic stats: 
COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 250 Data size: 2656 Basic stats: 
COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {key}
                        1 {key}
                      keys:
                        0 key (type: string)
                        1 key (type: string)
                      outputColumnNames: _col0, _col5
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 275 Data size: 2921 Basic stats: 
COMPLETE Column stats: NONE
                      Filter Operator
                        predicate: (_col0 + _col5) is not null (type: boolean)
                        Statistics: Num rows: 138 Data size: 1465 Basic stats: 
COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          condition expressions:
                            0 {_col0} {_col5}
                            1 {key}
                          keys:
                            0 (_col0 + _col5) (type: double)
                            1 UDFToDouble(key) (type: double)
                          outputColumnNames: _col0, _col5, _col10
                          input vertices:
                            1 Map 3
                          Statistics: Num rows: 151 Data size: 1611 Basic 
stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col0 (type: string), _col5 (type: 
string), _col10 (type: string)
                            outputColumnNames: _col0, _col1, _col2
                            Statistics: Num rows: 151 Data size: 1611 Basic 
stats: COMPLETE Column stats: NONE
                            File Output Operator
                              compressed: false
                              Statistics: Num rows: 151 Data size: 1611 Basic 
stats: COMPLETE Column stats: NONE
                              table:
                                  input format: 
org.apache.hadoop.mapred.TextInputFormat
                                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                                  serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{noformat}

The issue is there are two mapjoins in the same work, which is actually good 
most of time, but we should make sure we don't overwhelm the executor memory in 
that case.  Check should be straight-forward in theory, just to include also 
size of any parent mapjoin that is directly connected (ie, no RS or HTS) in the 
calculation of table size.

> Combine nested map joins into the parent map join if possible [Spark Branch]
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-8701
>                 URL: https://issues.apache.org/jira/browse/HIVE-8701
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Szehon Ho
>
> With the work in HIVE-8616 enabled, the generated plan shows that the nested 
> map join operator isn't merged to its parent when possible. This is 
> demonstrated in auto_join2.q. The MR plan shown that this optimization is in 
> place. We should do the same for Spark.
> {code}
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Map 2 <- Map 3 (NONE, 0)
>         Map 3 <- Map 1 (NONE, 0)
>       DagName: xzhang_20141102074141_ac089634-bf01-4386-b1cf-3e7f2e99f6eb:3
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: src2
>                   Statistics: Num rows: 58 Data size: 5812 Basic stats: 
> COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: key is not null (type: boolean)
>                     Statistics: Num rows: 29 Data size: 2906 Basic stats: 
> COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: key (type: string)
>                       sort order: +
>                       Map-reduce partition columns: key (type: string)
>                       Statistics: Num rows: 29 Data size: 2906 Basic stats: 
> COMPLETE Column stats: NONE
>         Map 2 
>             Map Operator Tree:
>                 TableScan
>                   alias: src3
>                   Statistics: Num rows: 29 Data size: 5812 Basic stats: 
> COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: UDFToDouble(key) is not null (type: boolean)
>                     Statistics: Num rows: 15 Data size: 3006 Basic stats: 
> COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {_col0}
>                         1 {value}
>                       keys:
>                         0 (_col0 + _col5) (type: double)
>                         1 UDFToDouble(key) (type: double)
>                       outputColumnNames: _col0, _col11
>                       input vertices:
>                         0 Map 3
>                       Statistics: Num rows: 17 Data size: 1813 Basic stats: 
> COMPLETE Column stats: NONE
>                       Select Operator
>                         expressions: _col0 (type: string), _col11 (type: 
> string)
>                         outputColumnNames: _col0, _col1
>                         Statistics: Num rows: 17 Data size: 1813 Basic stats: 
> COMPLETE Column stats: NONE
>                         File Output Operator
>                           compressed: false
>                           Statistics: Num rows: 17 Data size: 1813 Basic 
> stats: COMPLETE Column stats: NONE
>                           table:
>                               input format: 
> org.apache.hadoop.mapred.TextInputFormat
>                               output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                               serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Map 3 
>             Map Operator Tree:
>                 TableScan
>                   alias: src1
>                   Statistics: Num rows: 58 Data size: 5812 Basic stats: 
> COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: key is not null (type: boolean)
>                     Statistics: Num rows: 29 Data size: 2906 Basic stats: 
> COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {key}
>                         1 {key}
>                       keys:
>                         0 key (type: string)
>                         1 key (type: string)
>                       outputColumnNames: _col0, _col5
>                       input vertices:
>                         1 Map 1
>                       Statistics: Num rows: 31 Data size: 3196 Basic stats: 
> COMPLETE Column stats: NONE
>                       Filter Operator
>                         predicate: (_col0 + _col5) is not null (type: boolean)
>                         Statistics: Num rows: 16 Data size: 1649 Basic stats: 
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           key expressions: (_col0 + _col5) (type: double)
>                           sort order: +
>                           Map-reduce partition columns: (_col0 + _col5) 
> (type: double)
>                           Statistics: Num rows: 16 Data size: 1649 Basic 
> stats: COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: string)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-8701) Combine nested map joins into the parent map join if possible [Spark Branch]

Reply via email to