[jira] [Commented] (HIVE-2337) Predicate pushdown erroneously conservative with outer joins

Charles Chen (JIRA) Wed, 03 Aug 2011 11:33:55 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078906#comment-13078906
 ]


Charles Chen commented on HIVE-2337:
------------------------------------

Using:
{noformat}
create table t1 (id int, key string, value string);
create table t2 (id int, key string, value string);
create table t3 (id int, key string, value string);
create table t4 (id int, key string, value string);
{noformat}

For example, the query
{noformat}
explain select * from t1 full outer join t2 on t1.id=t2.id join t3 on 
t2.id=t3.id where t3.id=20; 
{noformat}
currently gives
{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t1 
          TableScan
            alias: t1
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 0
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
        t2 
          TableScan
            alias: t2
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 1
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
        t3 
          TableScan
            alias: t3
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 2
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
               Inner Join 1 to 2
          condition expressions:
            0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _col10, 
_col11, _col12
          Filter Operator
            predicate:
                expr: (_col10 = 20)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: _col1
                    type: string
                    expr: _col2
                    type: string
                    expr: _col5
                    type: int
                    expr: _col6
                    type: string
                    expr: _col7
                    type: string
                    expr: _col10
                    type: int
                    expr: _col11
                    type: string
                    expr: _col12
                    type: string
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
_col6, _col7, _col8
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}
while the correct behavior is to push the filter "t3.id=20" down:
{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t1 
          TableScan
            alias: t1
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 0
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
        t2 
          TableScan
            alias: t2
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 1
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
        t3 
          TableScan
            alias: t3
            Filter Operator
              predicate:
                  expr: (id = 20)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: id
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: id
                      type: int
                tag: 2
                value expressions:
                      expr: id
                      type: int
                      expr: key
                      type: string
                      expr: value
                      type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
               Inner Join 1 to 2
          condition expressions:
            0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _col10, 
_col11, _col12
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: string
                  expr: _col2
                  type: string
                  expr: _col5
                  type: int
                  expr: _col6
                  type: string
                  expr: _col7
                  type: string
                  expr: _col10
                  type: int
                  expr: _col11
                  type: string
                  expr: _col12
                  type: string
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, 
_col7, _col8
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}

The current behavior is actually stranger than this: for a left outer join 
(similarly for a right outer join), hive finds the leftmost alias referred to 
in the *predicates* of left outer joins and rejects any alias to the right of 
it for pushdown.  So in this query the filter "t2.id=20" pushed down:
{noformat}
explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
(t2.id=t3.id) where t2.id=20;
{noformat}
while it isn't here:
{noformat}
explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
(t1.id=t3.id) where t2.id=20;
{noformat}

> Predicate pushdown erroneously conservative with outer joins
> ------------------------------------------------------------
>
>                 Key: HIVE-2337
>                 URL: https://issues.apache.org/jira/browse/HIVE-2337
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Charles Chen
>            Assignee: Charles Chen
>         Attachments: HIVE-2337v1.patch
>
>
> The predicate pushdown filter is not applying left associativity of joins 
> correctly in determining possible aliases for pushing predicates.
> In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria for 
> pushing aliases is specified as:
> {noformat}
>     /**
>      * Figures out the aliases for whom it is safe to push predicates based on
>      * ANSI SQL semantics For inner join, all predicates for all aliases can 
> be
>      * pushed For full outer join, none of the predicates can be pushed as 
> that
>      * would limit the number of rows for join For left outer join, all the
>      * predicates on the left side aliases can be pushed up For right outer
>      * join, all the predicates on the right side aliases can be pushed up 
> Joins
>      * chain containing both left and right outer joins are treated as full
>      * outer join. [...]
>      *
>      * @param op
>      *          Join Operator
>      * @param rr
>      *          Row resolver
>      * @return set of qualified aliases
>      */
> {noformat}
> Since hive joins are left associative, something like "a RIGHT OUTER JOIN b 
> LEFT OUTER JOIN c INNER JOIN d" should be interpreted as "((a RIGHT OUTER 
> JOIN b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases where joins 
> with both left and right outer joins can have aliases that can be pushed.  
> Here, aliases b and d are eligible to be pushed up while the current criteria 
> provide that none are eligible.
> Using:
> {noformat}
> create table t1 (id int, key string, value string);
> create table t2 (id int, key string, value string);
> create table t3 (id int, key string, value string);
> create table t4 (id int, key string, value string);
> {noformat}
> For example, the query
> {noformat}
> explain select * from t1 full outer join t2 on t1.id=t2.id join t3 on 
> t2.id=t3.id where t3.id=20; 
> {noformat}
> currently gives
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t1 
>           TableScan
>             alias: t1
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t2 
>           TableScan
>             alias: t2
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t3 
>           TableScan
>             alias: t3
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 2
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, 
> _col10, _col11, _col12
>           Filter Operator
>             predicate:
>                 expr: (_col10 = 20)
>                 type: boolean
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: int
>                     expr: _col1
>                     type: string
>                     expr: _col2
>                     type: string
>                     expr: _col5
>                     type: int
>                     expr: _col6
>                     type: string
>                     expr: _col7
>                     type: string
>                     expr: _col10
>                     type: int
>                     expr: _col11
>                     type: string
>                     expr: _col12
>                     type: string
>               outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
> _col6, _col7, _col8
>               File Output Operator
>                 compressed: false
>                 GlobalTableId: 0
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {noformat}
> while the correct behavior is to push the filter "t3.id=20" down:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t1 
>           TableScan
>             alias: t1
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t2 
>           TableScan
>             alias: t2
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t3 
>           TableScan
>             alias: t3
>             Filter Operator
>               predicate:
>                   expr: (id = 20)
>                   type: boolean
>               Reduce Output Operator
>                 key expressions:
>                       expr: id
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: id
>                       type: int
>                 tag: 2
>                 value expressions:
>                       expr: id
>                       type: int
>                       expr: key
>                       type: string
>                       expr: value
>                       type: string
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, 
> _col10, _col11, _col12
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: int
>                   expr: _col1
>                   type: string
>                   expr: _col2
>                   type: string
>                   expr: _col5
>                   type: int
>                   expr: _col6
>                   type: string
>                   expr: _col7
>                   type: string
>                   expr: _col10
>                   type: int
>                   expr: _col11
>                   type: string
>                   expr: _col12
>                   type: string
>             outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
> _col6, _col7, _col8
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {noformat}
> The current behavior is actually stranger than this: for a left outer join 
> (similarly for a right outer join), hive finds the leftmost alias referred to 
> in the *predicates* of left outer joins and rejects any alias to the right of 
> it for pushdown.  So in this query the filter "t2.id=20" pushed down:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
> (t2.id=t3.id) where t2.id=20;
> {noformat}
> while it isn't here:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
> (t1.id=t3.id) where t2.id=20;
> {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2337) Predicate pushdown erroneously conservative with outer joins

Reply via email to