[jira] [Updated] (HIVE-2337) Predicate pushdown erroneously conservative with outer joins

Charles Chen (JIRA) Wed, 03 Aug 2011 11:31:51 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Charles Chen updated HIVE-2337:
-------------------------------

    Description: 
The predicate pushdown filter is not applying left associativity of joins 
correctly in determining possible aliases for pushing predicates.

In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria for 
pushing aliases is specified as:
{noformat}
    /**
     * Figures out the aliases for whom it is safe to push predicates based on
     * ANSI SQL semantics For inner join, all predicates for all aliases can be
     * pushed For full outer join, none of the predicates can be pushed as that
     * would limit the number of rows for join For left outer join, all the
     * predicates on the left side aliases can be pushed up For right outer
     * join, all the predicates on the right side aliases can be pushed up Joins
     * chain containing both left and right outer joins are treated as full
     * outer join. [...]
     *
     * @param op
     *          Join Operator
     * @param rr
     *          Row resolver
     * @return set of qualified aliases
     */
{noformat}

Since hive joins are left associative, something like "a RIGHT OUTER JOIN b 
LEFT OUTER JOIN c INNER JOIN d" should be interpreted as "((a RIGHT OUTER JOIN 
b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases where joins with 
both left and right outer joins can have aliases that can be pushed.  Here, 
aliases b and d are eligible to be pushed up while the current criteria provide 
that none are eligible.

Using:
{noformat}
create table t1 (id int, key string, value string);
create table t2 (id int, key string, value string);
create table t3 (id int, key string, value string);
create table t4 (id int, key string, value string);
{noformat}

For example, the query
{noformat}
explain select * from t1 full outer join t2 on t1.id=t2.id join t3 on 
t2.id=t3.id where t3.id=20; 
{noformat}
currently gives
{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t1 
          TableScan
            alias: t1
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 0
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
        t2 
          TableScan
            alias: t2
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 1
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
        t3 
          TableScan
            alias: t3
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 2
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
               Inner Join 1 to 2
          condition expressions:
            0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _col10, 
_col11, _col12
          Filter Operator
            predicate:
                expr: (_col10 = 20)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: _col1
                    type: string
                    expr: _col2
                    type: string
                    expr: _col5
                    type: int
                    expr: _col6
                    type: string
                    expr: _col7
                    type: string
                    expr: _col10
                    type: int
                    expr: _col11
                    type: string
                    expr: _col12
                    type: string
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
_col6, _col7, _col8
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}
while the correct behavior is to push the filter "t3.id=20" down:
{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t1 
          TableScan
            alias: t1
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 0
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
        t2 
          TableScan
            alias: t2
            Reduce Output Operator
              key expressions:
                    expr: id
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: id
                    type: int
              tag: 1
              value expressions:
                    expr: id
                    type: int
                    expr: key
                    type: string
                    expr: value
                    type: string
        t3 
          TableScan
            alias: t3
            Filter Operator
              predicate:
                  expr: (id = 20)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: id
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: id
                      type: int
                tag: 2
                value expressions:
                      expr: id
                      type: int
                      expr: key
                      type: string
                      expr: value
                      type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
               Inner Join 1 to 2
          condition expressions:
            0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _col10, 
_col11, _col12
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: string
                  expr: _col2
                  type: string
                  expr: _col5
                  type: int
                  expr: _col6
                  type: string
                  expr: _col7
                  type: string
                  expr: _col10
                  type: int
                  expr: _col11
                  type: string
                  expr: _col12
                  type: string
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, 
_col7, _col8
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}

The current behavior is actually stranger than this: for a left outer join 
(similarly for a right outer join), hive finds the leftmost alias referred to 
in the *predicates* of left outer joins and rejects any alias to the right of 
it for pushdown.  So in this query the filter "t2.id=20" pushed down:
{noformat}
explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
(t2.id=t3.id) where t2.id=20;
{noformat}
while it isn't here:
{noformat}
explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
(t1.id=t3.id) where t2.id=20;
{noformat}

  was:
The predicate pushdown filter is not applying left associativity of joins 
correctly in determining possible aliases for pushing predicates.

In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria for 
pushing aliases is specified as:
{noformat}
    /**
     * Figures out the aliases for whom it is safe to push predicates based on
     * ANSI SQL semantics For inner join, all predicates for all aliases can be
     * pushed For full outer join, none of the predicates can be pushed as that
     * would limit the number of rows for join For left outer join, all the
     * predicates on the left side aliases can be pushed up For right outer
     * join, all the predicates on the right side aliases can be pushed up Joins
     * chain containing both left and right outer joins are treated as full
     * outer join. [...]
     *
     * @param op
     *          Join Operator
     * @param rr
     *          Row resolver
     * @return set of qualified aliases
     */
{noformat}

Since hive joins are left associative, something like "a RIGHT OUTER JOIN b 
LEFT OUTER JOIN c INNER JOIN d" should be interpreted as "((a RIGHT OUTER JOIN 
b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases where joins with 
both left and right outer joins can have aliases that can be pushed.  Here, 
aliases b and d are eligible to be pushed up while the current criteria provide 
that none are eligible.


> Predicate pushdown erroneously conservative with outer joins
> ------------------------------------------------------------
>
>                 Key: HIVE-2337
>                 URL: https://issues.apache.org/jira/browse/HIVE-2337
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Charles Chen
>            Assignee: Charles Chen
>         Attachments: HIVE-2337v1.patch
>
>
> The predicate pushdown filter is not applying left associativity of joins 
> correctly in determining possible aliases for pushing predicates.
> In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria for 
> pushing aliases is specified as:
> {noformat}
>     /**
>      * Figures out the aliases for whom it is safe to push predicates based on
>      * ANSI SQL semantics For inner join, all predicates for all aliases can 
> be
>      * pushed For full outer join, none of the predicates can be pushed as 
> that
>      * would limit the number of rows for join For left outer join, all the
>      * predicates on the left side aliases can be pushed up For right outer
>      * join, all the predicates on the right side aliases can be pushed up 
> Joins
>      * chain containing both left and right outer joins are treated as full
>      * outer join. [...]
>      *
>      * @param op
>      *          Join Operator
>      * @param rr
>      *          Row resolver
>      * @return set of qualified aliases
>      */
> {noformat}
> Since hive joins are left associative, something like "a RIGHT OUTER JOIN b 
> LEFT OUTER JOIN c INNER JOIN d" should be interpreted as "((a RIGHT OUTER 
> JOIN b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases where joins 
> with both left and right outer joins can have aliases that can be pushed.  
> Here, aliases b and d are eligible to be pushed up while the current criteria 
> provide that none are eligible.
> Using:
> {noformat}
> create table t1 (id int, key string, value string);
> create table t2 (id int, key string, value string);
> create table t3 (id int, key string, value string);
> create table t4 (id int, key string, value string);
> {noformat}
> For example, the query
> {noformat}
> explain select * from t1 full outer join t2 on t1.id=t2.id join t3 on 
> t2.id=t3.id where t3.id=20; 
> {noformat}
> currently gives
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t1 
>           TableScan
>             alias: t1
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t2 
>           TableScan
>             alias: t2
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t3 
>           TableScan
>             alias: t3
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 2
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, 
> _col10, _col11, _col12
>           Filter Operator
>             predicate:
>                 expr: (_col10 = 20)
>                 type: boolean
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: int
>                     expr: _col1
>                     type: string
>                     expr: _col2
>                     type: string
>                     expr: _col5
>                     type: int
>                     expr: _col6
>                     type: string
>                     expr: _col7
>                     type: string
>                     expr: _col10
>                     type: int
>                     expr: _col11
>                     type: string
>                     expr: _col12
>                     type: string
>               outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
> _col6, _col7, _col8
>               File Output Operator
>                 compressed: false
>                 GlobalTableId: 0
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {noformat}
> while the correct behavior is to push the filter "t3.id=20" down:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t1 
>           TableScan
>             alias: t1
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t2 
>           TableScan
>             alias: t2
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t3 
>           TableScan
>             alias: t3
>             Filter Operator
>               predicate:
>                   expr: (id = 20)
>                   type: boolean
>               Reduce Output Operator
>                 key expressions:
>                       expr: id
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: id
>                       type: int
>                 tag: 2
>                 value expressions:
>                       expr: id
>                       type: int
>                       expr: key
>                       type: string
>                       expr: value
>                       type: string
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, 
> _col10, _col11, _col12
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: int
>                   expr: _col1
>                   type: string
>                   expr: _col2
>                   type: string
>                   expr: _col5
>                   type: int
>                   expr: _col6
>                   type: string
>                   expr: _col7
>                   type: string
>                   expr: _col10
>                   type: int
>                   expr: _col11
>                   type: string
>                   expr: _col12
>                   type: string
>             outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
> _col6, _col7, _col8
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {noformat}
> The current behavior is actually stranger than this: for a left outer join 
> (similarly for a right outer join), hive finds the leftmost alias referred to 
> in the *predicates* of left outer joins and rejects any alias to the right of 
> it for pushdown.  So in this query the filter "t2.id=20" pushed down:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
> (t2.id=t3.id) where t2.id=20;
> {noformat}
> while it isn't here:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
> (t1.id=t3.id) where t2.id=20;
> {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2337) Predicate pushdown erroneously conservative with outer joins

Reply via email to