[jira] [Commented] (HIVE-2337) Predicate pushdown erroneously conservative with outer joins

John Sichi (JIRA) Thu, 01 Sep 2011 14:18:37 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095607#comment-13095607
 ]


John Sichi commented on HIVE-2337:
----------------------------------

+1.  Will commit when tests pass.


> Predicate pushdown erroneously conservative with outer joins
> ------------------------------------------------------------
>
>                 Key: HIVE-2337
>                 URL: https://issues.apache.org/jira/browse/HIVE-2337
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Charles Chen
>            Assignee: Charles Chen
>             Fix For: 0.9.0
>
>         Attachments: HIVE-2337v1.patch, HIVE-2337v2.patch, HIVE-2337v3.patch, 
> HIVE-2337v4.patch, HIVE-2337v5.patch, HIVE-2337v6.patch, HIVE-2337v7.patch
>
>
> The predicate pushdown filter is not applying left associativity of joins 
> correctly in determining possible aliases for pushing predicates.
> In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria for 
> pushing aliases is specified as:
> {noformat}
>     /**
>      * Figures out the aliases for whom it is safe to push predicates based on
>      * ANSI SQL semantics For inner join, all predicates for all aliases can 
> be
>      * pushed For full outer join, none of the predicates can be pushed as 
> that
>      * would limit the number of rows for join For left outer join, all the
>      * predicates on the left side aliases can be pushed up For right outer
>      * join, all the predicates on the right side aliases can be pushed up 
> Joins
>      * chain containing both left and right outer joins are treated as full
>      * outer join. [...]
>      *
>      * @param op
>      *          Join Operator
>      * @param rr
>      *          Row resolver
>      * @return set of qualified aliases
>      */
> {noformat}
> Since hive joins are left associative, something like "a RIGHT OUTER JOIN b 
> LEFT OUTER JOIN c INNER JOIN d" should be interpreted as "((a RIGHT OUTER 
> JOIN b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases where joins 
> with both left and right outer joins can have aliases that can be pushed.  
> Here, aliases b and d are eligible to be pushed up while the current criteria 
> provide that none are eligible.
> Using:
> {noformat}
> create table t1 (id int, key string, value string);
> create table t2 (id int, key string, value string);
> create table t3 (id int, key string, value string);
> create table t4 (id int, key string, value string);
> {noformat}
> For example, the query
> {noformat}
> explain select * from t1 full outer join t2 on t1.id=t2.id join t3 on 
> t2.id=t3.id where t3.id=20; 
> {noformat}
> currently gives
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t1 
>           TableScan
>             alias: t1
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t2 
>           TableScan
>             alias: t2
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t3 
>           TableScan
>             alias: t3
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 2
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, 
> _col10, _col11, _col12
>           Filter Operator
>             predicate:
>                 expr: (_col10 = 20)
>                 type: boolean
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: int
>                     expr: _col1
>                     type: string
>                     expr: _col2
>                     type: string
>                     expr: _col5
>                     type: int
>                     expr: _col6
>                     type: string
>                     expr: _col7
>                     type: string
>                     expr: _col10
>                     type: int
>                     expr: _col11
>                     type: string
>                     expr: _col12
>                     type: string
>               outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
> _col6, _col7, _col8
>               File Output Operator
>                 compressed: false
>                 GlobalTableId: 0
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {noformat}
> while the correct behavior is to push the filter "t3.id=20" down:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t1 
>           TableScan
>             alias: t1
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t2 
>           TableScan
>             alias: t2
>             Reduce Output Operator
>               key expressions:
>                     expr: id
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: id
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: id
>                     type: int
>                     expr: key
>                     type: string
>                     expr: value
>                     type: string
>         t3 
>           TableScan
>             alias: t3
>             Filter Operator
>               predicate:
>                   expr: (id = 20)
>                   type: boolean
>               Reduce Output Operator
>                 key expressions:
>                       expr: id
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: id
>                       type: int
>                 tag: 2
>                 value expressions:
>                       expr: id
>                       type: int
>                       expr: key
>                       type: string
>                       expr: value
>                       type: string
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>             2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, 
> _col10, _col11, _col12
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: int
>                   expr: _col1
>                   type: string
>                   expr: _col2
>                   type: string
>                   expr: _col5
>                   type: int
>                   expr: _col6
>                   type: string
>                   expr: _col7
>                   type: string
>                   expr: _col10
>                   type: int
>                   expr: _col11
>                   type: string
>                   expr: _col12
>                   type: string
>             outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
> _col6, _col7, _col8
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {noformat}
> The current behavior is actually stranger than this: for a left outer join 
> (similarly for a right outer join), hive finds the leftmost alias referred to 
> in the *predicates* of left outer joins and rejects any alias to the right of 
> it for pushdown.  So in this query the filter "t2.id=20" pushed down:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
> (t2.id=t3.id) where t2.id=20;
> {noformat}
> while it isn't here:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on 
> (t1.id=t3.id) where t2.id=20;
> {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2337) Predicate pushdown erroneously conservative with outer joins

Reply via email to