[jira] [Commented] (HIVE-12465) Hive might produce wrong results when (outer) joins are merged

Jesus Camacho Rodriguez (JIRA) Tue, 24 Nov 2015 07:36:45 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024688#comment-15024688
 ]


Jesus Camacho Rodriguez commented on HIVE-12465:
------------------------------------------------

[~ashutoshc], could you take a look? I need to regenerate the mergejoin.q test 
file for Tez; the rest of the test fails are not related.

SemanticAnalyzer itself has some simple logic to push predicates down. This 
patch checks whether the join below is an outer join, in which case the 
predicate cannot be pushed, as it would lead to incorrect results. I have added 
tests to mergejoin.q that reproduce the issue.

> Hive might produce wrong results when (outer) joins are merged
> --------------------------------------------------------------
>
>                 Key: HIVE-12465
>                 URL: https://issues.apache.org/jira/browse/HIVE-12465
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.3.0, 2.0.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Blocker
>         Attachments: HIVE-12465.01.patch, HIVE-12465.patch
>
>
> Consider the following query:
> {noformat}
> select * from
>   (select * from tab where tab.key = 0)a
> full outer join
>   (select * from tab_part where tab_part.key = 98)b
> join
>   tab_part c
> on a.key = b.key and b.key = c.key;
> {noformat}
> Hive should execute the full outer join operation (without ON clause) and 
> then the join operation (ON a.key = b.key and b.key = c.key). Instead, it 
> merges both joins, generating the following plan:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: tab
>             filterExpr: (key = 0) (type: boolean)
>             Statistics: Num rows: 242 Data size: 22748 Basic stats: COMPLETE 
> Column stats: NONE
>             Filter Operator
>               predicate: (key = 0) (type: boolean)
>               Statistics: Num rows: 121 Data size: 11374 Basic stats: 
> COMPLETE Column stats: NONE
>               Select Operator
>                 expressions: 0 (type: int), value (type: string), ds (type: 
> string)
>                 outputColumnNames: _col0, _col1, _col2
>                 Statistics: Num rows: 121 Data size: 11374 Basic stats: 
> COMPLETE Column stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col0 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col0 (type: int)
>                   Statistics: Num rows: 121 Data size: 11374 Basic stats: 
> COMPLETE Column stats: NONE
>                   value expressions: _col1 (type: string), _col2 (type: 
> string)
>           TableScan
>             alias: tab_part
>             filterExpr: (key = 98) (type: boolean)
>             Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE 
> Column stats: NONE
>             Filter Operator
>               predicate: (key = 98) (type: boolean)
>               Statistics: Num rows: 250 Data size: 23500 Basic stats: 
> COMPLETE Column stats: NONE
>               Select Operator
>                 expressions: 98 (type: int), value (type: string), ds (type: 
> string)
>                 outputColumnNames: _col0, _col1, _col2
>                 Statistics: Num rows: 250 Data size: 23500 Basic stats: 
> COMPLETE Column stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col0 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col0 (type: int)
>                   Statistics: Num rows: 250 Data size: 23500 Basic stats: 
> COMPLETE Column stats: NONE
>                   value expressions: _col1 (type: string), _col2 (type: 
> string)
>           TableScan
>             alias: c
>             Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE 
> Column stats: NONE
>             Reduce Output Operator
>               key expressions: key (type: int)
>               sort order: +
>               Map-reduce partition columns: key (type: int)
>               Statistics: Num rows: 500 Data size: 47000 Basic stats: 
> COMPLETE Column stats: NONE
>               value expressions: value (type: string), ds (type: string)
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           keys:
>             0 _col0 (type: int)
>             1 _col0 (type: int)
>             2 key (type: int)
>           outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, 
> _col7, _col8
>           Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE 
> Column stats: NONE
>           File Output Operator
>             compressed: false
>             Statistics: Num rows: 1100 Data size: 103400 Basic stats: 
> COMPLETE Column stats: NONE
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {noformat}
> That plan is equivalent to the following query, which is different than the 
> original one:
> {noformat}
> select * from
>   (select * from tab where tab.key = 0)a
> full outer join
>   (select * from tab_part where tab_part.key = 98)b
> on a.key = b.key
> join
>   tab_part c
> on b.key = c.key;
> {noformat}
> It seems to be a problem in the recognition of join operations that can be 
> merged into a single multijoin operator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-12465) Hive might produce wrong results when (outer) joins are merged

Reply via email to