[jira] [Comment Edited] (HIVE-24957) Wrong results when subquery has COALESCE in correlation predicate

Stamatis Zampetakis (Jira) Tue, 30 Mar 2021 06:22:06 -0700


    [ https://issues.apache.org/jira/browse/HIVE-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311521#comment-17311521 ]


Stamatis Zampetakis edited comment on HIVE-24957 at 3/30/21, 1:21 PM:
----------------------------------------------------------------------

The problem lies in the query plan and more specifically in the 
{{HiveRelDecorrelator}}. 
{noformat}
2021-03-30T06:16:57,682 DEBUG [d8fca83a-1e2a-4864-8730-f496318a0e47 main] rules.RelFieldTrimmer: Plan after trimming unused fields
HiveProject(b_title=[$0])
  HiveFilter(condition=[EXISTS({
HiveProject(a_authorkey=[$0])
  HiveFilter(condition=[=(CASE(IS NOT NULL($cor0.b_authorkey), $cor0.b_authorkey, 300), $0)])
    HiveTableScan(table=[[default, author]], table:alias=[a])
})])
    HiveProject(b_title=[$1], b_authorkey=[$2])
      HiveTableScan(table=[[default, book]], table:alias=[b])

2021-03-30T06:16:57,682 DEBUG [d8fca83a-1e2a-4864-8730-f496318a0e47 main] parse.CalcitePlanner: Plan before removing subquery:
HiveProject(b_title=[$1])
  HiveFilter(condition=[EXISTS({
HiveProject(a_authorkey=[$0])
  HiveFilter(condition=[=(CASE(IS NOT NULL($cor0.b_authorkey), $cor0.b_authorkey, 300), $0)])
    HiveTableScan(table=[[default, author]], table:alias=[a])
})])
    HiveTableScan(table=[[default, book]], table:alias=[b])

2021-03-30T06:16:57,690 DEBUG [d8fca83a-1e2a-4864-8730-f496318a0e47 main] parse.CalcitePlanner: Plan just after removing subquery:
HiveProject(b_title=[$1])
  LogicalCorrelate(correlation=[$cor0], joinType=[semi], requiredColumns=[{2}])
    HiveTableScan(table=[[default, book]], table:alias=[b])
    HiveProject(literalTrue=[true])
      HiveProject(a_authorkey=[$0])
        HiveFilter(condition=[=(CASE(IS NOT NULL($cor0.b_authorkey), $cor0.b_authorkey, 300), $0)])
          HiveTableScan(table=[[default, author]], table:alias=[a])

2021-03-30T06:16:57,796 DEBUG [d8fca83a-1e2a-4864-8730-f496318a0e47 main] parse.CalcitePlanner: Plan after decorrelation:
HiveProject(b_title=[$1])
  HiveSemiJoin(condition=[=($8, $2)], joinType=[semi])
    HiveTableScan(table=[[default, book]], table:alias=[b])
    HiveProject(literalTrue=[true], b_authorkey=[$1])
      HiveProject(a_authorkey=[$0], b_authorkey=[$6])
        HiveJoin(condition=[=(CASE(IS NOT NULL($6), $6, 300), $0)], joinType=[inner], algorithm=[none], cost=[not available])
          HiveTableScan(table=[[default, author]], table:alias=[a])
          HiveAggregate(group=[{0}])
            HiveProject(b_authorkey=[$2])
              HiveTableScan(table=[[default, book]], table:alias=[b])
{noformat}
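The problem starts with the introduction of the {{HiveSemiJoin}}. Its 
condition {{=($8, $2)}} compares {{b_authorkey}} on both sides with a plain 
equality, which never evaluates to true for NULL, so books with NULL 
{{b_authorkey}} are removed from the result set. 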
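
To make the NULL handling concrete, here is a minimal sketch that replays the 
example from the issue description. It assumes SQLite (through Python's 
{{sqlite3}} module) as a stand-in for Hive, and the decorrelated query is a 
hand-written approximation of the plan above, not something Hive emits. The 
names {{titles}}, {{CORRELATED}}, {{SEMI_JOIN_TEMPLATE}} and the alias {{v}} 
are hypothetical. The semi join is written as EXISTS because SQLite has no 
SEMI JOIN syntax.

```python
import sqlite3

# In-memory database loaded with the data from the issue description.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table author (a_authorkey int, a_name varchar(50));
    create table book (b_bookkey int, b_title varchar(50), b_authorkey int);
    insert into author values (10, 'Victor Hugo');
    insert into author values (20, 'Alexandre Dumas');
    insert into author values (300, 'UNKNOWN');
    insert into book values (1, 'Les Miserables', 10);
    insert into book values (2, 'The Count of Monte Cristo', 20);
    insert into book values (3, 'Men Without Women', 30);
    insert into book values (4, 'Odyssey', null);
""")


def titles(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# The original correlated query (ORDER BY added only for stable output).
CORRELATED = """
    select b.b_title
    from book b
    where exists (select a_authorkey
                  from author a
                  where coalesce(b.b_authorkey, 300) = a.a_authorkey)
    order by b.b_bookkey
"""

# Approximation of the plan after decorrelation (an assumption, see above):
# v plays the role of HiveAggregate(group=[{0}]) over book.b_authorkey,
# the inner join carries the CASE/COALESCE condition, and the outer
# comparison {op} plays the role of the semi join condition =($8, $2).
SEMI_JOIN_TEMPLATE = """
    select b.b_title
    from book b
    where exists (
        select 1
        from author a
        join (select b_authorkey from book group by b_authorkey) v
          on coalesce(v.b_authorkey, 300) = a.a_authorkey
        where v.b_authorkey {op} b.b_authorkey
    )
    order by b.b_bookkey
"""

print(titles(CORRELATED))
# ['Les Miserables', 'The Count of Monte Cristo', 'Odyssey']

# Plain equality: NULL = NULL is not true, so Odyssey is dropped.
print(titles(SEMI_JOIN_TEMPLATE.format(op="=")))
# ['Les Miserables', 'The Count of Monte Cristo']

# Null-safe comparison (IS in SQLite): Odyssey is kept.
print(titles(SEMI_JOIN_TEMPLATE.format(op="is")))
# ['Les Miserables', 'The Count of Monte Cristo', 'Odyssey']
```

The last query only illustrates that a null-safe comparison on the 
correlated column restores the expected rows in this sketch. Whether that is 
the right change for {{HiveRelDecorrelator}} is not established here.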


was (Author: zabetak):
The problem lies in the query plan and more specifically in the 
{{HiveRelDecorrelator}}. 
{noformat}
2021-03-30T06:07:50,279 DEBUG [348e355c-ca0e-4fc6-b386-1852a35a7f29 main] rules.RelFieldTrimmer: Plan after trimming unused fields
HiveProject(b_title=[$0])
  HiveFilter(condition=[EXISTS({
HiveProject(_o__c0=[1])
  HiveFilter(condition=[=(CASE(IS NOT NULL($cor0.b_authorkey), $cor0.b_authorkey, 300), $0)])
    HiveTableScan(table=[[default, author]], table:alias=[a])
})])
    HiveProject(b_title=[$1], b_authorkey=[$2])
      HiveTableScan(table=[[default, book]], table:alias=[b])

2021-03-30T06:07:50,279 DEBUG [348e355c-ca0e-4fc6-b386-1852a35a7f29 main] parse.CalcitePlanner: Plan before removing subquery:
HiveProject(b_title=[$1])
  HiveFilter(condition=[EXISTS({
HiveProject(_o__c0=[1])
  HiveFilter(condition=[=(CASE(IS NOT NULL($cor0.b_authorkey), $cor0.b_authorkey, 300), $0)])
    HiveTableScan(table=[[default, author]], table:alias=[a])
})])
    HiveTableScan(table=[[default, book]], table:alias=[b])

2021-03-30T06:07:50,280 DEBUG [348e355c-ca0e-4fc6-b386-1852a35a7f29 main] parse.CalcitePlanner: Plan just after removing subquery:
HiveProject(b_title=[$1])
  LogicalCorrelate(correlation=[$cor0], joinType=[semi], requiredColumns=[{2}])
    HiveTableScan(table=[[default, book]], table:alias=[b])
    HiveProject(literalTrue=[true])
      HiveProject(_o__c0=[1])
        HiveFilter(condition=[=(CASE(IS NOT NULL($cor0.b_authorkey), $cor0.b_authorkey, 300), $0)])
          HiveTableScan(table=[[default, author]], table:alias=[a])

2021-03-30T06:07:50,282 DEBUG [348e355c-ca0e-4fc6-b386-1852a35a7f29 main] parse.CalcitePlanner: Plan after decorrelation:
HiveProject(b_title=[$1])
  HiveSemiJoin(condition=[=($8, $2)], joinType=[semi])
    HiveTableScan(table=[[default, book]], table:alias=[b])
    HiveProject(literalTrue=[true], b_authorkey=[$1])
      HiveProject(_o__c0=[1], b_authorkey=[$6])
        HiveJoin(condition=[=(CASE(IS NOT NULL($6), $6, 300), $0)], joinType=[inner], algorithm=[none], cost=[not available])
          HiveTableScan(table=[[default, author]], table:alias=[a])
          HiveAggregate(group=[{0}])
            HiveProject(b_authorkey=[$2])
              HiveTableScan(table=[[default, book]], table:alias=[b])
{noformat}
The problem starts with the introduction of the {{HiveSemiJoin}}. Due to that, 
books with NULL {{b_authorkey}} are removed from the result set. 

> Wrong results when subquery has COALESCE in correlation predicate
> -----------------------------------------------------------------
>
>                 Key: HIVE-24957
>                 URL: https://issues.apache.org/jira/browse/HIVE-24957
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>
> Consider the following example:
> {code:sql}
> create table author (
>     a_authorkey   int,
>     a_name varchar(50));
> create table book (
>     b_bookkey   int,
>     b_title varchar(50),
>     b_authorkey int);
> insert into author values (10, 'Victor Hugo');
> insert into author values (20, 'Alexandre Dumas');
> insert into author values (300, 'UNKNOWN');
> insert into book values (1, 'Les Miserables', 10);
> insert into book values (2, 'The Count of Monte Cristo', 20);
> insert into book values (3, 'Men Without Women', 30);
> insert into book values (4, 'Odyssey', null);
> select b.b_title
> from book b
> where exists
>           (select a_authorkey
>            from author a
>            where coalesce(b.b_authorkey, 300) = a.a_authorkey);
> {code}
> *Expected results*
> ||B_TITLE||
> |Les Miserables|
> |The Count of Monte Cristo|
> |Odyssey|
> *Actual results*
> ||B_TITLE||
> |Les Miserables|
> |The Count of Monte Cristo|
> {{Odyssey}} is missing from the result set, although it should be there: 
> COALESCE maps its NULL {{b_authorkey}} to 300, which matches the UNKNOWN 
> author.



