James Norvell created HIVE-21322: ------------------------------------ Summary: Multiple table LEFT OUTER JOIN results are incorrect when 'is not null' used in WHERE clause. Key: HIVE-21322 URL: https://issues.apache.org/jira/browse/HIVE-21322 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 2.3.4 Environment: Hive 2.3.4 (emr-5.21.0) Reporter: James Norvell Attachments: explain-plans.txt
Reproduction: Create tables: {code:java} create table a (id string); insert into a values (1),(2),(3); create table b (id string, name string); insert into b values (1,'a'),(2,'b'),(3,null); create table c (id string); insert into c values (11),(22),(33); {code} When joining a -> b -> c on id, the following query is correct: {code:java} select a.id, b.name from a left outer join b on a.id = b.id left outer join c on a.id = c.id where b.name is not null; OK 1 a 2 b Time taken: 10.231 seconds, Fetched: 2 row(s) {code} Switching the join order from a -> c -> b results in incorrect results: {code:java} select a.id, b.name from a left outer join c on a.id = c.id left outer join b on a.id = b.id where b.name is not null; OK 2 b Time taken: 10.321 seconds, Fetched: 1 row(s) {code} Disabling hive.cbo.enable or changing execution engine to mr avoids the issue: {code:java} set hive.cbo.enable=false; select a.id, b.name from a left outer join c on a.id = c.id left outer join b on a.id = b.id where b.name is not null; OK 1 a 2 b Time taken: 9.614 seconds, Fetched: 2 row(s) set hive.cbo.enable=true; set hive.execution.engine=mr; select a.id, b.name from a left outer join c on a.id = c.id left outer join b on a.id = b.id where b.name is not null; OK 1 a 2 b Time taken: 29.377 seconds, Fetched: 2 row(s) {code} Issue doesn't occur when using 'is null': {code:java} select a.id, b.name from a left outer join c on a.id = c.id left outer join b on a.id = b.id where b.name is null; OK 3 NULL Time taken: 9.673 seconds, Fetched: 1 row(s) {code} Explain plans for queries attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)