[ https://issues.apache.org/jira/browse/DRILL-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622897#comment-14622897 ]
Jinfeng Ni commented on DRILL-2543: ----------------------------------- After checking the SQL specification, I feel that result on Drill side might be wrong. The query comes down to this basic predicate: Value NOT IN (Empty_Set) (For the third row, value = NULL, and the correlated subquery would have empty_set). Section 8.4 in SQL standard 2011: {code} 4) The expression RVC NOT IN IPV is equivalent to NOT ( RVC IN IPV ) 5) The expression RVC IN IPV is equivalent to RVC = ANY IPV {code} Section 8.9 {code} <quantified comparison predicate> ::= <row value predicand> <quantified comparison predicate part 2> <quantified comparison predicate part 2> ::= <comp op> <quantifier> <table subquery> <quantifier> ::= <all> | <some> <all> ::= ALL <some> ::= SOME | ANY 1) Let R be the result of the <row value predicand> and let T be the result of the <table subquery>. 2) The result of “R <comp op> <quantifier> T” is derived by the application of the implied <comparison predicate> “R <comp op> RT” to every row RT in T. Case: a) If T is empty or if the implied <comparison predicate> is True for every row RT in T, then “R <comp op> <all> T” is True. b) If the implied <comparison predicate> is False for at least one row RT in T, then “R <comp op> <all> T” is False. c) If the implied <comparison predicate> is True for at least one row RT in T, then “R <comp op> <some> T” is True. d) If T is empty or if the implied <comparison predicate> is False for every row RT in T, then “R <comp op> <some> T” is False. e) If “R <comp op> <quantifier> T” is neither True nor False, then it is Unknown. {code} Now, let's see how expression "Value NOT IN (Empty_Set)" is evaluated, using the SQL specification rule list above: {code} Value NOT IN (Empty_Set) ==> NOT (Value in Empty_Set) (Section 8.4 rule 4) ==> NOT ( Value = ANY Empty_Set) (Secton 8.4 rule 5) ==> NOT (FALSE) (Section 8.9 rule rule 2.d, since T is empty). ==> TRUE {code} Therefore, regardless whether right hand side is NULL or not, as long as the LHS of NOT IN is empty_set, the NOT IN predicate should return TRUE. That's why the 3 row should be included in the query result. [~vicky], please let me know if the above reasoning makes sense or not. > Correlated subquery where outer table contains NULL values returns seemingly > wrong result > ------------------------------------------------------------------------------------------ > > Key: DRILL-2543 > URL: https://issues.apache.org/jira/browse/DRILL-2543 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization > Affects Versions: 0.8.0 > Reporter: Victoria Markman > Assignee: Jinfeng Ni > Fix For: 1.3.0 > > > {code} > 0: jdbc:drill:schema=dfs> select * from t1; > +------------+------------+------------+ > | a1 | b1 | c1 | > +------------+------------+------------+ > | 1 | 2015-03-01 | aaaaa | > | 2 | 2015-03-02 | bbbbb | > | null | null | null | > +------------+------------+------------+ > 3 rows selected (0.064 seconds) > 0: jdbc:drill:schema=dfs> select * from t2; > +------------+------------+------------+ > | a2 | b2 | c2 | > +------------+------------+------------+ > | 5 | 2017-03-01 | a | > +------------+------------+------------+ > 1 row selected (0.07 seconds) > 0: jdbc:drill:schema=dfs> select t1.c1, count(*) from t1 where t1.b1 not in > (select b2 from t2 where t1.a1 = t2.a2) group by t1.c1 order by t1.c1; > +------------+------------+ > | c1 | EXPR$1 | > +------------+------------+ > | aaaaa | 1 | > | bbbbb | 1 | > +------------+------------+ > 2 rows selected (0.32 seconds) > {code} > Postgres returns row from the outer table where a1 is null. > This is part that I don't understand, because join condition in the subquery > should have eliminated row where a1 IS NULL. To me Drill result looks > correct. Unless there is something different in correlated comparison > semantics that I'm not aware of. > {code} > postgres=# select * from t1; > a1 | b1 | c1 > ----+------------+------- > 1 | 2015-03-01 | aaaaa > 2 | 2015-03-02 | bbbbb > | | > (3 rows) > {code} > Explain plan for the query: > {code} > 00-01 Project(c1=[$0], EXPR$1=[$1]) > 00-02 StreamAgg(group=[{0}], EXPR$1=[COUNT()]) > 00-03 Sort(sort0=[$0], dir0=[ASC]) > 00-04 Project(c1=[$0]) > 00-05 SelectionVectorRemover > 00-06 Filter(condition=[NOT(IS TRUE($3))]) > 00-07 HashJoin(condition=[=($1, $2)], joinType=[left]) > 00-09 Project($f1=[$0], $f3=[$2]) > 00-11 SelectionVectorRemover > 00-13 Filter(condition=[IS NOT NULL($1)]) > 00-15 Project(c1=[$1], b1=[$0], a1=[$2]) > 00-17 Scan(groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, > numFiles=1, columns=[`c1`, `b1`, `a1`]]]) > 00-08 Project($f02=[$1], $f2=[$2]) > 00-10 StreamAgg(group=[{0, 1}], agg#0=[MIN($2)]) > 00-12 Sort(sort0=[$0], sort1=[$1], dir0=[ASC], > dir1=[ASC]) > 00-14 Project($f0=[$1], $f02=[$2], $f1=[true]) > 00-16 HashJoin(condition=[=($2, $0)], > joinType=[inner]) > 00-18 StreamAgg(group=[{0}]) > 00-20 Sort(sort0=[$0], dir0=[ASC]) > 00-22 Project($f0=[$1]) > 00-23 SelectionVectorRemover > 00-24 Filter(condition=[IS NOT NULL($0)]) > 00-25 Scan(groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, > numFiles=1, columns=[`b1`, `a1`]]]) > 00-19 Project(a2=[$1], b2=[$0]) > 00-21 Scan(groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, > numFiles=1, columns=[`a2`, `b2`]]]) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)