[
https://issues.apache.org/jira/browse/DRILL-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622897#comment-14622897
]
Jinfeng Ni commented on DRILL-2543:
-----------------------------------
After checking the SQL specification, I feel that result on Drill side might be
wrong.
The query comes down to this basic predicate:
Value NOT IN (Empty_Set)
(For the third row, value = NULL, and the correlated subquery would have
empty_set).
Section 8.4 in SQL standard 2011:
{code}
4) The expression
RVC NOT IN IPV is equivalent to
NOT ( RVC IN IPV )
5) The expression
RVC IN IPV is equivalent to
RVC = ANY IPV
{code}
Section 8.9
{code}
<quantified comparison predicate> ::=
<row value predicand> <quantified comparison predicate part 2>
<quantified comparison predicate part 2> ::=
<comp op> <quantifier> <table subquery>
<quantifier> ::=
<all>
| <some>
<all> ::= ALL
<some> ::=
SOME
| ANY
1)
Let R be the result of the <row value predicand> and let T be the result of the
<table subquery>.
2)
The result of “R <comp op> <quantifier> T” is derived by the application of the
implied <comparison
predicate> “R <comp op> RT” to every row RT in T.
Case:
a) If T is empty or if the implied <comparison predicate> is True for every row
RT in T, then “R <comp op> <all> T” is True.
b) If the implied <comparison predicate> is False for at least one row RT in T,
then “R <comp op> <all> T” is False.
c) If the implied <comparison predicate> is True for at least one row RT in T,
then “R <comp op> <some> T” is True.
d) If T is empty or if the implied <comparison predicate> is False for every
row RT in T, then “R <comp op> <some> T” is False.
e) If “R <comp op> <quantifier> T” is neither True nor False, then it is
Unknown.
{code}
Now, let's see how expression "Value NOT IN (Empty_Set)" is evaluated, using
the SQL specification rule list above:
{code}
Value NOT IN (Empty_Set)
==> NOT (Value in Empty_Set) (Section 8.4 rule 4)
==> NOT ( Value = ANY Empty_Set) (Secton 8.4 rule 5)
==> NOT (FALSE) (Section 8.9 rule rule 2.d, since T
is empty).
==> TRUE
{code}
Therefore, regardless whether right hand side is NULL or not, as long as the
LHS of NOT IN is empty_set, the NOT IN predicate should return TRUE. That's
why the 3 row should be included in the query result.
[~vicky], please let me know if the above reasoning makes sense or not.
> Correlated subquery where outer table contains NULL values returns seemingly
> wrong result
> ------------------------------------------------------------------------------------------
>
> Key: DRILL-2543
> URL: https://issues.apache.org/jira/browse/DRILL-2543
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 0.8.0
> Reporter: Victoria Markman
> Assignee: Jinfeng Ni
> Fix For: 1.3.0
>
>
> {code}
> 0: jdbc:drill:schema=dfs> select * from t1;
> +------------+------------+------------+
> | a1 | b1 | c1 |
> +------------+------------+------------+
> | 1 | 2015-03-01 | aaaaa |
> | 2 | 2015-03-02 | bbbbb |
> | null | null | null |
> +------------+------------+------------+
> 3 rows selected (0.064 seconds)
> 0: jdbc:drill:schema=dfs> select * from t2;
> +------------+------------+------------+
> | a2 | b2 | c2 |
> +------------+------------+------------+
> | 5 | 2017-03-01 | a |
> +------------+------------+------------+
> 1 row selected (0.07 seconds)
> 0: jdbc:drill:schema=dfs> select t1.c1, count(*) from t1 where t1.b1 not in
> (select b2 from t2 where t1.a1 = t2.a2) group by t1.c1 order by t1.c1;
> +------------+------------+
> | c1 | EXPR$1 |
> +------------+------------+
> | aaaaa | 1 |
> | bbbbb | 1 |
> +------------+------------+
> 2 rows selected (0.32 seconds)
> {code}
> Postgres returns row from the outer table where a1 is null.
> This is part that I don't understand, because join condition in the subquery
> should have eliminated row where a1 IS NULL. To me Drill result looks
> correct. Unless there is something different in correlated comparison
> semantics that I'm not aware of.
> {code}
> postgres=# select * from t1;
> a1 | b1 | c1
> ----+------------+-------
> 1 | 2015-03-01 | aaaaa
> 2 | 2015-03-02 | bbbbb
> | |
> (3 rows)
> {code}
> Explain plan for the query:
> {code}
> 00-01 Project(c1=[$0], EXPR$1=[$1])
> 00-02 StreamAgg(group=[{0}], EXPR$1=[COUNT()])
> 00-03 Sort(sort0=[$0], dir0=[ASC])
> 00-04 Project(c1=[$0])
> 00-05 SelectionVectorRemover
> 00-06 Filter(condition=[NOT(IS TRUE($3))])
> 00-07 HashJoin(condition=[=($1, $2)], joinType=[left])
> 00-09 Project($f1=[$0], $f3=[$2])
> 00-11 SelectionVectorRemover
> 00-13 Filter(condition=[IS NOT NULL($1)])
> 00-15 Project(c1=[$1], b1=[$0], a1=[$2])
> 00-17 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1,
> numFiles=1, columns=[`c1`, `b1`, `a1`]]])
> 00-08 Project($f02=[$1], $f2=[$2])
> 00-10 StreamAgg(group=[{0, 1}], agg#0=[MIN($2)])
> 00-12 Sort(sort0=[$0], sort1=[$1], dir0=[ASC],
> dir1=[ASC])
> 00-14 Project($f0=[$1], $f02=[$2], $f1=[true])
> 00-16 HashJoin(condition=[=($2, $0)],
> joinType=[inner])
> 00-18 StreamAgg(group=[{0}])
> 00-20 Sort(sort0=[$0], dir0=[ASC])
> 00-22 Project($f0=[$1])
> 00-23 SelectionVectorRemover
> 00-24 Filter(condition=[IS NOT NULL($0)])
> 00-25 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1,
> numFiles=1, columns=[`b1`, `a1`]]])
> 00-19 Project(a2=[$1], b2=[$0])
> 00-21 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2,
> numFiles=1, columns=[`a2`, `b2`]]])
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)