[jira] [Commented] (DRILL-2543) Correlated subquery where outer table contains NULL values returns seemingly wrong result

Jinfeng Ni (JIRA) Fri, 10 Jul 2015 14:15:19 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622897#comment-14622897
 ]


Jinfeng Ni commented on DRILL-2543:
-----------------------------------

After checking the SQL specification, I feel that result on Drill side might be 
wrong. 

The query comes down to this basic predicate:

 Value NOT IN  (Empty_Set)

(For the third row, value = NULL, and the correlated subquery would have 
empty_set).

Section 8.4 in SQL standard 2011:

{code}
4) The expression
RVC NOT IN IPV is equivalent to
NOT ( RVC IN IPV )

5) The expression
RVC IN IPV is equivalent to

RVC = ANY IPV
{code}

Section 8.9 
{code}
<quantified comparison predicate> ::=
  <row value predicand> <quantified comparison predicate part 2>
<quantified comparison predicate part 2> ::=
  <comp op> <quantifier> <table subquery>
<quantifier> ::=
    <all>
| <some>
<all> ::= ALL
<some> ::=
    SOME
| ANY


1) 
Let R be the result of the <row value predicand> and let T be the result of the 
<table subquery>.
2)
The result of “R <comp op> <quantifier> T” is derived by the application of the 
implied <comparison
predicate> “R <comp op> RT” to every row RT in T.
Case:
a) If T is empty or if the implied <comparison predicate> is True for every row 
RT in T, then “R <comp op> <all> T” is True.
b) If the implied <comparison predicate> is False for at least one row RT in T, 
then “R <comp op> <all> T” is False.
c) If the implied <comparison predicate> is True for at least one row RT in T, 
then “R <comp op> <some> T” is True.
d) If T is empty or if the implied <comparison predicate> is False for every 
row RT in T, then “R <comp op> <some> T” is False.
e) If “R <comp op> <quantifier> T” is neither True nor False, then it is 
Unknown.
{code}

Now, let's see how expression "Value NOT IN  (Empty_Set)" is evaluated, using 
the SQL specification rule list above:
{code}
Value NOT IN  (Empty_Set)
==>  NOT (Value in Empty_Set)      (Section 8.4 rule 4)
==>  NOT  ( Value = ANY Empty_Set)  (Secton 8.4 rule 5)
==>  NOT (FALSE)                          (Section 8.9 rule rule 2.d, since T 
is empty).
==> TRUE
{code}

Therefore, regardless whether right hand side is NULL or not, as long as the 
LHS of NOT IN is empty_set, the NOT IN predicate should return TRUE.  That's 
why the 3 row should be included in the query result.

[~vicky], please let me know if the above reasoning makes sense or not. 






> Correlated subquery where outer table contains NULL values returns  seemingly 
> wrong result
> ------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2543
>                 URL: https://issues.apache.org/jira/browse/DRILL-2543
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.8.0
>            Reporter: Victoria Markman
>            Assignee: Jinfeng Ni
>             Fix For: 1.3.0
>
>
> {code}
> 0: jdbc:drill:schema=dfs> select * from t1;
> +------------+------------+------------+
> |     a1     |     b1     |     c1     |
> +------------+------------+------------+
> | 1          | 2015-03-01 | aaaaa      |
> | 2          | 2015-03-02 | bbbbb      |
> | null       | null       | null       |
> +------------+------------+------------+
> 3 rows selected (0.064 seconds)
> 0: jdbc:drill:schema=dfs> select * from t2;
> +------------+------------+------------+
> |     a2     |     b2     |     c2     |
> +------------+------------+------------+
> | 5          | 2017-03-01 | a          |
> +------------+------------+------------+
> 1 row selected (0.07 seconds)
> 0: jdbc:drill:schema=dfs> select t1.c1, count(*) from t1 where t1.b1 not in 
> (select b2 from t2 where t1.a1 = t2.a2) group by t1.c1 order by t1.c1;
> +------------+------------+
> |     c1     |   EXPR$1   |
> +------------+------------+
> | aaaaa      | 1          |
> | bbbbb      | 1          |
> +------------+------------+
> 2 rows selected (0.32 seconds)
> {code}
> Postgres returns row from the outer table where a1 is null.
> This is part that I don't understand, because join condition in the subquery 
> should have eliminated row where a1 IS NULL. To me Drill result looks 
> correct. Unless there is something different in correlated comparison 
> semantics that I'm not aware of.
> {code}
> postgres=# select * from t1;
>  a1 |     b1     |  c1
> ----+------------+-------
>   1 | 2015-03-01 | aaaaa
>   2 | 2015-03-02 | bbbbb
>     |            |
> (3 rows)
> {code}
> Explain plan for the query:
> {code}
> 00-01      Project(c1=[$0], EXPR$1=[$1])
> 00-02        StreamAgg(group=[{0}], EXPR$1=[COUNT()])
> 00-03          Sort(sort0=[$0], dir0=[ASC])
> 00-04            Project(c1=[$0])
> 00-05              SelectionVectorRemover
> 00-06                Filter(condition=[NOT(IS TRUE($3))])
> 00-07                  HashJoin(condition=[=($1, $2)], joinType=[left])
> 00-09                    Project($f1=[$0], $f3=[$2])
> 00-11                      SelectionVectorRemover
> 00-13                        Filter(condition=[IS NOT NULL($1)])
> 00-15                          Project(c1=[$1], b1=[$0], a1=[$2])
> 00-17                            Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, 
> numFiles=1, columns=[`c1`, `b1`, `a1`]]])
> 00-08                    Project($f02=[$1], $f2=[$2])
> 00-10                      StreamAgg(group=[{0, 1}], agg#0=[MIN($2)])
> 00-12                        Sort(sort0=[$0], sort1=[$1], dir0=[ASC], 
> dir1=[ASC])
> 00-14                          Project($f0=[$1], $f02=[$2], $f1=[true])
> 00-16                            HashJoin(condition=[=($2, $0)], 
> joinType=[inner])
> 00-18                              StreamAgg(group=[{0}])
> 00-20                                Sort(sort0=[$0], dir0=[ASC])
> 00-22                                  Project($f0=[$1])
> 00-23                                    SelectionVectorRemover
> 00-24                                      Filter(condition=[IS NOT NULL($0)])
> 00-25                                        Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, 
> numFiles=1, columns=[`b1`, `a1`]]])
> 00-19                              Project(a2=[$1], b2=[$0])
> 00-21                                Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, 
> numFiles=1, columns=[`a2`, `b2`]]])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (DRILL-2543) Correlated subquery where outer table contains NULL values returns seemingly wrong result

Reply via email to