[ 
https://issues.apache.org/jira/browse/SPARK-18966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15783457#comment-15783457
 ] 

Nattavut Sutyanyong edited comment on SPARK-18966 at 12/28/16 6:40 PM:
-----------------------------------------------------------------------

Considering the following subquery:

{code}
select *
from   t1
where  a1 not in (select a2
                  from   t2
                  where  t2.b2 = t1.b1)
{code}

There are a number of scenarios to consider:

- 1. When the correlated predicate yields a match (i.e., T2.B2 = T1.B1)
   -- 1.1. When the NOT IN expression yields a match (i.e., T1.A1 = T2.A2)
   -- 1.2. When the NOT IN expression yields no match (i.e., T1.A1 = T2.A2 
returns false)
   -- 1.3. When T1.A1 is null
   -- 1.4. When T2.A2 is null
       --- 1.4.1. When T1.A1 is not null
       --- 1.4.2. When T1.A1 is null
- 2. When the correlated predicate yields no match (i.e., T2.B2 = T1.B1 is 
false or unknown)
   -- 2.1. When T2.B2 is null and T1.B1 is null
   -- 2.2. When T2.B2 is null and T1.B1 is not null
   -- 2.3. When the value of T1.B1 does not match any of T2.B2

{code}
T1.A1  T1.B1     T2.A2  T2.B2    
-----  -----     -----  -----
    1      1         1      1    (1.1)
    2      1                     (1.2)
 null      1                     (1.3)

    1      3      null      3    (1.4.1)
 null      3                     (1.4.2)

    1   null         1   null    (2.1)
 null      2                     (2.2 & 2.3)
{code}

We can divide the evaluation of the above correlated NOT IN subquery into 2 
groups:-

Group 1: The rows in T1 when there is a match from the correlated predicate 
(T1.B1 = T2.B2)

In this case, the result of the subquery is not empty and the semantics of the 
NOT IN depends solely on the evaluation of the equality comparison of the 
columns of NOT IN, i.e., A1 = A2, which says

# If T1.A1 is null, the row is filtered (1.3 and 1.4.2)
# If T1.A1 = T2.A2, the row is filtered (1.1)
# If T2.A2 is null, any rows of T1 in the same group (T1.B1 = T2.B2) is filtered
   (1.4.1 & 1.4.2)
# Otherwise, the row is qualified.

Hence, in this group, the result is the row from (1.2).

Group 2: The rows in T1 when there is no match from the correlated predicate 
(T1.B1 = T2.B2)

In this case, all the rows in T1, including the rows where T1.A1, are qualified 
because the subquery returns an empty set and by the semantics of the NOT IN, 
all rows from the parent side qualifies as the result set, that is, the rows 
from (2.1, 2.2 and 2.3).

In conclusion, the correct result set of the above query is

{code}
T1.A1  T1.B1
-----  -----
    2      1    (1.2)
    1   null    (2.1)
 null      2    (2.2 & 2.3)
{code}


was (Author: nsyca):
Considering the following subquery:

{code}
select *
from   t1
where  a1 not in (select a2
                  from   t2
                  where  t2.b2 = t1.b1)
{code}

There are a number of scenarios to consider:

- 1. When the correlated predicate yields a match (i.e., T2.B2 = T1.B1)
   -- 1.1. When the NOT IN expression yields a match
       (i.e., T1.A1 = T2.A2)
   -- 1.2. When the NOT IN expression yields no match
       (i.e., T1.A1 = T2.A2 returns false)
   -- 1.3. When T1.A1 is null
   -- 1.4. When T2.A2 is null
       --- 1.4.1. When T1.A1 is not null
       --- 1.4.2. When T1.A1 is null
- 2. When the correlated predicate yields no match (i.e., T2.B2 = T1.B1 is 
false or unknown)
   -- 2.1. When T2.B2 is null and T1.B1 is null
   -- 2.2. When T2.B2 is null and T1.B1 is not null
   -- 2.3. When the value of T1.B1 does not match any of T2.B2

{code}
T1.A1  T1.B1     T2.A2  T2.B2    
-----  -----     -----  -----
    1      1         1      1    (1.1)
    2      1                     (1.2)
 null      1                     (1.3)

    1      3      null      3    (1.4.1)
 null      3                     (1.4.2)

    1   null         1   null    (2.1)
 null      2                     (2.2 & 2.3)
{code}

We can divide the evaluation of the above correlated NOT IN subquery into 2 
groups:-

Group 1: The rows in T1 when there is a match from the correlated predicate 
(T1.B1 = T2.B2)

In this case, the result of the subquery is not empty and the semantics of the 
NOT IN depends solely on the evaluation of the equality comparison of the 
columns of NOT IN, i.e., A1 = A2, which says

# If T1.A1 is null, the row is filtered (1.3 and 1.4.2)
# If T1.A1 = T2.A2, the row is filtered (1.1)
# If T2.A2 is null, any rows of T1 in the same group (T1.B1 = T2.B2) is filtered
   (1.4.1 & 1.4.2)
# Otherwise, the row is qualified.

Hence, in this group, the result is the row from (1.2).

Group 2: The rows in T1 when there is no match from the correlated predicate 
(T1.B1 = T2.B2)

In this case, all the rows in T1, including the rows where T1.A1, are qualified 
because the subquery returns an empty set and by the semantics of the NOT IN, 
all rows from the parent side qualifies as the result set, that is, the rows 
from (2.1, 2.2 and 2.3).

In conclusion, the correct result set of the above query is

{code}
T1.A1  T1.B1
-----  -----
    2      1    (1.2)
    1   null    (2.1)
 null      2    (2.2 & 2.3)
{code}

> NOT IN subquery with correlated expressions may return incorrect result
> -----------------------------------------------------------------------
>
>                 Key: SPARK-18966
>                 URL: https://issues.apache.org/jira/browse/SPARK-18966
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>              Labels: correctness
>
> {code}
> Seq((1, 2)).toDF("a1", "b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer, java.lang.Integer)]((1, null)).toDF("a2", 
> "b2").createOrReplaceTempView("t2")
> // The expected result is 1 row of (1,2) as shown in the next statement.
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = b1)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = 2)").show
> +---+---+
> | a1| b1|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> The two SQL statements above should return the same result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to