[ 
https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44431:
------------------------------
    Description: 
{{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate 
to false. (The reason it should be false is because a IN (b1, b2) is defined as 
a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is 
false. This is specified by ANSI SQL.)

Many places in Spark execution (In, InSet, InSubquery) and optimization 
(OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
the Spark behavior for the null IN (empty list) is inconsistent in some places 
- literal IN lists generally return null (incorrect), while IN/NOT IN 
subqueries mostly return false/true, respectively (correct) in this case.

This is a longstanding correctness issue which has existed since null support 
for IN expressions was first added to Spark.

Doc with more details: 
[https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
 

  was:
{{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate 
to false. (The reason it should be false is because a IN (b1, b2) is defined as 
a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is 
false. This is specified by ANSI SQL.)

Many places in Spark execution (In, InSet, InSubquery) and optimization 
(OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
the Spark behavior for the null IN (empty list) is inconsistent in some places 
- literal IN lists generally return null (incorrect), while IN/NOT IN 
subqueries mostly return false/true, respectively (correct) in this case.

This is a longstanding correctness issue which has existed since null support 
for IN expressions was first added to Spark.

Doc with more details: 
https://docs.google.com/document/d/15ttcB3OjGx5_WFKHB2COjQUbFHj5LrfNQv_d26o-wmI/edit


> Wrong semantics for null IN (empty list)
> ----------------------------------------
>
>                 Key: SPARK-44431
>                 URL: https://issues.apache.org/jira/browse/SPARK-44431
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Jack Chen
>            Priority: Major
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (The reason it should be false is because a IN (b1, b2) is 
> defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR 
> which is false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
> the Spark behavior for the null IN (empty list) is inconsistent in some 
> places - literal IN lists generally return null (incorrect), while IN/NOT IN 
> subqueries mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue which has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to