[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-26366:
--------------------------------
    Comment: was deleted

(was: mgaido91 opened a new pull request #23315: [SPARK-26366][SQL] 
ReplaceExceptWithFilter should consider NULL as False
URL: https://github.com/apache/spark/pull/23315
 
 
   ## What changes were proposed in this pull request?
   
   In `ReplaceExceptWithFilter` we do not consider the case in which the 
condition evaluates to NULL. Since negating NULL still yields NULL, the 
assumption that the negated condition returns all the rows which did not 
satisfy the original one does not hold: rows for which the condition 
evaluates to NULL are returned by neither filter.
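
   As a minimal illustration of the three-valued logic involved (a hedged 
sketch, assuming a running `spark` session as in spark-shell; the names are 
illustrative, not from the patch):

{code:java}
import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq(("a", Some(1)), ("b", Option.empty[Int])).toDF("k", "v")

// For the ("b", null) row the predicate evaluates to NULL, and NOT NULL is
// still NULL, so that row is dropped by BOTH filters below.
df.filter(col("v") > 0).show()     // keeps only ("a", 1)
df.filter(!(col("v") > 0)).show()  // drops ("b", null) as well
{code}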
   
   The PR fixes this by treating a NULL condition as False when negating it. 
In this way the negated filter returns exactly the rows which did not 
satisfy the original condition.
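
   The idea can be sketched at the DataFrame level (illustrative only, not 
the actual Catalyst patch): coalesce the condition to false before negating 
it, so rows where the condition is NULL survive the negated filter.

{code:java}
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Reusing df from the sketch above: NOT coalesce(NULL, false) is
// NOT false, i.e. true, so the ("b", null) row is now kept.
df.filter(!coalesce(col("v") > 0, lit(false))).show()
{code}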
   
   ## How was this patch tested?
   
   Added UTs.
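
   For illustration, the expected behaviour could be asserted like this 
(a hypothetical sketch, not the actual UT from the patch):

{code:java}
// Rows whose filter condition evaluates to NULL must still appear in
// the except result (df and col as in the sketches above).
val result = df.except(df.filter(col("v") > 0))
assert(result.collect().map(_.getString(0)).toSet == Set("b"))
{code}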

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
)

> Except with transform regression
> --------------------------------
>
>                 Key: SPARK-26366
>                 URL: https://issues.apache.org/jira/browse/SPARK-26366
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.2
>            Reporter: Dan Osipov
>            Assignee: Marco Gaido
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> There appears to be a regression between Spark 2.2 and 2.3. Below is the code 
> to reproduce it:
>  
> {code:java}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> val inputDF = spark.createDataFrame(
>   spark.sparkContext.parallelize(Seq(
>     Row("0", "john", "smith", "j...@smith.com"),
>     Row("1", "jane", "doe", "j...@doe.com"),
>     Row("2", "apache", "spark", "sp...@apache.org"),
>     Row("3", "foo", "bar", null) // row with a NULL email
>   )),
>   StructType(List(
>     StructField("id", StringType, nullable = true),
>     StructField("first_name", StringType, nullable = true),
>     StructField("last_name", StringType, nullable = true),
>     StructField("email", StringType, nullable = true)
>   ))
> )
>
> // For the id=3 row the filter condition evaluates to NULL: email is NULL,
> // so the isin test, and with it the whole OR, is NULL rather than false.
> val exceptDF = inputDF.transform(toProcessDF =>
>   toProcessDF.filter(
>     (col("first_name").isin(Seq("john", "jane"): _*)
>       and col("last_name").isin(Seq("smith", "doe"): _*))
>       or col("email").isin(List(): _*)
>   )
> )
>
> inputDF.except(exceptDF).show()
> {code}
> Output with Spark 2.2:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name|           email|
> +---+----------+---------+----------------+
> |  2|    apache|    spark|sp...@apache.org|
> |  3|       foo|      bar|            null|
> +---+----------+---------+----------------+{noformat}
> Output with Spark 2.3:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name|           email|
> +---+----------+---------+----------------+
> |  2|    apache|    spark|sp...@apache.org|
> +---+----------+---------+----------------+{noformat}
> Note: changing the last line to
> {code:java}
> inputDF.except(exceptDF.cache()).show()
> {code}
> produces identical (correct) output on both Spark 2.2 and 2.3. This points 
> at the `ReplaceExceptWithFilter` optimizer rule: caching `exceptDF` 
> materializes it, so the rule no longer matches and the except is executed 
> as written, as sketched below.
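> As a rough sketch (illustrative, not the literal rule output), the rule
> effectively rewrites the uncached query into a negated filter plus distinct:
> {code:java}
> // For the id=3 row cond is NULL; NOT NULL is still NULL, so the row is
> // silently dropped by the rewritten filter (col as imported above).
> val cond = ((col("first_name").isin(Seq("john", "jane"): _*)
>   and col("last_name").isin(Seq("smith", "doe"): _*))
>   or col("email").isin(List(): _*))
>
> inputDF.filter(!cond).distinct().show() // Spark 2.3 behaviour: id=3 missing
> {code}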
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
