[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720464#comment-16720464 ]
ASF GitHub Bot commented on SPARK-26366:
----------------------------------------

mgaido91 opened a new pull request #23315: [SPARK-26366][SQL] ReplaceExceptWithFilter should consider NULL as False
URL: https://github.com/apache/spark/pull/23315

## What changes were proposed in this pull request?

`ReplaceExceptWithFilter` does not consider the case in which the condition evaluates to NULL. In that case, negating NULL still returns NULL, so the assumption that negating the condition returns all the rows which did not satisfy it does not hold: rows for which the condition evaluates to NULL are returned by neither the condition nor its negation. The PR fixes this by treating a NULL condition as False, so that negating it does return all the rows which did not satisfy the original condition.

## How was this patch tested?

Added UTs.

> Except with transform regression
> --------------------------------
>
>                 Key: SPARK-26366
>                 URL: https://issues.apache.org/jira/browse/SPARK-26366
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.2
>            Reporter: Dan Osipov
>            Priority: Major
>
> There appears to be a regression between Spark 2.2 and 2.3.
> Below is the code to reproduce it:
>
> {code:java}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> val inputDF = spark.sqlContext.createDataFrame(
>   spark.sparkContext.parallelize(Seq(
>     Row("0", "john", "smith", "j...@smith.com"),
>     Row("1", "jane", "doe", "j...@doe.com"),
>     Row("2", "apache", "spark", "sp...@apache.org"),
>     Row("3", "foo", "bar", null)
>   )),
>   StructType(List(
>     StructField("id", StringType, nullable=true),
>     StructField("first_name", StringType, nullable=true),
>     StructField("last_name", StringType, nullable=true),
>     StructField("email", StringType, nullable=true)
>   ))
> )
>
> val exceptDF = inputDF.transform( toProcessDF =>
>   toProcessDF.filter(
>     (
>       col("first_name").isin(Seq("john", "jane"): _*)
>       and col("last_name").isin(Seq("smith", "doe"): _*)
>     )
>     or col("email").isin(List(): _*)
>   )
> )
>
> inputDF.except(exceptDF).show()
> {code}
>
> Output with Spark 2.2:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name|           email|
> +---+----------+---------+----------------+
> |  2|    apache|    spark|sp...@apache.org|
> |  3|       foo|      bar|            null|
> +---+----------+---------+----------------+{noformat}
>
> Output with Spark 2.3:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name|           email|
> +---+----------+---------+----------------+
> |  2|    apache|    spark|sp...@apache.org|
> +---+----------+---------+----------------+{noformat}
>
> Note: changing the last line to
> {code:java}
> inputDF.except(exceptDF.cache()).show()
> {code}
> produces identical output for both Spark 2.2 and 2.3.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
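The missing row is explained by SQL's three-valued logic: for the `foo`/`bar` row, `email` is NULL, so the filter condition evaluates to NULL, and negating it (as `ReplaceExceptWithFilter` effectively does) also yields NULL rather than true, so a filter drops the row. A minimal sketch of that semantics in plain Scala, modeling SQL NULL as `None` (the names `not3`, `buggy`, and `fixed` are illustrative, not Spark internals):

```scala
object ThreeValuedLogicDemo {
  // SQL NOT under three-valued logic: NOT(NULL) is still NULL.
  def not3(b: Option[Boolean]): Option[Boolean] = b.map(!_)

  def main(args: Array[String]): Unit = {
    // Condition results for three rows: true, false, and NULL.
    val conds = Seq(Some(true), Some(false), None)

    // Buggy rewrite: filter on NOT(cond). The NULL row is dropped,
    // because NOT(NULL) = NULL, which a filter does not treat as "keep".
    val buggy = conds.filter(c => not3(c).contains(true))

    // Fixed semantics: treat a NULL condition as false before negating,
    // so NULL rows count as "did not satisfy the condition".
    val fixed = conds.filter(c => !c.getOrElse(false))

    println(buggy) // List(Some(false))
    println(fixed) // List(Some(false), None)
  }
}
```

This is why caching `exceptDF` hides the bug: the cached plan is no longer rewritten into a negated filter by the optimizer.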