[ 
https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734363#comment-16734363
 ] 

Reynold Xin commented on SPARK-26366:
-------------------------------------

Please make sure you guys tag these tickets with correctness label!

> Except with transform regression
> --------------------------------
>
>                 Key: SPARK-26366
>                 URL: https://issues.apache.org/jira/browse/SPARK-26366
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.2
>            Reporter: Dan Osipov
>            Assignee: Marco Gaido
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> There appears to be a regression between Spark 2.2 and 2.3. Below is the code 
> to reproduce it:
>  
> {code:java}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val inputDF = spark.sqlContext.createDataFrame(
>   spark.sparkContext.parallelize(Seq(
>     Row("0", "john", "smith", "j...@smith.com"),
>     Row("1", "jane", "doe", "j...@doe.com"),
>     Row("2", "apache", "spark", "sp...@apache.org"),
>     Row("3", "foo", "bar", null)
>   )),
>   StructType(List(
>     StructField("id", StringType, nullable=true),
>     StructField("first_name", StringType, nullable=true),
>     StructField("last_name", StringType, nullable=true),
>     StructField("email", StringType, nullable=true)
>   ))
> )
> val exceptDF = inputDF.transform( toProcessDF =>
>   toProcessDF.filter(
>       (
>         col("first_name").isin(Seq("john", "jane"): _*)
>           and col("last_name").isin(Seq("smith", "doe"): _*)
>       )
>       or col("email").isin(List(): _*)
>   )
> )
> inputDF.except(exceptDF).show()
> {code}
> Output with Spark 2.2:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name| email|
> +---+----------+---------+----------------+
> | 2| apache| spark|sp...@apache.org|
> | 3| foo| bar| null|
> +---+----------+---------+----------------+{noformat}
> Output with Spark 2.3:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name| email|
> +---+----------+---------+----------------+
> | 2| apache| spark|sp...@apache.org|
> +---+----------+---------+----------------+{noformat}
> Note, changing the last line to 
> {code:java}
> inputDF.except(exceptDF.cache()).show()
> {code}
> produces identical output for both Spark 2.3 and 2.2
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to