[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin updated SPARK-26366:
--------------------------------
Comment: was deleted
(was: mgaido91 opened a new pull request #23372:
[SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL
as False
URL: https://github.com/apache/spark/pull/23372
## What changes were proposed in this pull request?
In `ReplaceExceptWithFilter` we do not properly handle the case in which
the condition evaluates to NULL. Since negating NULL still returns NULL,
the assumption that negating the condition returns all the rows which
didn't satisfy it does not hold: rows for which the condition evaluates to
NULL pass neither the condition nor its negation, so they may be missing
from the result. This happens when the constraints inferred by
`InferFiltersFromConstraints` are not enough, as with `OR` conditions.
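To see the three-valued-logic issue in isolation (a minimal spark-shell illustration, not part of the PR itself):
{code:java}
// In SQL's three-valued logic, negating NULL yields NULL, not true. A row
// whose condition evaluates to NULL is therefore dropped both by
// filter(cond) and by filter(!cond).
spark.sql("SELECT NOT CAST(NULL AS BOOLEAN) AS negated").show()
// the single `negated` value is null
{code}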
The rule also had problems with non-deterministic conditions: in such a
scenario, the rewrite re-evaluates the condition and thus changes the
probability of each row appearing in the output (see the illustration below).
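An illustration of the non-determinism problem (a hypothetical snippet, assuming a spark-shell session):
{code:java}
import org.apache.spark.sql.functions.rand

// `kept` holds a random ~50% of the rows.
val df = spark.range(100).toDF("id")
val kept = df.filter(rand() < 0.5)

// Rewriting df.except(kept) as a negated filter draws a fresh random
// number for every row, so the result is not the complement of `kept`:
// some ids would appear on both sides, others on neither.
val wrongRewrite = df.filter(!(rand() < 0.5))
{code}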
The PR fixes these problems by:
- treating the condition as false when it evaluates to NULL (this way the
negated condition does return all the rows which didn't satisfy it), as
sketched after this list;
- skipping the transformation entirely when the condition is non-deterministic.
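A minimal sketch of the first fix, assuming a stand-in `condition` expression (the actual change lives inside the `ReplaceExceptWithFilter` rule; the names below are illustrative):
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Coalesce, Expression, IsNull, Literal, Not}

// Stand-in for the predicate the rule infers from the right-hand side
// of the Except (illustrative only).
val condition: Expression = IsNull(UnresolvedAttribute("email"))

// Coalesce a NULL evaluation to false before negating: NOT(coalesce(cond,
// false)) is true exactly for the rows that did not satisfy cond,
// including the rows where cond evaluated to NULL.
val fixedCondition: Expression = Not(Coalesce(Seq(condition, Literal(false))))
{code}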
## How was this patch tested?
Added unit tests.
)
> Except with transform regression
> --------------------------------
>
> Key: SPARK-26366
> URL: https://issues.apache.org/jira/browse/SPARK-26366
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.3.2
> Reporter: Dan Osipov
> Assignee: Marco Gaido
> Priority: Major
> Labels: correctness
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> There appears to be a regression between Spark 2.2 and 2.3. Below is the code
> to reproduce it:
>
> {code:java}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> val inputDF = spark.sqlContext.createDataFrame(
>   spark.sparkContext.parallelize(Seq(
>     Row("0", "john", "smith", "[email protected]"),
>     Row("1", "jane", "doe", "[email protected]"),
>     Row("2", "apache", "spark", "[email protected]"),
>     Row("3", "foo", "bar", null)
>   )),
>   StructType(List(
>     StructField("id", StringType, nullable = true),
>     StructField("first_name", StringType, nullable = true),
>     StructField("last_name", StringType, nullable = true),
>     StructField("email", StringType, nullable = true)
>   ))
> )
>
> val exceptDF = inputDF.transform(toProcessDF =>
>   toProcessDF.filter(
>     (col("first_name").isin(Seq("john", "jane"): _*)
>       and col("last_name").isin(Seq("smith", "doe"): _*))
>       or col("email").isin(List(): _*)
>   )
> )
>
> inputDF.except(exceptDF).show()
> {code}
> Output with Spark 2.2:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name| email|
> +---+----------+---------+----------------+
> | 2| apache| spark|[email protected]|
> | 3| foo| bar| null|
> +---+----------+---------+----------------+
> {noformat}
> Output with Spark 2.3:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name| email|
> +---+----------+---------+----------------+
> | 2| apache| spark|[email protected]|
> +---+----------+---------+----------------+
> {noformat}
> Note: changing the last line to
> {code:java}
> inputDF.except(exceptDF.cache()).show()
> {code}
> produces identical output on both Spark 2.2 and 2.3, likely because caching
> materializes exceptDF and prevents the optimizer from rewriting the except
> as a negated filter.
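>
> One way to see the rewrite in action (an illustrative debugging step, not part of the original report):
> {code:java}
> // Inspect the optimized plan: in 2.3 the Except is replaced by a Filter
> // over the negated condition (the ReplaceExceptWithFilter rule), which is
> // where the NULL-email row gets lost.
> inputDF.except(exceptDF).explain(true)
> {code}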
>