GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/16894
[SPARK-17897] [SQL] [BACKPORT-2.0] Fixed IsNotNull Constraint Inference Rule
### What changes were proposed in this pull request?
This PR is to backport https://github.com/apache/spark/pull/16067 to Spark
2.0
----
The `constraints` of an operator are the expressions that evaluate to `true`
for all the rows it produces; that is, each constraint's result is neither
`false` nor `unknown` (NULL). Thus, we can conclude `IsNotNull` on all the
constraints, whether they are generated by the operator's own predicates or
propagated from its children. A constraint can be a complex expression. To make
better use of these constraints, we try to push `IsNotNull` down to the
lowest-level expressions (i.e., `Attribute`s). `IsNotNull` can be pushed through
an expression when that expression is null-intolerant. (When any input is NULL,
a null-intolerant expression always evaluates to NULL.)
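As a toy illustration of why the pushdown is sound (plain Scala standing in for SQL's three-valued logic, not Spark's `Expression` classes; the names here are invented), `None` models SQL NULL. A null-intolerant operator like addition propagates NULL, so if `IsNotNull` holds on the whole expression it must also hold on each input:

```scala
// Toy model of SQL three-valued logic: None stands for SQL NULL.
object NullIntolerance {
  // A null-intolerant operator: any NULL input yields a NULL output.
  def plus(a: Option[Int], b: Option[Int]): Option[Int] =
    for (x <- a; y <- b) yield x + y

  // IsNotNull never returns NULL: it maps a NULL input to false.
  def isNotNull[T](v: Option[T]): Boolean = v.isDefined
}
```

Because `plus` is null-intolerant, `isNotNull(plus(a, b))` being `true` implies both `isNotNull(a)` and `isNotNull(b)`, which is exactly what lets the optimizer push `IsNotNull` down to the attributes.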
Below is the existing code we have for `IsNotNull` pushdown.
```Scala
private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
  case a: Attribute => Seq(a)
  case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
    expr.children.flatMap(scanNullIntolerantExpr)
  case _ => Seq.empty[Attribute]
}
```
**`IsNotNull` itself is not null-intolerant.** It converts `null` to
`false`. If a constraint contains no `Not`-like expression, the code above
happens to work; otherwise, it can produce a wrong result. This PR fixes the
function above by removing `IsNotNull` from the inner traversal. After the fix,
when a constraint contains an `IsNotNull` expression, we infer new
attribute-specific `IsNotNull` constraints if and only if `IsNotNull` appears
at the root of the constraint.
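A sketch of the corrected traversal, on a hypothetical mini-AST rather than Spark's real `Expression` hierarchy (all names here are invented for illustration): `IsNotNull` is handled only at the root of a constraint, and the inner scan walks through null-intolerant operators (`Add`, `Not`) but stops at any nested `IsNotNull`:

```scala
// Hypothetical mini-AST standing in for Spark's Expression tree.
sealed trait Expr { def children: Seq[Expr] }
case class Attribute(name: String) extends Expr { def children = Nil }
case class IsNotNull(child: Expr) extends Expr { def children = Seq(child) }
case class Add(l: Expr, r: Expr) extends Expr { def children = Seq(l, r) } // null-intolerant
case class Not(child: Expr) extends Expr { def children = Seq(child) }     // null-intolerant

object ConstraintInference {
  // After the fix: strip IsNotNull only when it is the ROOT of the constraint.
  def inferIsNotNull(constraint: Expr): Seq[Attribute] = constraint match {
    case IsNotNull(child) => scanNullIntolerant(child)
    case _ => Nil
  }

  // Walk down only through null-intolerant operators; a nested IsNotNull
  // stops the traversal, because it maps NULL to false rather than NULL.
  private def scanNullIntolerant(expr: Expr): Seq[Attribute] = expr match {
    case a: Attribute => Seq(a)
    case _: Add | _: Not => expr.children.flatMap(scanNullIntolerant)
    case _ => Nil
  }
}
```

With this shape, the constraint `NOT isnotnull(key)` (root is `Not`, not `IsNotNull`) no longer yields a bogus `isnotnull(key)`, while `isnotnull(a + b)` still infers `isnotnull(a)` and `isnotnull(b)`.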
Without the fix, the following test case returns an empty result.
```Scala
val data = Seq[java.lang.Integer](1, null).toDF("key")
data.filter("not key is not null").show()
```
Before the fix, the optimized plan is like
```
== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
+- LocalRelation [value#1]
```
After the fix, the optimized plan is like
```
== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter NOT isnotnull(value#1)
+- LocalRelation [value#1]
```
### How was this patch tested?
Added a test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark isNotNull2.0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16894.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16894
----
commit 11d2684b1def705ef72b0a64b13c93c8a09d3efc
Author: Xiao Li <[email protected]>
Date: 2017-02-11T07:49:17Z
fix.
----