GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/16894

    [SPARK-17897] [SQL] [BACKPORT-2.0] Fixed IsNotNull Constraint Inference Rule

    ### What changes were proposed in this pull request?
    
    This PR backports https://github.com/apache/spark/pull/16067 to Spark 2.0.
    
    ----
    
    The `constraints` of an operator are the expressions that evaluate to `true` 
for all the rows it produces. That means a constraint's result is neither 
`false` nor `unknown` (NULL). Thus, we can conclude that `IsNotNull` holds on 
each constraint, whether it is generated by the operator's own predicates or 
propagated from its children. A constraint can be a complex expression. To make 
better use of these constraints, we try to push `IsNotNull` down to the 
lowest-level expressions (i.e., `Attribute`s). `IsNotNull` can be pushed 
through an expression when that expression is null-intolerant, i.e., it always 
evaluates to NULL when its input is NULL.
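
    As a quick illustration, here is a spark-shell sketch (the column name and 
data are made up for this example, not taken from the PR):
    ```Scala
    // Assumes a spark-shell session, where `spark` and its implicits exist.
    import spark.implicits._

    val df = Seq[java.lang.Integer](1, null).toDF("key")

    // `(key + 1) > 0` is built from null-intolerant expressions: a NULL `key`
    // makes the whole predicate NULL, so the optimizer can additionally infer
    // `isnotnull(key)`. The optimized plan should show a Filter like
    //   (isnotnull(key#...) && ((key#... + 1) > 0))
    df.filter("key + 1 > 0").explain(true)
    ```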
    
    Below is the existing code we have for `IsNotNull` pushdown.
    ```Scala
      private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
        case a: Attribute => Seq(a)
        case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
          expr.children.flatMap(scanNullIntolerantExpr)
        case _ => Seq.empty[Attribute]
      }
    ```
    
    **`IsNotNull` itself is not null-intolerant.** It converts `null` to 
`false`, not to `null`. As long as the constraint does not contain any 
`Not`-like expression, the code above happens to work; otherwise, it can 
generate a wrong result. This PR fixes the function above by removing 
`IsNotNull` from the inference. After the fix, when a constraint contains an 
`IsNotNull` expression, we infer new attribute-specific `IsNotNull` 
constraints if and only if `IsNotNull` appears at the root of the constraint. 
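
    For reference, a simplified sketch of the fixed inference (illustrative, 
not necessarily the exact committed code):
    ```Scala
      // Push IsNotNull down only through genuinely null-intolerant expressions;
      // IsNotNull itself is no longer matched as if it were null-intolerant.
      private def scanNullIntolerantAttribute(expr: Expression): Seq[Attribute] = expr match {
        case a: Attribute => Seq(a)
        case _: NullIntolerant => expr.children.flatMap(scanNullIntolerantAttribute)
        case _ => Seq.empty[Attribute]
      }

      // An IsNotNull at the root is pushed through its child; any other
      // constraint is known to be non-null as a whole, so scan it directly.
      private def inferIsNotNullConstraints(constraint: Expression): Seq[Expression] =
        constraint match {
          case IsNotNull(expr) => scanNullIntolerantAttribute(expr).map(IsNotNull(_))
          case _ => scanNullIntolerantAttribute(constraint).map(IsNotNull(_))
        }
    ```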
    
    Without the fix, the following test case returns an empty result, although 
it should return the single row whose `key` is null.
    ```Scala
    // Assumes a spark-shell session with implicits in scope.
    import spark.implicits._

    val data = Seq[java.lang.Integer](1, null).toDF("key")
    data.filter("not key is not null").show()
    ```
    Before the fix, the optimized plan contains a contradictory filter: the 
buggy rule infers `isnotnull(value#1)` from the `NOT isnotnull(value#1)` 
predicate, so no rows can pass.
    ```
    == Optimized Logical Plan ==
    Project [value#1 AS key#3]
    +- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
       +- LocalRelation [value#1]
    ```
    
    After the fix, the optimized plan keeps only the original predicate.
    ```
    == Optimized Logical Plan ==
    Project [value#1 AS key#3]
    +- Filter NOT isnotnull(value#1)
       +- LocalRelation [value#1]
    ```
    
    ### How was this patch tested?
    Added a test.
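
    A hedged sketch of what such a regression test could look like (the test 
name, `checkAnswer`, and the surrounding suite are assumptions for 
illustration, not the exact committed test):
    ```Scala
    // Hypothetical test body, in the style of Spark's QueryTest suites.
    test("SPARK-17897: IsNotNull inference with NOT (key IS NOT NULL)") {
      val data = Seq[java.lang.Integer](1, null).toDF("key")
      // The null row must survive the filter; before the fix, the inferred
      // contradictory constraint made the result empty.
      checkAnswer(data.filter("not key is not null"), Row(null))
    }
    ```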

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark isNotNull2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16894.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16894
    
----
commit 11d2684b1def705ef72b0a64b13c93c8a09d3efc
Author: Xiao Li <gatorsm...@gmail.com>
Date:   2017-02-11T07:49:17Z

    fix.

----

