GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/17585

    [SPARK-20273] [SQL] Disallow Non-deterministic Filter push-down into Join 
Conditions

    ## What changes were proposed in this pull request?
    ```
    sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
    ```
    We will get the following error:
    ```
    Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
        at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
        at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
    ```
    Filters can be pushed down into join conditions by the optimizer rule `PushPredicateThroughJoin`. However, [the analyzer blocks users from adding non-deterministic join conditions](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L386-L395) (for details, see PR https://github.com/apache/spark/pull/7535).
    
    We should not push down non-deterministic conditions; otherwise, we would have to require users to explicitly initialize the non-deterministic expressions first. This PR simply blocks the push-down.
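    The intent of the fix can be sketched as follows. This is a hypothetical, self-contained Scala sketch, not the actual Catalyst code; the names `Predicate` and `splitPushable` are illustrative. The idea is that predicates are partitioned by determinism, and only deterministic ones remain eligible for push-down into a join condition.

    ```scala
    // Hypothetical model of determinism-aware predicate push-down.
    // `Predicate` and `splitPushable` are illustrative names, not Catalyst APIs.
    case class Predicate(sql: String, deterministic: Boolean)

    // A join condition may be evaluated more than once per input row, so
    // pushing a non-deterministic predicate (e.g. one calling rand()) into it
    // can change results or crash; only deterministic predicates are pushable.
    def splitPushable(preds: Seq[Predicate]): (Seq[Predicate], Seq[Predicate]) =
      preds.partition(_.deterministic)

    val predicates = Seq(
      Predicate("t1.b = t2.b", deterministic = true),
      Predicate("rand(0) > 0.5", deterministic = false)
    )
    val (pushable, kept) = splitPushable(predicates)
    println(pushable.map(_.sql))  // List(t1.b = t2.b)
    println(kept.map(_.sql))      // List(rand(0) > 0.5)
    ```

    With this split, `rand(0) > 0.5` stays above the join instead of being folded into the join condition, which is exactly what the blocked push-down achieves.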
    
    ### How was this patch tested?
    Added a test case

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark joinRandCondition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17585.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17585
    
----
commit be3fb64ce60f8fb86ef8cf6f264fa3e8bf3b5f01
Author: Xiao Li <[email protected]>
Date:   2017-04-10T05:59:42Z

    fix.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
