[ 
https://issues.apache.org/jira/browse/SPARK-21497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-21497:
------------------------------------
    Description: 
Currently SparkSQL doesn't support non-deterministic join conditions in a Join 
operator. Such conditions can be useful in some cases, e.g., 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973.

There seems to be no standard behavior for pulling non-deterministic join 
conditions out of a Join operator. Based on the discussion at 
https://github.com/apache/spark/pull/18652#issuecomment-316344905, 
https://github.com/apache/spark/pull/18652#issuecomment-316391759 and 
https://github.com/apache/spark/pull/18652#issuecomment-316665649, Hive has no 
special handling for non-deterministic join conditions and simply pushes them 
down or uses them as join keys.

In this attempt, we initially allow non-deterministic equi-join keys in Join 
operators. Because SparkSQL's join implementations evaluate the equi-join keys 
once on the joining tables, pulling the equi-join keys out of the Join operator 
won't change the number of calls to the non-deterministic expressions. This is 
safer than for other kinds of join conditions, e.g. rand(10) > a && 
rand(20) < b, where pulling the condition out or pushing it down may change the 
number of calls to rand().
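
The call-count argument above can be illustrated with a small toy model in 
plain Python. This is only a sketch, not SparkSQL code: CountingRand, 
hash_join, pairs and the sample rows are hypothetical names and data made up 
for the illustration, and the nested loop stands in for a join that evaluates 
a general condition once per pair of rows.

```python
import random


class CountingRand:
    """Toy stand-in for a non-deterministic expression like rand(seed); counts calls."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return self._rng.random()


def hash_join(left_rows, right_rows, left_key, right_key):
    """Equi-join that evaluates each side's key expression once per input row."""
    table = {}
    for rrow in right_rows:
        table.setdefault(right_key(rrow), []).append(rrow)
    out = []
    for lrow in left_rows:
        k = left_key(lrow)  # evaluated once per left row
        out.extend((lrow, rrow) for rrow in table.get(k, []))
    return out


def pairs(joined):
    """Reduce joined rows to comparable (a, k) tuples."""
    return [(lrow["a"], rrow["k"]) for lrow, rrow in joined]


left = [{"a": 0.2}, {"a": 0.5}, {"a": 0.8}, {"a": 0.9}]
right = [{"b": 0.1, "k": 0}, {"b": 0.4, "k": 1}, {"b": 0.7, "k": 2}]

# Case 1a: equi-join key int(rand(10) * 3) = right.k, evaluated inside the join.
rand10 = CountingRand(10)
joined_in_place = hash_join(left, right,
                            lambda row: int(rand10() * 3),
                            lambda row: row["k"])
calls_in_join = rand10.calls

# Case 1b: the same key pulled out into a projection below the join.
rand10 = CountingRand(10)
projected = [dict(row, _k=int(rand10() * 3)) for row in left]
joined_pulled_out = hash_join(projected, right,
                              lambda row: row["_k"],
                              lambda row: row["k"])
calls_pulled_out = rand10.calls

# Case 2a: rand(10) > a && rand(20) < b evaluated per pair inside the join.
rand10, rand20 = CountingRand(10), CountingRand(20)
in_join = [(l, r) for l in left for r in right
           if rand10() > l["a"] and rand20() < r["b"]]
cond_calls_in_join = rand10.calls

# Case 2b: the same condition split and pushed down to each side.
rand10, rand20 = CountingRand(10), CountingRand(20)
left_filtered = [l for l in left if rand10() > l["a"]]
right_filtered = [r for r in right if rand20() < r["b"]]
pushed_down = [(l, r) for l in left_filtered for r in right_filtered]
cond_calls_pushed_down = (rand10.calls, rand20.calls)

print(calls_in_join, calls_pulled_out)             # 4 4
print(cond_calls_in_join, cond_calls_pushed_down)  # 12 (4, 3)
```

In this model the equi-join key is called once per left row whether or not it 
is pulled out, and both variants produce the same joined pairs. The general 
condition calls rand(10) once per pair inside the join but only once per left 
row after push-down, so the two plans are not equivalent.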

We also add a SQL conf to control this new behavior. It is disabled by default.



> Pull non-deterministic joining keys from Join operator
> ------------------------------------------------------
>
>                 Key: SPARK-21497
>                 URL: https://issues.apache.org/jira/browse/SPARK-21497
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Liang-Chi Hsieh
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
