Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
It is a good question. Based on the previous discussion, I think the Join operator
has no unique result in the non-deterministic case. The migration issue from
Hive arises because this kind of query can't run on current Spark SQL at all
(the analyzer rejects non-deterministic expressions in join conditions). Whether
we exactly follow Hive's behavior is not the real pain point, if I'm not missing
something.
Take the non-deterministic join in the query seen in the discussion thread
at
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973
as an example:
```sql
SELECT a.col1
FROM tbl1 a
LEFT OUTER JOIN tbl2 b
  ON CASE
       WHEN a.col2 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
       ELSE a.col2
     END = b.col3;
```
Here, the varying join result (the randomized value replacing NULL values of
a.col2) actually doesn't matter, because only a.col1 is retained in the end.
The purpose of this query is to mitigate data skew (a huge number of NULL keys)
when joining. That said, the columns coming from non-deterministic join
conditions are not really useful in these cases. I even think this may be why
Hive doesn't give special consideration to non-deterministic expressions when
joining: since there's no unique result, users have to take care of what they
do with it themselves.
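For illustration, the same skew mitigation can already be written by evaluating
the non-deterministic expression in a projection below the join, so the join
condition itself stays deterministic. This is only a sketch of the pattern; the
subquery alias `t` and the column name `join_key` are names I made up for the
example:

```sql
-- Sketch: salt NULL keys in a subquery projection, then join on the
-- precomputed column instead of a non-deterministic expression.
SELECT t.col1
FROM (
  SELECT col1,
         CASE
           WHEN col2 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
           ELSE col2
         END AS join_key
  FROM tbl1
) t
LEFT OUTER JOIN tbl2 b
  ON t.join_key = b.col3;
```

I believe current Spark SQL accepts this form, since the non-deterministic
expression only appears in a Project, and the NULL rows still get spread across
partitions instead of all hashing to one.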
I am not sure if I've conveyed the idea clearly.
As an analogy, consider a query like `SELECT a.col1 FROM table a WHERE rand() >
0.5`. Even though our rand generates different numbers than Hive's, so the final
result is different, it still helps with migrating Hive workloads that use rand.