Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18652
It is a good question. Based on the previous discussion, I think the Join operator
has no unique result in the non-deterministic case. The migration issue from
Hive arises because this kind of query can't run on current Spark SQL at all
(the analyzer rejects non-deterministic expressions in join conditions). Whether
we exactly follow Hive's behavior is not the real pain point, if I'm not missing
something.
Take the non-deterministic join in the query seen in the discussion thread
at
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973
as an example:
```sql
SELECT a.col1
FROM tbl1 a
LEFT OUTER JOIN tbl2 b
  ON CASE
       WHEN a.col2 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
       ELSE a.col2
     END = b.col3;
```
Here, the varying join result (the randomized value replacing NULL values of
a.col2) actually doesn't matter, because only a.col1 is retained in the end.
The purpose of this query is to mitigate data skew (a huge number of NULL keys)
when joining. That said, the columns coming from non-deterministic join
conditions are not really useful in these cases. I even think this may be why
Hive doesn't give special consideration to non-deterministic expressions when
joining: since there's no unique result, users have to take care of what they
do with it themselves.
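For illustration, the same skew mitigation can already be written by evaluating
the non-deterministic expression in a projection below the join, so the join
condition itself stays deterministic. This is only a sketch of the pattern; the
subquery alias `t` and the column name `join_key` are names I made up for the
example:

```sql
-- Sketch: salt NULL keys in a subquery projection, then join on the
-- precomputed column instead of a non-deterministic expression.
SELECT t.col1
FROM (
  SELECT col1,
         CASE
           WHEN col2 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
           ELSE col2
         END AS join_key
  FROM tbl1
) t
LEFT OUTER JOIN tbl2 b
  ON t.join_key = b.col3;
```

I believe current Spark SQL accepts this form, since the non-deterministic
expression only appears in a Project, and the NULL rows still get spread across
partitions instead of all hashing to one.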
I am not sure if I've conveyed the idea clearly.
As an analogy, consider a query like `SELECT a.col1 FROM table a WHERE rand() >
0.5`. Even though our rand generates different numbers than Hive's, so the final
result is different, it still helps with migrating Hive workloads that use rand.