cloud-fan commented on a change in pull request #25107: [SPARK-28344][SQL] 
detect ambiguous self-join and fail the query
URL: https://github.com/apache/spark/pull/25107#discussion_r308577896
 
 

 ##########
 File path: docs/sql-migration-guide-upgrade.md
 ##########
 @@ -155,12 +155,14 @@ license: |
 
  - Since Spark 3.0, a 0-argument Java UDF is executed on the executor side, the same as other UDFs. In Spark version 2.4 and earlier, a 0-argument Java UDF alone was executed on the driver side and the result was propagated to executors, which might be more performant in some cases but caused inconsistency and correctness issues in some cases.
 
+  - Since Spark 3.0, a Dataset query fails if it contains an ambiguous column reference caused by a self-join. A typical example: `val df1 = ...; val df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an empty result, which is quite confusing. This is because Spark cannot resolve Dataset column references that point to tables being self-joined, and `df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin` to `false`.
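
A minimal sketch of the problem and the alias workaround, assuming a local `SparkSession` (the object name `SelfJoinExample` and the sample data are illustrative, not from the PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SelfJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("self-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("a")
    val df2 = df1.filter($"a" > 1)

    // Ambiguous: df1("a") and df2("a") resolve to the same underlying
    // column, so in Spark 2.4 this join condition silently compares a
    // column with itself and returns an empty result. In Spark 3.0,
    // with spark.sql.analyzer.failAmbiguousSelfJoin enabled (the
    // default), the query fails instead.
    // df1.join(df2, df1("a") > df2("a"))

    // Workaround: disambiguate the two sides with aliases.
    val left  = df1.alias("l")
    val right = df2.alias("r")
    val joined = left.join(right, col("l.a") > col("r.a"))
    joined.show()

    spark.stop()
  }
}
```

With the aliases, `col("l.a")` and `col("r.a")` are resolved against distinct plan nodes, so the join condition compares the two sides as intended.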
 
 Review comment:
   > describe the other way to correct the problem is by using column aliases 
(with an example)
   
   The error message already suggests it; see 
https://github.com/apache/spark/pull/25107/files#diff-72682666ae0e00b0be514f6867838be5R143

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
