cloud-fan commented on a change in pull request #25107: [SPARK-28344][SQL]
detect ambiguous self-join and fail the query
URL: https://github.com/apache/spark/pull/25107#discussion_r308577896
##########
File path: docs/sql-migration-guide-upgrade.md
##########
@@ -155,12 +155,14 @@ license: |
- Since Spark 3.0, a 0-argument Java UDF is executed on the executor side
in the same way as other UDFs. In Spark version 2.4 and earlier, a 0-argument Java
UDF alone was executed on the driver side, and the result was propagated to
executors. This might be more performant in some cases, but it was
inconsistent with other UDFs and caused correctness issues in some cases.
+ - Since Spark 3.0, a Dataset query fails if it contains an ambiguous column
reference caused by a self-join. A typical example: given `val df1 = ...; val
df2 = df1.filter(...)`, the query `df1.join(df2, df1("a") > df2("a"))` returns an
empty result, which is quite confusing. This is because Spark cannot resolve
Dataset column references that point to tables being self-joined:
`df1("a")` is exactly the same column as `df2("a")` in Spark. To restore the behavior
before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin` to
`false`.
Review comment:
> describe the other way to correct the problem is by using column aliases
(with an example)
The error message already includes this suggestion; see
https://github.com/apache/spark/pull/25107/files#diff-72682666ae0e00b0be514f6867838be5R143
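
For reference, the alias-based workaround mentioned there can be sketched roughly as follows. This is a minimal sketch, not the wording of the PR's error message; it assumes Spark 3.x with an active `SparkSession` named `spark`, and the column name `a` is made up for illustration:

```scala
import org.apache.spark.sql.functions.col

val df1 = spark.range(10).toDF("a")
val df2 = df1.filter(col("a") > 3)

// Ambiguous: df1("a") and df2("a") resolve to the same attribute, so the
// condition compares a column with itself. Before Spark 3.0 this silently
// returned an empty result; with failAmbiguousSelfJoin enabled it fails.
// df1.join(df2, df1("a") > df2("a"))

// Workaround: alias each side of the self-join and refer to the columns
// through the aliases, so the references are unambiguous.
val left  = df1.alias("l")
val right = df2.alias("r")
val joined = left.join(right, col("l.a") > col("r.a"))
```

Because the sketch needs a running Spark session, it is illustrative rather than standalone-runnable.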