ahshahid opened a new pull request, #45446:
URL: https://github.com/apache/spark/pull/45446
### What changes were proposed in this pull request?
The change resolves ambiguous column references based on the Dataset from
which the user extracted the column, instead of relying on ExprIds.
This makes the behaviour consistent, intuitive, and logically correct.
The current code mixes the two bases of resolution, sometimes using the ExprId
and sometimes, indirectly, the Dataset Id.
This PR uses the Dataset Id present in the AttributeReference's metadata to
check whether the ambiguity can be resolved sensibly by comparing it with the
Dataset Ids of the Datasets being joined.
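To make the idea concrete, here is a minimal, Spark-free sketch of resolution keyed on the Dataset Id rather than the ExprId. It is a toy model of the description above, not the code in this PR: `Attr`, `JoinLeg`, `DatasetIdResolution`, `Example`, and all ids are hypothetical names and values chosen for illustration, and the fallback to ExprId matching is an assumption.

```scala
// Toy model only: hypothetical types, NOT Spark's AttributeReference or LogicalPlan.
final case class Attr(name: String, exprId: Long, datasetId: Long)

// One side of the top-level join: the id of the Dataset being joined and its output.
final case class JoinLeg(datasetId: Long, output: Seq[Attr])

object DatasetIdResolution {
  // Resolve `ref` to an output attribute of exactly one leg, preferring the
  // Dataset id carried by the reference over its ExprId.
  def resolve(ref: Attr, left: JoinLeg, right: JoinLeg): Either[String, Attr] = {
    val both = Seq(left, right)
    both.filter(_.datasetId == ref.datasetId) match {
      case Seq(leg) =>
        // Exactly one joined Dataset owns the reference: look the column up there.
        leg.output.find(_.name == ref.name)
          .toRight(s"column '${ref.name}' not found in dataset ${leg.datasetId}")
      case Seq() =>
        // Assumed fallback: neither leg is the owning Dataset, so match on ExprId.
        both.flatMap(_.output).filter(_.exprId == ref.exprId) match {
          case Seq(single) => Right(single)
          case Seq()       => Left(s"column '${ref.name}' cannot be resolved")
          case _           => Left(s"column '${ref.name}' is ambiguous by ExprId")
        }
      case _ =>
        // Both legs are the same Dataset (direct self join): genuinely ambiguous.
        Left(s"column '${ref.name}' is ambiguous: both legs are dataset ${ref.datasetId}")
    }
  }
}

object Example {
  // df1 has hypothetical dataset id 100; df1Joindf2 has hypothetical dataset id 300.
  val df1A: Attr        = Attr("a", exprId = 1, datasetId = 100) // df1("a")
  val df1Joindf2A: Attr = Attr("a", exprId = 1, datasetId = 300) // df1Joindf2("a")

  // Left leg: df1Joindf2, whose output still carries df1's original ExprIds.
  val left: JoinLeg = JoinLeg(300, Seq(Attr("a", 1, 100), Attr("aa", 3, 200), Attr("b", 2, 100)))
  // Right leg: df1 again, with fresh ExprIds as in a de-duplicated self join.
  val right: JoinLeg = JoinLeg(100, Seq(Attr("a", 11, 100), Attr("b", 12, 100)))
}
```

In this model, `df1("a")` resolves to the right leg because `df1` itself is one of the two Datasets joined at the top level, even though its ExprId also appears in the left leg's output. A direct self join, where both legs carry the same Dataset Id, is still reported as ambiguous.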
### Why are the changes needed?
While fixing a bug where an ambiguous column exception was raised (in a query
that worked fine in earlier versions of Spark), I came across multiple
situations where a nested join involving self joins works, but fails when the
join order is changed. In other cases, a column extracted from a Dataset
involved in the join is treated as unambiguous when used in the join condition,
but the same column causes an ambiguity exception when used in a projection
(select).
I also believe there is an existing test that is falsely passing, where the
attribute is not resolved to the expected Dataset.
For example:
```
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa"))
  .select(df1("a"), df2("aa"), df1("b"))
df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
```
The above works fine, but the code below throws an exception. The only
difference between the two is that the latter adds `select(df1("a"))`, even
though `df1("a")` works fine in the join condition.
```
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa"))
  .select(df1("a"), df2("aa"), df1("b"))
df1Joindf2.join(df1, df1Joindf2("a") === df1("a")).select(df1("a"))
```
### Does this PR introduce _any_ user-facing change?
Yes.
Datasets involving self joins that previously threw ambiguity-related
exceptions are now expected to work, provided the columns passed to the APIs
are extracted from the Datasets being joined at the topmost level.
### How was this patch tested?
Added new tests and made the assertions stricter. Modified the existing tests
in DataFrameSelfJoinTest whose column references are logically unambiguous
given the Datasets from which the columns are extracted.
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]