[PR] [SPARK-47217][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

via GitHub Fri, 08 Mar 2024 14:51:23 -0800


ahshahid opened a new pull request, #45446:
URL: https://github.com/apache/spark/pull/45446


   ### What changes were proposed in this pull request?
   The basis of the change is to resolve ambiguity based on the Dataset from 
which the user extracted the column, instead of on ExprIds.
   This results in behaviour that is consistent, intuitive, and logically 
correct.
   The current code mixes the basis of resolution, sometimes using the ExprId 
and sometimes indirectly using the Dataset ID.
   This PR uses the Dataset ID present in the AttributeReference's metadata to 
check whether the ambiguity can be resolved sensibly by comparing it with the 
Dataset IDs of the joining Datasets.
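
   The idea can be illustrated with a small toy model. This is only a sketch 
in plain Scala, not the code in this PR: `Attr`, `ColumnRef`, `JoinSide`, 
`Resolved` and `resolve` are hypothetical names, and the real change works on 
Spark's logical plans and AttributeReference metadata. The sketch assumes that 
each side of a join knows its own Dataset ID, and that a column reference 
remembers the ID of the Dataset it was extracted from.

```scala
// Toy model only: these types are hypothetical and are not Spark classes.
object DatasetIdResolutionSketch {
  // An output attribute of one side of a join.
  final case class Attr(name: String, exprId: Long)

  // A column as the user wrote it, e.g. df1("a"): the column name plus the
  // ID of the Dataset it was extracted from.
  final case class ColumnRef(name: String, datasetId: Long)

  // One side of a join: the ID of the joining Dataset and its output.
  final case class JoinSide(datasetId: Long, output: Seq[Attr])

  // The attribute chosen, together with the ID of the side that supplied it.
  final case class Resolved(sideDatasetId: Long, attr: Attr)

  // Resolve `ref` against both sides of a join. If the name matches on more
  // than one side, keep only the side whose Dataset ID equals the ID carried
  // by the reference; if that does not leave exactly one match, report an
  // ambiguity.
  def resolve(ref: ColumnRef, left: JoinSide, right: JoinSide): Either[String, Resolved] = {
    val candidates = for {
      side <- Seq(left, right)
      attr <- side.output if attr.name == ref.name
    } yield Resolved(side.datasetId, attr)

    candidates match {
      case Seq()     => Left(s"column '${ref.name}' not found")
      case Seq(only) => Right(only)
      case many =>
        many.filter(_.sideDatasetId == ref.datasetId) match {
          case Seq(hit) => Right(hit)
          case _        => Left(s"column '${ref.name}' is ambiguous")
        }
    }
  }

  // Shaped like a self join: a joined Dataset (ID 3) that still exposes
  // columns "a" and "b" with their original ExprIds, joined back to the
  // base Dataset (ID 1).
  val joinedSide: JoinSide = JoinSide(3L, Seq(Attr("a", 1L), Attr("aa", 3L), Attr("b", 2L)))
  val df1Side: JoinSide    = JoinSide(1L, Seq(Attr("a", 1L), Attr("b", 2L)))
}
```

   In this model both sides expose an attribute `a` with the same ExprId, so 
the ExprId alone cannot tell them apart, while the Dataset ID carried by the 
reference can. A reference whose Dataset ID matches neither side is still 
reported as ambiguous.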
   
   ### Why are the changes needed?
   While fixing a bug where an ambiguous column exception was raised for a 
query that worked fine in earlier versions of Spark, I came across multiple 
inconsistencies. A nested joined Dataset involving self joins works, but fails 
when the join order is changed. A column extracted from a Dataset involved in 
the join is treated as unambiguous when used in the join condition, but the 
same column causes an ambiguity exception when used in a projection (select).
   I believe there is also an existing test that is falsely passing, where the 
attribute is not resolved to the expected Dataset.
   For example:
   ```scala
   val df1 = Seq((1, 2)).toDF("a", "b")
   val df2 = Seq((1, 2)).toDF("aa", "bb")
   val df1Joindf2 = df1.join(df2, df1("a") === df2("aa"))
     .select(df1("a"), df2("aa"), df1("b"))
   
   df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
   ```
   The above works fine, but the snippet below throws an exception. The only 
difference between the two is that the latter adds `select(df1("a"))`, even 
though `df1("a")` works fine in the join condition.
   ```scala
   val df1 = Seq((1, 2)).toDF("a", "b")
   val df2 = Seq((1, 2)).toDF("aa", "bb")
   val df1Joindf2 = df1.join(df2, df1("a") === df2("aa"))
     .select(df1("a"), df2("aa"), df1("b"))
   
   df1Joindf2.join(df1, df1Joindf2("a") === df1("a")).select(df1("a"))
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   Datasets involving self joins that previously threw ambiguity-related 
exceptions may now work, provided that the columns passed to the APIs are 
extracted from the Datasets being joined at the topmost level.
   
   
   ### How was this patch tested?
   Added new tests and made the assertions stricter. Modified the existing 
tests in DataFrameSelfJoinTest whose columns are logically unambiguous based on 
the Datasets from which they are extracted.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


