ahshahid commented on code in PR #45343:
URL: https://github.com/apache/spark/pull/45343#discussion_r1522039397
##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSelfJoinSuite.scala:
##########
@@ -498,4 +559,70 @@ class DataFrameSelfJoinSuite extends QueryTest with SharedSparkSession {
assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1)
}
}
+
+ test("SPARK_47217: deduplication of project causes ambiguity in resolution") {
+ val df = Seq((1, 2)).toDF("a", "b")
+ val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
+ val df3 = df2.join(df, df2("bb") === df("b")).select(df2("aa"), df("a"))
+ checkAnswer(
+ df3,
+ Row(1, 1) :: Nil)
+ }
+
+ test("SPARK-47217. deduplication in nested joins with join attribute aliased") {
+ val df1 = Seq((1, 2)).toDF("a", "b")
+ val df2 = Seq((1, 2)).toDF("aa", "bb")
+ val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a").as("aaa"),
+ df2("aa"), df1("b"))
+
+ assertCorrectResolution(df1Joindf2.join(df1, df1Joindf2("aaa") === df1("a")),
+ Resolution.LeftConditionToLeftLeg, Resolution.RightConditionToRightLeg)
+
+ assertCorrectResolution(df1.join(df1Joindf2, df1Joindf2("aaa") === df1("a")),
+ Resolution.LeftConditionToRightLeg, Resolution.RightConditionToLeftLeg)
+
+ val proj1 = df1Joindf2.join(df1, df1Joindf2("aaa") === df1("a")).select(df1Joindf2("aa"),
+ df1("a")).queryExecution.analyzed.asInstanceOf[Project]
+ val join1 = proj1.child.asInstanceOf[Join]
+ assert(proj1.projectList(0).references.subsetOf(join1.left.outputSet))
+ assert(proj1.projectList(1).references.subsetOf(join1.right.outputSet))
+
+ val proj2 = df1.join(df1Joindf2, df1Joindf2("aaa") === df1("a")).select(df1Joindf2("aa"),
+ df1("a")).queryExecution.analyzed.asInstanceOf[Project]
+ val join2 = proj2.child.asInstanceOf[Join]
+ assert(proj2.projectList(0).references.subsetOf(join2.right.outputSet))
+ assert(proj2.projectList(1).references.subsetOf(join2.left.outputSet))
+ }
+
+ test("SPARK-47217. deduplication in nested joins without join attribute aliased") {
+ val df1 = Seq((1, 2)).toDF("a", "b")
+ val df2 = Seq((1, 2)).toDF("aa", "bb")
+ val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b"))
+
+ assertCorrectResolution(df1Joindf2.join(df1, df1Joindf2("a") === df1("a")),
+ Resolution.LeftConditionToLeftLeg, Resolution.RightConditionToRightLeg)
+
+ assertCorrectResolution(df1.join(df1Joindf2, df1Joindf2("a") === df1("a")),
+ Resolution.LeftConditionToRightLeg, Resolution.RightConditionToLeftLeg)
+
+ val proj1 = df1Joindf2.join(df1, df1Joindf2("a") === df1("a")).select(df1Joindf2("a"),
+ df1("a")).queryExecution.analyzed.asInstanceOf[Project]
Review Comment:
@peter-toth
As per this PR's code change, a case like df1Joindf2.join(df1, df1("a") ===
df1("a")) resolves both the LHS and the RHS to the same df1 dataset (which
makes the join a cross product), and this situation is later handled by
brute force in the function "resolveSelfJoinCondition".
But if the user writes df1Joindf2.join(df1, df1("a") === df1Joindf2("a")),
it is handled by this PR's change, as it is a valid situation.
"But in a select on the join result using df1("a") should be ambiguous as
df1("a") could be selected from both legs of the join. I.e. both
df1Joindf2.select(df1("a")) and df1.select(df1("a")) work."
Based on my understanding of the above example, the way I interpret it is:
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
df2("aa"), df1("b"))
The join condition below is resolved by the HACK:
val df4 = df1Joindf2.join(df1, df1("a") === df1("a"))
whereas if it were
val df4 = df1Joindf2.join(df1, df1("a") === df1Joindf2("a"))
the resolution would be natural and logical.
df4 is a join of df1Joindf2 and df1 (we consider only the top-level join),
so in a select on df4, IMO,
df4.select(df1("a"), df1Joindf2("a")), there is no ambiguity, as one column is
taken from df1 and the other from df1Joindf2.
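To make the scenario easy to try, here is a minimal standalone sketch of the joins and the select discussed above. The object name SelfJoinRepro and the local SparkSession setup are my own assumptions for illustration and are not part of the PR. Each plan is wrapped in Try because the outcome depends on the Spark build: a build without this PR may reject some of these with an ambiguity error, so no expected output is claimed.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Hypothetical reproduction harness; only the DataFrame expressions come from the discussion.
object SelfJoinRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("self-join-repro")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, 2)).toDF("a", "b")
    val df2 = Seq((1, 2)).toDF("aa", "bb")
    val df1Joindf2 = df1.join(df2, df1("a") === df2("aa"))
      .select(df1("a"), df2("aa"), df1("b"))

    // Case 1: both sides of the condition are written as df1("a").
    val sameSides = Try(
      df1Joindf2.join(df1, df1("a") === df1("a")).queryExecution.analyzed)

    // Case 2: each side of the condition names a different dataset.
    val distinctSides = Try(
      df1Joindf2.join(df1, df1("a") === df1Joindf2("a")).queryExecution.analyzed)

    // The select under discussion, applied on top of the case 2 join.
    val selected = Try(
      df1Joindf2.join(df1, df1("a") === df1Joindf2("a"))
        .select(df1("a"), df1Joindf2("a"))
        .queryExecution.analyzed)

    // Print either the analyzed plan or the analysis failure for each case.
    Seq("same sides" -> sameSides, "distinct sides" -> distinctSides, "select" -> selected)
      .foreach { case (label, result) =>
        println(s"$label: ${result.map(_.treeString)}")
      }

    spark.stop()
  }
}
```

Comparing the printed plans (or failures) with and without the PR applied shows which leg each reference is bound to.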
Moreover, the join condition itself should not affect the output
attributes of the Join plan, irrespective of how the join condition is
interpreted (via the hack or the new code path).
I understand that from the point of view of ExprId it can be viewed as an
ambiguity, though that ambiguity goes away when viewed from the datasetId.
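The ExprId versus datasetId point can be illustrated with a toy model. This is only a sketch of the argument, not Spark's analyzer: ColRef, Leg, ResolutionSketch and the numeric ids are all hypothetical names and values chosen for illustration. It assumes both legs expose the same expression id for "a" and that each column reference remembers the dataset it was taken from.

```scala
// Toy model only: none of these types exist in Spark.
// A column reference carries the expression id and the id of the dataset it came from.
final case class ColRef(name: String, exprId: Long, datasetId: Long)

// A join leg, identified by its top-level dataset id and the expression ids it outputs.
final case class Leg(datasetId: Long, outputExprIds: Set[Long])

object ResolutionSketch {
  // Matching on expression id alone: every leg that outputs the id is a candidate.
  def byExprId(col: ColRef, legs: Seq[Leg]): Seq[Leg] =
    legs.filter(_.outputExprIds.contains(col.exprId))

  // Additionally require the leg's top-level dataset id to match the reference's dataset id.
  def byDatasetId(col: ColRef, legs: Seq[Leg]): Seq[Leg] =
    byExprId(col, legs).filter(_.datasetId == col.datasetId)

  // Hypothetical ids: df1 is dataset 1, df1Joindf2 is dataset 3, column "a" has exprId 10.
  val left: Leg = Leg(datasetId = 3L, outputExprIds = Set(10L, 20L, 11L)) // df1Joindf2
  val right: Leg = Leg(datasetId = 1L, outputExprIds = Set(10L, 11L))     // df1
  val legs: Seq[Leg] = Seq(left, right)

  val df1A: ColRef = ColRef("a", exprId = 10L, datasetId = 1L)        // df1("a")
  val df1Joindf2A: ColRef = ColRef("a", exprId = 10L, datasetId = 3L) // df1Joindf2("a")
}
```

In this model ResolutionSketch.byExprId(df1A, legs) returns both legs, which is the ambiguity seen from the ExprId side, while byDatasetId returns only the df1 leg for df1A and only the df1Joindf2 leg for df1Joindf2A.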
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]