Re: [PR] [SPARK-47217][SQL] Fix ambiguity check in self joins [spark]

via GitHub Tue, 12 Mar 2024 15:30:12 -0700


ahshahid commented on code in PR #45343:
URL: https://github.com/apache/spark/pull/45343#discussion_r1522039397



##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSelfJoinSuite.scala:
##########
@@ -498,4 +559,70 @@ class DataFrameSelfJoinSuite extends QueryTest with SharedSparkSession {
       assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1)
     }
   }
+
+  test("SPARK_47217: deduplication of project causes ambiguity in resolution") {
+    val df = Seq((1, 2)).toDF("a", "b")
+    val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
+    val df3 = df2.join(df, df2("bb") === df("b")).select(df2("aa"), df("a"))
+    checkAnswer(
+      df3,
+      Row(1, 1) :: Nil)
+  }
+
+  test("SPARK-47217. deduplication in nested joins with join attribute aliased") {
+    val df1 = Seq((1, 2)).toDF("a", "b")
+    val df2 = Seq((1, 2)).toDF("aa", "bb")
+    val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a").as("aaa"),
+      df2("aa"), df1("b"))
+
+    assertCorrectResolution(df1Joindf2.join(df1, df1Joindf2("aaa") === df1("a")),
+      Resolution.LeftConditionToLeftLeg, Resolution.RightConditionToRightLeg)
+
+    assertCorrectResolution(df1.join(df1Joindf2, df1Joindf2("aaa") === df1("a")),
+      Resolution.LeftConditionToRightLeg, Resolution.RightConditionToLeftLeg)
+
+    val proj1 = df1Joindf2.join(df1, df1Joindf2("aaa") === df1("a")).select(df1Joindf2("aa"),
+      df1("a")).queryExecution.analyzed.asInstanceOf[Project]
+    val join1 = proj1.child.asInstanceOf[Join]
+    assert(proj1.projectList(0).references.subsetOf(join1.left.outputSet))
+    assert(proj1.projectList(1).references.subsetOf(join1.right.outputSet))
+
+    val proj2 = df1.join(df1Joindf2, df1Joindf2("aaa") === df1("a")).select(df1Joindf2("aa"),
+      df1("a")).queryExecution.analyzed.asInstanceOf[Project]
+    val join2 = proj2.child.asInstanceOf[Join]
+    assert(proj2.projectList(0).references.subsetOf(join2.right.outputSet))
+    assert(proj2.projectList(1).references.subsetOf(join2.left.outputSet))
+  }
+
+  test("SPARK-47217. deduplication in nested joins without join attribute aliased") {
+    val df1 = Seq((1, 2)).toDF("a", "b")
+    val df2 = Seq((1, 2)).toDF("aa", "bb")
+    val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b"))
+
+    assertCorrectResolution(df1Joindf2.join(df1, df1Joindf2("a") === df1("a")),
+      Resolution.LeftConditionToLeftLeg, Resolution.RightConditionToRightLeg)
+
+    assertCorrectResolution(df1.join(df1Joindf2, df1Joindf2("a") === df1("a")),
+      Resolution.LeftConditionToRightLeg, Resolution.RightConditionToLeftLeg)
+
+    val proj1 = df1Joindf2.join(df1, df1Joindf2("a") === df1("a")).select(df1Joindf2("a"),
+      df1("a")).queryExecution.analyzed.asInstanceOf[Project]

Review Comment:
   @peter-toth 
   With this PR's change, a case like df1Joindf2.join(df1, df1("a") === df1("a"))
   resolves both the LHS and the RHS of the condition to the same df1 dataset
   (which makes the join a cross product), and that situation is later handled
   by brute force in the function "resolveSelfJoinCondition".
   But if the user writes df1Joindf2.join(df1, df1("a") === df1Joindf2("a")),
   it is handled by this PR's change, as it is a valid situation.
   
   "But in a select on the join result using df1("a") should be ambiguous as 
df1("a") could be selected from both legs of the join. I.e. both 
df1Joindf2.select(df1("a")) and df1.select(df1("a")) work."
   
   My interpretation of the above example is:
   val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b"))
   
   The join condition below is resolved by the hack:
   val df4 = df1Joindf2.join(df1, df1("a") === df1("a"))
   
   whereas if it were
   val df4 = df1Joindf2.join(df1, df1("a") === df1Joindf2("a"))
   the resolution would be natural and logical.
   
   df4 is a join of df1Joindf2 and df1 (we consider only the top-level join),
   so in a select on df4 such as
   df4.select(df1("a"), df1Joindf2("a"))
   there is, IMO, no ambiguity, as one column is taken from df1 and the other
   from df1Joindf2.
   
   Moreover, the join condition in itself should not affect the output
   attributes of the Join plan, irrespective of how the join condition is
   interpreted (via the hack or the new code path).
   
   I understand that from the point of view of ExprId this can be viewed as an
   ambiguity, though that ambiguity goes away when viewed from the datasetId.
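   
   To make the ExprId versus datasetId point concrete, here is a minimal toy
   sketch in plain Scala. It is not Spark code: ColumnRef, Leg, legsByExprId
   and legsByDatasetId are hypothetical names invented for illustration, the
   ids are arbitrary, and it assumes only the top-level dataset id of each
   join leg is compared, as argued above.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's classes.
object DatasetIdSketch {
  // A column reference as the user writes it, e.g. df1("a"): the attribute's
  // expression id plus the id of the Dataset the reference was taken from.
  final case class ColumnRef(name: String, exprId: Long, datasetId: Long)

  // One leg of the top-level join: the id of the Dataset forming the leg and
  // the expression ids that leg outputs.
  final case class Leg(label: String, datasetId: Long, outputExprIds: Set[Long])

  // Matching on expression id alone: every leg that outputs the attribute qualifies.
  def legsByExprId(ref: ColumnRef, legs: Seq[Leg]): Seq[String] =
    legs.filter(_.outputExprIds.contains(ref.exprId)).map(_.label)

  // Matching on expression id AND the leg's top-level dataset id (assumption:
  // only the top-level join is considered, not the lineage inside each leg).
  def legsByDatasetId(ref: ColumnRef, legs: Seq[Leg]): Seq[String] =
    legs
      .filter(l => l.datasetId == ref.datasetId && l.outputExprIds.contains(ref.exprId))
      .map(_.label)

  // Arbitrary ids: a -> 1, b -> 2, aa -> 3; df1 is dataset 1, df1Joindf2 is dataset 3.
  val left: Leg = Leg("df1Joindf2", datasetId = 3L, outputExprIds = Set(1L, 3L, 2L))
  val right: Leg = Leg("df1", datasetId = 1L, outputExprIds = Set(1L, 2L))
  val legs: Seq[Leg] = Seq(left, right)

  val df1A: ColumnRef = ColumnRef("a", exprId = 1L, datasetId = 1L)        // df1("a")
  val df1Joindf2A: ColumnRef = ColumnRef("a", exprId = 1L, datasetId = 3L) // df1Joindf2("a")

  def main(args: Array[String]): Unit = {
    println(legsByExprId(df1A, legs))           // List(df1Joindf2, df1): two candidates
    println(legsByDatasetId(df1A, legs))        // List(df1): one candidate
    println(legsByDatasetId(df1Joindf2A, legs)) // List(df1Joindf2): one candidate
  }
}
```

   In this toy model df1("a") matches both legs by expression id, but only the
   df1 leg once the dataset id is taken into account. Whether lineage inside a
   leg should also count is exactly the open question in this thread.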
   
   


