Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

via GitHub Sat, 16 Mar 2024 02:59:43 -0700


peter-toth commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1527145045



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   I have 2 notes on the above:
   - @ahshahid, the following worked in Spark 3.5 but fails in 4.0 after https://github.com/apache/spark/pull/41347, for the same reason as described in the old https://github.com/apache/spark/pull/45343:
     ```
     test("SPARK-47217: DeduplicateRelations issue 4") {
       Seq(true, false).foreach(fail =>
         withSQLConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> fail.toString) {
           val df = Seq((1, 2)).toDF("a", "b")
           val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
           val df3 = df.select(df("a"), df("b"))
           // `df("a")` doesn't come from the join's direct children, but from
           // its descendants
           val df4 = df2.join(df3, df2("bb") === df("b")).select(df2("aa"), df("a"))
           checkAnswer(df4, Row(1, 1) :: Nil)
         }
       )
     }
     ```
     In this test, the expression id of `df("a")` gets deduplicated (on the right side of the join), so the original expression id doesn't work in the final select. But I think this test case proves that we need `tryResolveDataFrameColumns()`-like deep recursion when we try resolving by plan ids.
   - @cloud-fan, I think there is a different problem with `tryResolveDataFrameColumns()`.
     In a quick test I tried to use it for "re-resolving" attribute references that had become invalid: https://github.com/peter-toth/spark/commit/a873c24372b1d87184149bc5e65c96da1b0db879, but a few test cases failed because some logical plans can belong to multiple datasets. E.g. if we have:
     ```
     val df = Seq((1, 2)).toDF("a", "b")
     val df2 = df.toDF()
     ``` 
     then `df` and `df2` share the same logical plan instance, and we can't store multiple ids in the current `LogicalPlan.PLAN_ID_TAG`.
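
     To make the single-tag limitation concrete, here is a toy model of it. This is only a sketch and does not use Spark's classes: `ToyPlan`, `ToyDataset`, `PlanIdTagSketch` and `tagged` are hypothetical names, and the tag map is an assumed simplification of how a single-valued tag on a plan node behaves.

     ```scala
     import scala.collection.mutable

     // Hypothetical stand-in for a plan node with single-valued tags.
     // This is NOT Spark's TreeNode/LogicalPlan API.
     final class ToyPlan(val name: String) {
       private val tags = mutable.Map.empty[String, Long]
       def setTag(key: String, value: Long): Unit = tags(key) = value
       def getTag(key: String): Option[Long] = tags.get(key)
     }

     // Hypothetical stand-in for a Dataset: an id plus a reference to a plan.
     final case class ToyDataset(id: Long, plan: ToyPlan)

     object PlanIdTagSketch {
       // Assumed tag name, for illustration only.
       val PlanIdTag = "plan_id"

       // Records the dataset's id on its plan; any previous id is overwritten.
       def tagged(ds: ToyDataset): ToyDataset = {
         ds.plan.setTag(PlanIdTag, ds.id)
         ds
       }

       def main(args: Array[String]): Unit = {
         val plan = new ToyPlan("LocalRelation[a, b]")
         val df = tagged(ToyDataset(1L, plan))
         // Same plan instance as `df`, modelling the `df.toDF()` case above.
         val df2 = tagged(ToyDataset(2L, plan))

         println(df.plan eq df2.plan)       // true
         println(df.plan.getTag(PlanIdTag)) // Some(2): the id of `df` is lost
       }
     }
     ```

     In this model, a lookup by the id of `df` can no longer find the plan once `df2` has been tagged, which is the kind of failure I saw in the quick test.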



