peter-toth commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1527145045
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with
DataTypeErrorsBase {
assert(q.children.length == 1)
q.children.head.output
},
+
+ resolveOnDatasetId = (datasetid: Long, name: String) => {
Review Comment:
I have 2 notes to the above:
- @ahshahid, the following worked in Spark 3.5 but failes in 4.0 after
https://github.com/apache/spark/pull/41347 for the same reason as described in
the old https://github.com/apache/spark/pull/45343:
```
test("SPARK-47217: DeduplicateRelations issue 4") {
Seq(true, false).foreach(fail =>
withSQLConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key ->
fail.toString) {
val df = Seq((1, 2)).toDF("a", "b")
val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
val df3 = df.select(df("a"), df("b"))
val df4 = df2.join(df3, df2("bb") === df("b")).select(df2("aa"),
df("a")) // `df("a")` doesn't come the the join's direct children, but from
it's descendants
checkAnswer(df4, Row(1, 1) :: Nil)
}
)
}
```
In this test `df("a")`'s expression id gets deduplicated (in the right
side of the join) and so the original expression id doesn't work in the final
select. But I think this test case proves that we need
`tryResolveDataFrameColumns()` like deep recursion when we try resolving by
plan ids.
- @cloud-fan, I think there is different problem with
`tryResolveDataFrameColumns()`.
I did try to use it for "re-resolving" attribute references that became
invalid in a quick test:
https://github.com/peter-toth/spark/commit/a873c24372b1d87184149bc5e65c96da1b0db879,
but a few test cases failed due to some logicalplans can belong to multiple
datasets. E.g. if we have:
```
val df = Seq((1, 2)).toDF("a", "b")
val df2 = df.toDF()
```
then the `df` and `df2` shares the same logicalplan instance and we can't
store multiple ids in the current `LogicalPlan.PLAN_ID_TAG`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]