ahshahid opened a new pull request, #45343:
URL: https://github.com/apache/spark/pull/45343
…ing joins once the plan is de-duplicated. The fix uses the Dataset ID
associated with the plans & attributes to attempt correct resolution.
There seems to be a regression from Spark 2.4, as the following code no
longer succeeds:
val df = Seq((1, 2)).toDF("a", "b")
val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
val df3 = df2.join(df, df2("bb") === df("b")).select(df2("aa"), df("a"))
df3.show()
=== Applying Rule org.apache.spark.sql.catalyst.analysis.DeduplicateRelations ===
!'Join Inner, '`=`(bb#12, b#8)                 'Join Inner, '`=`(bb#12, b#18)
 :- Project [a#7 AS aa#11, b#8 AS bb#12]       :- Project [a#7 AS aa#11, b#8 AS bb#12]
 :  +- Project [_1#2 AS a#7, _2#3 AS b#8]      :  +- Project [_1#2 AS a#7, _2#3 AS b#8]
 :     +- LocalRelation [_1#2, _2#3]           :     +- LocalRelation [_1#2, _2#3]
!+- Project [_1#2 AS a#7, _2#3 AS b#8]         +- Project [_1#15 AS a#17, _2#16 AS b#18]
!   +- LocalRelation [_1#2, _2#3]                 +- LocalRelation [_1#15, _2#16]
and so Spark 3 thinks df("a") is ambiguous:
Column a#7 are ambiguous. It's probably because you joined several Datasets
together, and some of these Datasets are the same. This column points to one of
the Datasets but Spark is unable to figure out which one. Please alias the
Datasets with different names via Dataset.as before joining them, and specify
the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" >
$"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to
disable this check.
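
The error message itself suggests aliasing both sides of the join. A minimal sketch of that workaround applied to the repro above, assuming a local SparkSession; the object name, the helper `joinWithAliases`, and the aliases `l` and `r` are illustrative only. This shows the user-side workaround, not the fix proposed in this PR:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AliasWorkaround {
  // Hypothetical helper: joins a projection of `df` back to `df`,
  // referring to columns by alias-qualified names instead of df("...").
  def joinWithAliases(df: DataFrame): DataFrame = {
    val left = df.select(df("a").as("aa"), df("b").as("bb")).as("l")
    val right = df.as("r")
    left
      .join(right, col("l.bb") === col("r.b"))
      .select(col("l.aa"), col("r.a"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("alias-workaround")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2)).toDF("a", "b")
    joinWithAliases(df).show()

    spark.stop()
  }
}
```

Because the columns are looked up by qualifier after de-duplication, they do not depend on the attribute IDs captured by df("a") before the join.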
If we disable spark.sql.analyzer.failAmbiguousSelfJoin, the real issue is
revealed: because of the deduplication, the final .select(df2("aa"), df("a"))
no longer works:
[MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved
attribute(s) "a" missing from "aa", "bb", "a", "b" in operator !Project [aa#11,
a#7]. Attribute(s) with the same name appear in the operation: "a".
Please check if the right attribute(s) are used. SQLSTATE: XX000;
!Project [aa#11, a#7]
+- Join Inner, (bb#12 = b#18)
   :- Project [a#7 AS aa#11, b#8 AS bb#12]
   :  +- Project [_1#2 AS a#7, _2#3 AS b#8]
   :     +- LocalRelation [_1#2, _2#3]
   +- Project [_1#15 AS a#17, _2#16 AS b#18]
      +- LocalRelation [_1#15, _2#16]
Similar issues are seen in nested joins and AsOfJoins, where the same
logical plan is referenced more than once in the query.
### What changes were proposed in this pull request?
The PR attempts to fix the issue in the following way:
1) If the projection fields contain an AttributeReference that is not found
in the incoming AttributeSet, and the AttributeReference's metadata contains
the DatasetId info, then the AttributeReference is converted into a new
UnresolvedAttributeWithTag, with the original AttributeReference passed as a
parameter.
2) In ColumnResolutionHelper, a new resolution logic is used to resolve the
UnresolvedAttributeRefWithTag:
the Dataset ID is extracted from the original attribute reference's metadata.
3) The first BinaryNode contained in the LogicalPlan containing this
unresolved attribute is found.
Then the unary nodes of its right leg and left leg are checked for the
presence of the attribute reference's Dataset ID, using
TreeNodeTag("__datasetid").
If both legs contain the Dataset ID, or neither does, a resolution
exception is thrown.
Otherwise, the leg that contains the Dataset ID is used to resolve the
attribute.
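
The leg-selection rule in step 3 can be sketched in plain Scala. This is a simplified, hypothetical model and not the code in this PR: `Plan`, `Leaf`, `Unary`, `Binary`, and `LegSelection` are made-up stand-ins for Catalyst classes, Dataset IDs are modelled as a plain `Set[Long]` on each node rather than a TreeNodeTag, and stopping the walk at the first non-unary node is an assumption.

```scala
// Hypothetical stand-ins for logical plan nodes; NOT Spark classes.
sealed trait Plan
case class Leaf() extends Plan
case class Unary(child: Plan, datasetIds: Set[Long] = Set.empty) extends Plan
case class Binary(left: Plan, right: Plan) extends Plan

object LegSelection {
  // Check the chain of unary nodes at the top of a leg for the dataset id.
  // Assumption: the walk stops at the first non-unary node.
  def legContains(plan: Plan, id: Long): Boolean = plan match {
    case Unary(child, ids) => ids.contains(id) || legContains(child, id)
    case _                 => false
  }

  // Find the first binary node, looking top-down through unary nodes.
  def firstBinary(plan: Plan): Option[Binary] = plan match {
    case b: Binary       => Some(b)
    case Unary(child, _) => firstBinary(child)
    case Leaf()          => None
  }

  // Pick the leg to resolve against, or report why resolution fails.
  def chooseLeg(plan: Plan, id: Long): Either[String, Plan] =
    firstBinary(plan) match {
      case None => Left("no binary node found")
      case Some(Binary(l, r)) =>
        (legContains(l, id), legContains(r, id)) match {
          case (true, false)  => Right(l)
          case (false, true)  => Right(r)
          case (true, true)   => Left("both legs carry the dataset id")
          case (false, false) => Left("neither leg carries the dataset id")
        }
    }
}
```

For example, with `val join = Binary(Unary(Leaf(), Set(1L)), Unary(Leaf(), Set(2L)))`, calling `LegSelection.chooseLeg(Unary(join), 2L)` selects the right leg, while an ID carried by both legs or by neither yields a `Left`, mirroring the exception case described above.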
### Why are the changes needed?
To fix the bug exposed by the unit tests in the PR.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Precheckin run.
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]