[PR] SPARK-47217. bug fix for exception thrown in reused dataframes involv… [spark]

via GitHub Thu, 29 Feb 2024 16:35:57 -0800


ahshahid opened a new pull request, #45343:
URL: https://github.com/apache/spark/pull/45343


   …ing joins once the plan is de-duplicated. The fix involves using Dataset ID 
associated with the plans & attributes to attempt correct resolution
   
   There seems to be a regression from Spark 2.4, as the following code no 
longer succeeds:
   
   val df = Seq((1, 2)).toDF("a", "b")
   val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
   val df3 = df2.join(df, df2("bb") === df("b")).select(df2("aa"), df("a"))
   df3.show()
   
   
   === Applying Rule org.apache.spark.sql.catalyst.analysis.DeduplicateRelations ===
   !'Join Inner, '`=`(bb#12, b#8)              'Join Inner, '`=`(bb#12, b#18)
    :- Project [a#7 AS aa#11, b#8 AS bb#12]    :- Project [a#7 AS aa#11, b#8 AS bb#12]
    :  +- Project [_1#2 AS a#7, _2#3 AS b#8]   :  +- Project [_1#2 AS a#7, _2#3 AS b#8]
    :     +- LocalRelation [_1#2, _2#3]        :     +- LocalRelation [_1#2, _2#3]
   !+- Project [_1#2 AS a#7, _2#3 AS b#8]      +- Project [_1#15 AS a#17, _2#16 AS b#18]
   !   +- LocalRelation [_1#2, _2#3]              +- LocalRelation [_1#15, _2#16]
   As a result, Spark 3 thinks df("a") is ambiguous:
   Column a#7 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via Dataset.as before joining them, and specify 
the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > 
$"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to 
disable this check.
   
   If we disable spark.sql.analyzer.failAmbiguousSelfJoin, the real issue is 
revealed: because of the deduplication, the last .select(df2("aa"), df("a")) no 
longer works. df("a") still refers to a#7, while the right side of the join now 
outputs a#17.
   
   [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved 
attribute(s) "a" missing from "aa", "bb", "a", "b" in operator !Project [aa#11, 
a#7]. Attribute(s) with the same name appear in the operation: "a".
   Please check if the right attribute(s) are used. SQLSTATE: XX000;
   !Project [aa#11, a#7]
   +- Join Inner, (bb#12 = b#18)
      :- Project [a#7 AS aa#11, b#8 AS bb#12]
      :  +- Project [_1#2 AS a#7, _2#3 AS b#8]
      :     +- LocalRelation [_1#2, _2#3]
      +- Project [_1#15 AS a#17, _2#16 AS b#18]
         +- LocalRelation [_1#15, _2#16]
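
   The failure can be reproduced in miniature without Spark. The sketch below is a toy model only: Attr, DedupSketch and missing are hypothetical names, not Catalyst classes, and the attribute IDs are copied from the plan above. It shows that references are matched by ID rather than by name, so a#7 is not satisfied by the join output even though a column named "a" is present.

```scala
// Toy model only: these definitions are hypothetical stand-ins, not Catalyst classes.
object DedupSketch {
  final case class Attr(name: String, id: Int)

  // Output of each join leg after DeduplicateRelations, taken from the plan above.
  val leftOutput: Seq[Attr]  = Seq(Attr("aa", 11), Attr("bb", 12))
  val rightOutput: Seq[Attr] = Seq(Attr("a", 17), Attr("b", 18))
  val joinOutput: Seq[Attr]  = leftOutput ++ rightOutput

  // The final select still holds the references captured from df2("aa") and df("a").
  val projectList: Seq[Attr] = Seq(Attr("aa", 11), Attr("a", 7))

  // References that no input attribute satisfies (matching is by ID, not by name).
  def missing(refs: Seq[Attr], input: Seq[Attr]): Seq[Attr] =
    refs.filterNot(r => input.exists(_.id == r.id))

  def main(args: Array[String]): Unit = {
    // Reports the stale reference a#7 as missing.
    println(missing(projectList, joinOutput))
  }
}
```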
   
   Similar issues are seen in nested joins and AsOfJoins where the same 
LogicalPlan instance appears more than once in the query.
   
   ### What changes were proposed in this pull request?
   The PR attempts to fix the issue as follows:
   1) If the projection fields contain an AttributeReference that is not found 
in the incoming AttributeSet, and the AttributeReference's metadata contains the 
DatasetId info, then the AttributeReference is converted into a new 
UnresolvedAttributeWithTag, with the original AttributeReference passed as a 
parameter.
   
   2) In ColumnResolutionHelper, a new resolution logic is used to resolve the 
UnresolvedAttributeRefWithTag: the DatasetId is extracted from the original 
AttributeReference's metadata.
   
   3) The first BinaryNode in the LogicalPlan containing this unresolved 
attribute is found.
   Then the unary nodes of its left and right legs are checked for the presence 
of the AttributeReference's DatasetId, using TreeNodeTag("__datasetid").
   If both legs contain the DatasetId, or neither does, a resolution exception 
is thrown.
   Otherwise, the leg that contains the DatasetId is used to resolve the 
attribute.
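
   The decision in step 3 can be sketched without Spark. This is a toy model under stated assumptions, not the code in the PR: Attr, Leg and resolve are hypothetical names, the tag sets are passed in directly instead of being collected from plan nodes, and looking the attribute up by name in the chosen leg is an assumption about how the final resolution might work.

```scala
// Toy model only: every type and name here is a hypothetical stand-in, not Spark's API.
object DatasetIdResolutionSketch {
  final case class Attr(name: String, id: Int)

  // A join leg: the dataset IDs tagged on its nodes, plus its output attributes.
  final case class Leg(datasetIds: Set[Long], output: Seq[Attr])

  // Resolve a stale reference against the single leg tagged with its dataset ID.
  def resolve(name: String, datasetId: Long, left: Leg, right: Leg): Either[String, Attr] = {
    val inLeft  = left.datasetIds.contains(datasetId)
    val inRight = right.datasetIds.contains(datasetId)
    if (inLeft == inRight) {
      // Both legs or neither leg carry the ID: the reference cannot be disambiguated.
      Left(s"cannot resolve '$name': dataset ID $datasetId is in both legs or in neither")
    } else {
      val leg = if (inLeft) left else right
      // Assumption: the attribute is looked up by name in the chosen leg's output.
      leg.output.find(_.name == name).toRight(s"'$name' is not in the output of the chosen leg")
    }
  }

  def main(args: Array[String]): Unit = {
    // The ID values 1 and 2, and their assignment to legs, are assumptions for illustration.
    val left  = Leg(Set(2L), Seq(Attr("aa", 11), Attr("bb", 12)))
    val right = Leg(Set(1L), Seq(Attr("a", 17), Attr("b", 18)))
    println(resolve("a", 1L, left, right)) // picks the right leg, so "a" maps to a#17
    println(resolve("a", 3L, left, right)) // neither leg is tagged with 3: an error
  }
}
```

   Returning an Either keeps the two failure cases explicit in this sketch; the PR description says a resolution exception is thrown instead.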
   
   
   ### Why are the changes needed?
   To fix the bug as exposed by the unit tests in the PR
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Precheckin run.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


