zhengruifeng opened a new pull request, #45214:
URL: https://github.com/apache/spark/pull/45214

   ### What changes were proposed in this pull request?
   Make `ResolveRelations` handle plan id properly
   
   
   ### Why are the changes needed?
   bug fix, before this PR:
   ```
   from pyspark.sql import functions as sf
   
   spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
   spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")
   
   
   df1 = spark.read.table("test_table_1")
   df2 = spark.read.table("test_table_2")
   df3 = spark.read.table("test_table_1")
   
   
   join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
   join2 = df3.join(join1, how="left", on=join1.index==df3.id)
   
   join2.schema
   ```
   
   fails with
   ```
   AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
   ```
   
   This happens because the existing plan caching in `ResolveRelations` does not work with Spark Connect:
   
   ```
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
    '[#12]Join LeftOuter, '`==`('index, 'id)                     '[#12]Join LeftOuter, '`==`('index, 'id)
   !:- '[#9]UnresolvedRelation [test_table_1], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   !+- '[#11]Project ['index, 'value_2]                          :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
   !   +- '[#10]Join Inner, '`==`('id, 'index)                   +- '[#11]Project ['index, 'value_2]
   !      :- '[#7]UnresolvedRelation [test_table_1], [], false      +- '[#10]Join Inner, '`==`('id, 'index)
   !      +- '[#8]UnresolvedRelation [test_table_2], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   !                                                                   :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
   !                                                                   +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
   !                                                                      +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false
   
   
   
   Can not resolve 'id with plan 7
   ```
   
   `[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached plan, which carries plan id 9 instead of 7:
   ```
   :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
      +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
   ```
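
The failure mode can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's implementation: `ResolvedRelation`, `NaiveResolver`, and `PlanIdAwareResolver` are hypothetical names, and re-tagging a copy of the cached plan with the caller's plan id is an assumed way of handling plan ids, shown here only to make the problem concrete.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class ResolvedRelation:
    """Hypothetical stand-in for a resolved relation tagged with a plan id."""
    table: str
    plan_id: Optional[int]


class NaiveResolver:
    """Caches by table name and returns the cached object unchanged."""

    def __init__(self) -> None:
        self._cache: Dict[str, ResolvedRelation] = {}

    def resolve(self, table: str, plan_id: int) -> ResolvedRelation:
        if table not in self._cache:
            self._cache[table] = ResolvedRelation(table, plan_id)
        # Later callers get the plan id of whoever populated the cache.
        return self._cache[table]


class PlanIdAwareResolver(NaiveResolver):
    """Reuses the cached resolution but re-tags a copy with the caller's plan id."""

    def resolve(self, table: str, plan_id: int) -> ResolvedRelation:
        cached = super().resolve(table, plan_id)
        return replace(cached, plan_id=plan_id)


naive = NaiveResolver()
first = naive.resolve("test_table_1", 9)   # df3, resolved first
second = naive.resolve("test_table_1", 7)  # df1, served from the cache

aware = PlanIdAwareResolver()
aware.resolve("test_table_1", 9)
fixed = aware.resolve("test_table_1", 7)

print(second.plan_id, fixed.plan_id)
```

In the naive version the second lookup comes back tagged with plan id 9, so a column reference bound to plan 7 (such as `df1.id` in the reproduction) has no matching node, which mirrors the `Can not resolve 'id with plan 7` message above.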
   
   ### Does this PR introduce _any_ user-facing change?
   yes, bug fix
   
   ### How was this patch tested?
   Added unit tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   ci
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

