zhengruifeng commented on code in PR #55848:
URL: https://github.com/apache/spark/pull/55848#discussion_r3255908276


##########
docs/spark-connect-gotchas.md:
##########
@@ -73,7 +73,7 @@ Unlike query execution, Spark Classic and Spark Connect 
differ in when schema an
 
 # Common Gotchas (with Mitigations)
 
-If you are not careful about the difference between lazy vs. eager analysis, 
there are four key gotchas to be aware of: 1) overwriting temporary view names, 
2) capturing external variables in UDFs, 3) delayed error detection, and 4) 
excessive schema access on new DataFrames.
+If you are not careful about the difference between lazy vs. eager analysis, 
there are five key gotchas to be aware of: 1) overwriting temporary view names, 
2) capturing external variables in UDFs, 3) delayed error detection, 4) 
excessive schema access on new DataFrames, and 5) DataFrame column references 
after a column is shadowed.

Review Comment:
   Good point — dropped the count and the explicit enumeration in 9483d8c4cbf 
since the section headers already list the topics.



##########
docs/spark-connect-gotchas.md:
##########
@@ -418,6 +418,46 @@ println(structColumnFields)
 
 This approach is significantly faster when dealing with a large number of 
columns because it avoids creating and analyzing numerous DataFrames.
 
+## 5. DataFrame column references after column shadowing
+
+In Spark Connect, a DataFrame column reference such as `df["c"]` is tagged 
with the plan id of `df`. At analysis time the server resolves the reference by 
looking for the tagged ancestor in the plan and pulling the matching attribute 
from it. Spark Classic does not use plan ids; it resolves column references 
against the immediate child's output by attribute id and name.
+
+The two resolution strategies diverge once a column has been shadowed by 
another operator that produces an attribute with the same name:
+
+```python
+import pyspark.sql.functions as sf
+
+df = spark.sql("SELECT 1 AS c")
+df.withColumn("c", sf.col("c").cast("string")).select(df["c"]).collect()
+```
+
+`withColumn("c", ...)` does not mutate `df`; it returns a new DataFrame whose 
`c` is a new attribute that hides the original. The trailing `df["c"]` still 
refers to the *original* `c` attribute, which is no longer in the projection 
list.
+
+* **Spark Classic** has always rejected this query at analysis time with 
`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`, because the 
original attribute is not present in the operator's child output.
+* **Spark Connect** rejects it with `CANNOT_RESOLVE_DATAFRAME_COLUMN` by 
default. The plan-id-tagged reference does not match any attribute in the 
current plan. But when the SQL config 
`spark.sql.analyzer.strictDataFrameColumnResolution` (added in Spark 4.2.0, 
default `true`) is set to `false`, the analyzer still tries plan-id-based 
resolution first, and only when that fails does it fall back to name-based 
resolution: the tagged `df["c"]` is then resolved by name against the projected 
`c` from `withColumn`, and the query succeeds.
+
+### Recommended way
+
+If you hit any of the confusing failures mentioned above, it is recommended to 
switch to `sf.col` first. `sf.col("c")` is an untagged name reference that 
resolves against the most recent projection or `withColumn`, rather than 
`df["c"]` which is a tagged reference to `df`'s original column:

Review Comment:
   Updated in 9483d8c4cbf to refer to `col` from `pyspark.sql.functions` in the 
prose so the reference is self-contained without depending on the earlier code 
block.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to