(spark) branch branch-4.x updated: [SPARK-56846][DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas

ruifengz Sun, 17 May 2026 18:40:42 -0700

This is an automated email from the ASF dual-hosted git repository.

zhengruifeng pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-4.x by this push:
     new 80658263d809 [SPARK-56846][DOCS][CONNECT] Document DataFrame column 
resolution behavior in spark-connect-gotchas
80658263d809 is described below

commit 80658263d80955cdc00e979644e2aee3645842e6
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Mon May 18 09:40:09 2026 +0800

    [SPARK-56846][DOCS][CONNECT] Document DataFrame column resolution behavior 
in spark-connect-gotchas
    
    ### What changes were proposed in this pull request?
    
    Add a new gotcha section to `docs/spark-connect-gotchas.md` describing how 
Spark Connect resolves DataFrame column references (`df["c"]`) via plan-id 
tagging, and how this diverges from Spark Classic once a column has been 
shadowed by `withColumn` or `select`.
    
    The new section covers:
    
    - The motivating example: `df.withColumn("c", 
sf.col("c").cast("string")).select(df["c"])` fails on both Spark Classic 
(`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`) and Spark Connect 
(`CANNOT_RESOLVE_DATAFRAME_COLUMN`).
    - How the SQL config `spark.sql.analyzer.strictDataFrameColumnResolution` 
(added in Spark 4.2.0 by apache/spark#55531, default `true`) controls Spark 
Connect's behavior: under strict resolution the query fails; with 
`strictDataFrameColumnResolution=false`, the analyzer still tries plan-id-based 
resolution first and only falls back to name-based resolution when that fails, 
causing the same query to succeed.
    - A "Recommended way" subsection: switch to `sf.col("c")` (an untagged name 
reference) instead of `df["c"]` (a tagged reference to `df`'s original column) 
when the column has been shadowed. Includes both Python and Scala examples.
    - The escape-hatch note: 
`spark.sql.analyzer.strictDataFrameColumnResolution=false` for users who cannot 
change the call sites.
    
    Also adds a "DataFrame column references" row to the summary table at the 
end of the document (Eagerly resolved vs Lazily resolved against plan id), 
consistent with the eager/lazy framing used throughout the file.
    
    ### Why are the changes needed?
    
    The plan-id-based column resolution path is a Spark Connect-specific 
contract that is not documented anywhere user-facing. Users migrating workloads 
to Spark Connect have encountered surprises when patterns that previously 
"worked" stop resolving, with an error class 
(`CANNOT_RESOLVE_DATAFRAME_COLUMN`) and a config 
(`strictDataFrameColumnResolution`) whose connection to their code is not 
obvious. This adds explicit guidance and a code-level mitigation alongside the 
other Connect-vs-Cl [...]
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. Documentation-only change.
    
    ### How was this patch tested?
    
    Documentation-only change; no automated tests. Verified the markdown 
renders correctly and is consistent with the existing four-gotcha layout in 
`docs/spark-connect-gotchas.md`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: Claude Code (Anthropic), claude-opus-4-7
    
    Closes #55848 from zhengruifeng/SPARK-doc-col-diff.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    (cherry picked from commit 76fc01806e91b5ddaae3ae89068a83b50ebdbde2)
    Signed-off-by: Ruifeng Zheng <[email protected]>
---
 docs/spark-connect-gotchas.md | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/docs/spark-connect-gotchas.md b/docs/spark-connect-gotchas.md
index f1973133d335..135fc22312d8 100644
--- a/docs/spark-connect-gotchas.md
+++ b/docs/spark-connect-gotchas.md
@@ -73,7 +73,7 @@ Unlike query execution, Spark Classic and Spark Connect 
differ in when schema an
 
 # Common Gotchas (with Mitigations)
 
-If you are not careful about the difference between lazy vs. eager analysis, 
there are four key gotchas to be aware of: 1) overwriting temporary view names, 
2) capturing external variables in UDFs, 3) delayed error detection, and 4) 
excessive schema access on new DataFrames.
+If you are not careful about the difference between lazy vs. eager analysis, 
there are several key gotchas to be aware of.
 
 ## 1. Reusing temporary view names
 
@@ -418,6 +418,46 @@ println(structColumnFields)
 
 This approach is significantly faster when dealing with a large number of 
columns because it avoids creating and analyzing numerous DataFrames.
 
+## 5. DataFrame column references after column shadowing
+
+In Spark Connect, a DataFrame column reference such as `df["c"]` is tagged 
with the plan id of `df`. At analysis time the server resolves the reference by 
looking for the tagged ancestor in the plan and pulling the matching attribute 
from it. Spark Classic does not use plan ids; it resolves column references 
against the immediate child's output by attribute id and name.
+
+The two resolution strategies diverge once a column has been shadowed by 
another operator that produces an attribute with the same name:
+
+```python
+import pyspark.sql.functions as sf
+
+df = spark.sql("SELECT 1 AS c")
+df.withColumn("c", sf.col("c").cast("string")).select(df["c"]).collect()
+```
+
+`withColumn("c", ...)` does not mutate `df`; it returns a new DataFrame whose 
`c` is a new attribute that hides the original. The trailing `df["c"]` still 
refers to the *original* `c` attribute, which is no longer in the projection 
list.
+
+* **Spark Classic** has always rejected this query at analysis time with 
`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`, because the 
original attribute is not present in the operator's child output.
+* **Spark Connect** rejects it with `CANNOT_RESOLVE_DATAFRAME_COLUMN` by 
default. The plan-id-tagged reference does not match any attribute in the 
current plan. But when the SQL config 
`spark.sql.analyzer.strictDataFrameColumnResolution` (added in Spark 4.2.0, 
default `true`) is set to `false`, the analyzer still tries plan-id-based 
resolution first, and only when that fails does it fall back to name-based 
resolution: the tagged `df["c"]` is then resolved by name against the projected 
`c [...]
+
+### Recommended way
+
+If you hit any of the confusing failures mentioned above, it is recommended to 
switch to `col` from `pyspark.sql.functions` first. `col("c")` is an untagged 
name reference that resolves against the most recent projection or 
`withColumn`, rather than `df["c"]` which is a tagged reference to `df`'s 
original column:
+
+```python
+import pyspark.sql.functions as sf
+
+df = spark.sql("SELECT 1 AS c")
+df.withColumn("c", sf.col("c").cast("string")).select(sf.col("c")).collect()
+```
+
+**Scala example:**
+
+```scala
+import org.apache.spark.sql.functions._
+
+val df = spark.sql("SELECT 1 AS c")
+df.withColumn("c", col("c").cast("string")).select(col("c")).collect()
+```
+
+If you cannot change the call sites, set 
`spark.sql.analyzer.strictDataFrameColumnResolution=false` to opt into the 
lenient name-based fallback. This is intended as an escape hatch and is not the 
default.
+
 # Summary
 
 | Aspect                | Spark Classic | Spark Connect                        
               |
@@ -428,5 +468,6 @@ This approach is significantly faster when dealing with a 
large number of column
 | **Schema access**     | Local         | Triggers RPC, and caches the schema 
on first access |
 | **Temporary views**   | Plan embedded | Name lookup                          
               |
 | **UDF serialization** | At creation   | At execution                         
               |
+| **DataFrame column references** | Eagerly resolved | Lazily resolved against 
plan id                     |
 
 The key difference is that Spark Connect defers analysis and name resolution 
to execution time.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(spark) branch branch-4.x updated: [SPARK-56846][DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas

Reply via email to