This is an automated email from the ASF dual-hosted git repository.
zhengruifeng pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-4.x by this push:
new 80658263d809 [SPARK-56846][DOCS][CONNECT] Document DataFrame column
resolution behavior in spark-connect-gotchas
80658263d809 is described below
commit 80658263d80955cdc00e979644e2aee3645842e6
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Mon May 18 09:40:09 2026 +0800
[SPARK-56846][DOCS][CONNECT] Document DataFrame column resolution behavior
in spark-connect-gotchas
### What changes were proposed in this pull request?
Add a new gotcha section to `docs/spark-connect-gotchas.md` describing how
Spark Connect resolves DataFrame column references (`df["c"]`) via plan-id
tagging, and how this diverges from Spark Classic once a column has been
shadowed by `withColumn` or `select`.
The new section covers:
- The motivating example: `df.withColumn("c",
sf.col("c").cast("string")).select(df["c"])` fails on both Spark Classic
(`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`) and Spark Connect
(`CANNOT_RESOLVE_DATAFRAME_COLUMN`).
- How the SQL config `spark.sql.analyzer.strictDataFrameColumnResolution`
(added in Spark 4.2.0 by apache/spark#55531, default `true`) controls Spark
Connect's behavior: under strict resolution the query fails; with
`strictDataFrameColumnResolution=false`, the analyzer still tries plan-id-based
resolution first and only falls back to name-based resolution when that fails,
causing the same query to succeed.
- A "Recommended way" subsection: switch to `sf.col("c")` (an untagged name
reference) instead of `df["c"]` (a tagged reference to `df`'s original column)
when the column has been shadowed. Includes both Python and Scala examples.
- The escape-hatch note:
`spark.sql.analyzer.strictDataFrameColumnResolution=false` for users who cannot
change the call sites.
Also adds a "DataFrame column references" row to the summary table at the
end of the document (Eagerly resolved vs Lazily resolved against plan id),
consistent with the eager/lazy framing used throughout the file.
### Why are the changes needed?
The plan-id-based column resolution path is a Spark Connect-specific
contract that is not documented anywhere user-facing. Users migrating workloads
to Spark Connect have encountered surprises when patterns that previously
"worked" stop resolving, with an error class
(`CANNOT_RESOLVE_DATAFRAME_COLUMN`) and a config
(`strictDataFrameColumnResolution`) whose connection to their code is not
obvious. This adds explicit guidance and a code-level mitigation alongside the
other Connect-vs-Cl [...]
### Does this PR introduce _any_ user-facing change?
No. Documentation-only change.
### How was this patch tested?
Documentation-only change; no automated tests. Verified the markdown
renders correctly and is consistent with the existing four-gotcha layout in
`docs/spark-connect-gotchas.md`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), claude-opus-4-7
Closes #55848 from zhengruifeng/SPARK-doc-col-diff.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
(cherry picked from commit 76fc01806e91b5ddaae3ae89068a83b50ebdbde2)
Signed-off-by: Ruifeng Zheng <[email protected]>
---
docs/spark-connect-gotchas.md | 43 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)
diff --git a/docs/spark-connect-gotchas.md b/docs/spark-connect-gotchas.md
index f1973133d335..135fc22312d8 100644
--- a/docs/spark-connect-gotchas.md
+++ b/docs/spark-connect-gotchas.md
@@ -73,7 +73,7 @@ Unlike query execution, Spark Classic and Spark Connect
differ in when schema an
# Common Gotchas (with Mitigations)
-If you are not careful about the difference between lazy vs. eager analysis,
there are four key gotchas to be aware of: 1) overwriting temporary view names,
2) capturing external variables in UDFs, 3) delayed error detection, and 4)
excessive schema access on new DataFrames.
+If you are not careful about the difference between lazy vs. eager analysis,
there are several key gotchas to be aware of.
## 1. Reusing temporary view names
@@ -418,6 +418,46 @@ println(structColumnFields)
This approach is significantly faster when dealing with a large number of
columns because it avoids creating and analyzing numerous DataFrames.
+## 5. DataFrame column references after column shadowing
+
+In Spark Connect, a DataFrame column reference such as `df["c"]` is tagged
with the plan id of `df`. At analysis time the server resolves the reference by
looking for the tagged ancestor in the plan and pulling the matching attribute
from it. Spark Classic does not use plan ids; it resolves column references
against the immediate child's output by attribute id and name.
+
+The two resolution strategies diverge once a column has been shadowed by
another operator that produces an attribute with the same name:
+
+```python
+import pyspark.sql.functions as sf
+
+df = spark.sql("SELECT 1 AS c")
+df.withColumn("c", sf.col("c").cast("string")).select(df["c"]).collect()
+```
+
+`withColumn("c", ...)` does not mutate `df`; it returns a new DataFrame whose
`c` is a new attribute that hides the original. The trailing `df["c"]` still
refers to the *original* `c` attribute, which is no longer in the projection
list.
+
+* **Spark Classic** has always rejected this query at analysis time with
`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`, because the
original attribute is not present in the operator's child output.
+* **Spark Connect** rejects it with `CANNOT_RESOLVE_DATAFRAME_COLUMN` by
default. The plan-id-tagged reference does not match any attribute in the
current plan. But when the SQL config
`spark.sql.analyzer.strictDataFrameColumnResolution` (added in Spark 4.2.0,
default `true`) is set to `false`, the analyzer still tries plan-id-based
resolution first, and only when that fails does it fall back to name-based
resolution: the tagged `df["c"]` is then resolved by name against the projected
`c [...]
+
+### Recommended way
+
+If you hit any of the confusing failures mentioned above, it is recommended to
switch to `col` from `pyspark.sql.functions` first. `col("c")` is an untagged
name reference that resolves against the most recent projection or
`withColumn`, rather than `df["c"]` which is a tagged reference to `df`'s
original column:
+
+```python
+import pyspark.sql.functions as sf
+
+df = spark.sql("SELECT 1 AS c")
+df.withColumn("c", sf.col("c").cast("string")).select(sf.col("c")).collect()
+```
+
+**Scala example:**
+
+```scala
+import org.apache.spark.sql.functions._
+
+val df = spark.sql("SELECT 1 AS c")
+df.withColumn("c", col("c").cast("string")).select(col("c")).collect()
+```
+
+If you cannot change the call sites, set
`spark.sql.analyzer.strictDataFrameColumnResolution=false` to opt into the
lenient name-based fallback. This is intended as an escape hatch and is not the
default.
+
# Summary
| Aspect | Spark Classic | Spark Connect
|
@@ -428,5 +468,6 @@ This approach is significantly faster when dealing with a
large number of column
| **Schema access** | Local | Triggers RPC, and caches the schema
on first access |
| **Temporary views** | Plan embedded | Name lookup
|
| **UDF serialization** | At creation | At execution
|
+| **DataFrame column references** | Eagerly resolved | Lazily resolved against
plan id |
The key difference is that Spark Connect defers analysis and name resolution
to execution time.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]