This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 43529889f240 [SPARK-55179][PYTHON][CONNECT] Skip eager column name validation in `df.col_name`
43529889f240 is described below
commit 43529889f24011c3df1d308e8b673967818b7c33
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Wed Jan 28 11:28:41 2026 +0800
[SPARK-55179][PYTHON][CONNECT] Skip eager column name validation in `df.col_name`
### What changes were proposed in this pull request?
Follow-up of https://github.com/apache/spark/pull/51400: also skip eager column name validation in `df.col_name`.
### Why are the changes needed?
To be consistent with `df["col_name"]`.
### Does this PR introduce _any_ user-facing change?
Yes, a `df.bad_column` failure is now delayed until analysis or execution.
before:
```
In [1]: df = spark.range(10)
In [2]: df.abc
---------------------------------------------------------------------------
...
PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] Attribute `abc` is not supported.
In [3]: df._abc
---------------------------------------------------------------------------
...
PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] Attribute `_abc` is not supported.
In [4]: df.__abc
---------------------------------------------------------------------------
...
PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] Attribute `__abc` is not supported.
```
after:
```
In [1]: df = spark.range(10)
In [2]: df.abc
Out[2]: Column<'abc'>
In [3]: df._abc
Out[3]: Column<'_abc'>
In [4]: df.__abc
---------------------------------------------------------------------------
...
PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] Attribute `__abc` is not supported.
```
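For illustration, a minimal sketch of where the deferred failure now surfaces, assuming an active Spark Connect session bound to `spark` (the exact error class raised at analysis time is not pinned down here):
```
df = spark.range(10)

col = df.abc                  # returned immediately as Column<'abc'>, no validation
try:
    df.select(col).collect()  # the missing column is only detected here
except Exception as e:        # typically an analysis-time error from the server
    print(type(e).__name__)
```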
### How was this patch tested?
CI
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #53059 from zhengruifeng/lazy_df_dot_col2.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
.../source/migration_guide/pyspark_upgrade.rst | 3 ++-
python/pyspark/sql/connect/dataframe.py | 18 ++++++++++++----
.../sql/tests/connect/test_connect_basic.py | 24 +++++++++++++++++++++-
3 files changed, 39 insertions(+), 6 deletions(-)
diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst
index 2a6d9c55d2ff..dba2a6266939 100644
--- a/python/docs/source/migration_guide/pyspark_upgrade.rst
+++ b/python/docs/source/migration_guide/pyspark_upgrade.rst
@@ -22,6 +22,7 @@ Upgrading PySpark
Upgrading from PySpark 4.1 to 4.2
---------------------------------
* In Spark 4.2, the minimum supported version for PyArrow has been raised from 15.0.0 to 18.0.0 in PySpark.
+* In Spark 4.2, ``DataFrame.__getattr__`` on Spark Connect Python Client no longer eagerly validate the column name. To restore the legacy behavior, set ``PYSPARK_VALIDATE_COLUMN_NAME_LEGACY`` environment variable to ``1``.
* In Spark 4.2, columnar data exchange between PySpark and the JVM uses Apache Arrow by default. The configuration ``spark.sql.execution.arrow.pyspark.enabled`` now defaults to true. To restore the legacy (non-Arrow) row-based data exchange, set ``spark.sql.execution.arrow.pyspark.enabled`` to ``false``.
* In Spark 4.2, regular Python UDFs are Arrow-optimized by default. The configuration ``spark.sql.execution.pythonUDF.arrow.enabled`` now defaults to true. To restore the legacy behavior for Python UDF execution, set ``spark.sql.execution.pythonUDF.arrow.enabled`` to ``false``.
* In Spark 4.2, regular Python UDTFs are Arrow-optimized by default. The configuration ``spark.sql.execution.pythonUDTF.arrow.enabled`` now defaults to true. To restore the legacy behavior for Python UDTF execution, set ``spark.sql.execution.pythonUDTF.arrow.enabled`` to ``false``.
@@ -32,7 +33,7 @@ Upgrading from PySpark 4.0 to 4.1
* In Spark 4.1, Python 3.9 support was dropped in PySpark.
* In Spark 4.1, the minimum supported version for PyArrow has been raised from 11.0.0 to 15.0.0 in PySpark.
* In Spark 4.1, the minimum supported version for Pandas has been raised from 2.0.0 to 2.2.0 in PySpark.
-* In Spark 4.1, ``DataFrame['name']`` on Spark Connect Python Client no longer eagerly validate the column name. To restore the legacy behavior, set ``PYSPARK_VALIDATE_COLUMN_NAME_LEGACY`` environment variable to ``1``.
+* In Spark 4.1, ``DataFrame.__getitem__`` on Spark Connect Python Client no longer eagerly validate the column name. To restore the legacy behavior, set ``PYSPARK_VALIDATE_COLUMN_NAME_LEGACY`` environment variable to ``1``.
* In Spark 4.1, Arrow-optimized Python UDF supports UDT input / output instead of falling back to the regular UDF. To restore the legacy behavior, set ``spark.sql.execution.pythonUDF.arrow.legacy.fallbackOnUDT`` to ``true``.
* In Spark 4.1, unnecessary conversion to pandas instances is removed when ``spark.sql.execution.pythonUDF.arrow.enabled`` is enabled. As a result, the type coercion changes when the produced output has a schema different from the specified schema. To restore the previous behavior, enable ``spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled``.
* In Spark 4.1, unnecessary conversion to pandas instances is removed when ``spark.sql.execution.pythonUDTF.arrow.enabled`` is enabled. As a result, the type coercion changes when the produced output has a schema different from the specified schema. To restore the previous behavior, enable ``spark.sql.legacy.execution.pythonUDTF.pandas.conversion.enabled``.
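As a usage sketch of the migration toggle above (assuming an active Spark Connect session bound to `spark`), restoring the legacy eager validation only requires setting the environment variable in the client process, since it is read at attribute-access time:
```
import os

# Opt back into eager column-name validation on the Connect client.
os.environ["PYSPARK_VALIDATE_COLUMN_NAME_LEGACY"] = "1"

df = spark.range(10)
try:
    df.abc
except AttributeError as e:   # PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] ...
    print(e)
```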
diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py
index fa3b1a842090..0df13c1020d7 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1736,10 +1736,20 @@ class DataFrame(ParentDataFrame):
errorClass="JVM_ATTRIBUTE_NOT_SUPPORTED",
messageParameters={"attr_name": name}
)
- if name not in self.columns:
- raise PySparkAttributeError(
- errorClass="ATTRIBUTE_NOT_SUPPORTED",
messageParameters={"attr_name": name}
- )
+ # Only eagerly validate the column name when:
+ # 1, PYSPARK_VALIDATE_COLUMN_NAME_LEGACY is set 1; or
+ # 2, the name starts with '__', because it is likely a python internal
method and
+ # an AttributeError might be expected to check whether the attribute
exists.
+ # For example:
+ # pickle/cloudpickle need to check whether method '__setstate__' is
defined or not,
+ # and it internally invokes __getattr__("__setstate__").
+ # Returning a dataframe column self._col("__setstate__") in this case
will break
+ # the serialization of connect dataframe and features built atop it
(e.g. FEB).
+ if os.environ.get("PYSPARK_VALIDATE_COLUMN_NAME_LEGACY") == "1" or
name.startswith("__"):
+ if name not in self.columns:
+ raise PySparkAttributeError(
+ errorClass="ATTRIBUTE_NOT_SUPPORTED",
messageParameters={"attr_name": name}
+ )
return self._col(name)
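The comment block above captures the subtle part. As a plain-Python sketch (not PySpark code) of why dunder names must keep raising `AttributeError`, consider how pickle locates protocol hooks such as `__getnewargs_ex__` and `__setstate__` via `getattr`:
```
import pickle

class LazyBroken:
    # A __getattr__ that answers every name confuses pickle: protocol probes
    # such as __getnewargs_ex__ / __setstate__ get a non-callable placeholder.
    def __getattr__(self, name):
        return f"column<{name}>"

class LazySafe:
    # Raising AttributeError for dunder names keeps pickle's probes working.
    def __getattr__(self, name):
        if name.startswith("__"):
            raise AttributeError(name)
        return f"column<{name}>"

print(LazySafe().some_col)              # column<some_col>
pickle.loads(pickle.dumps(LazySafe()))  # round-trips fine
try:
    pickle.loads(pickle.dumps(LazyBroken()))
except Exception as e:                  # fails when a probe result is called
    print(type(e).__name__)
```
The `LazySafe` shape mirrors the `name.startswith("__")` guard added above.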
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py
index de470b8ef26f..5e2a5bf97796 100755
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -143,6 +143,16 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
        cdf2 = loads(data)
        self.assertEqual(cdf.collect(), cdf2.collect())

+    def test_serialization_II(self):
+        from pyspark.serializers import CPickleSerializer
+
+        pickle_ser = CPickleSerializer()
+
+        cdf = self.connect.range(10)
+        data = pickle_ser.dumps(cdf)
+        cdf2 = pickle_ser.loads(data)
+        self.assertEqual(cdf.collect(), cdf2.collect())
+
    def test_window_spec_serialization(self):
        from pyspark.sql.connect.window import Window
        from pyspark.serializers import CPickleSerializer
@@ -164,7 +174,19 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
        self.assertEqual(type(sdf._simple_extension), type(cdf._simple_extension))
        self.assertTrue(hasattr(cdf, "_simple_extension"))
-        self.assertFalse(hasattr(cdf, "_simple_extension_does_not_exsit"))
+
+        with self.temp_env({"PYSPARK_VALIDATE_COLUMN_NAME_LEGACY": None}):
+            self.assertTrue(hasattr(cdf, "_simple_extension_does_not_exsit"))
+            self.assertIsInstance(getattr(cdf, "_simple_extension_does_not_exsit"), Column)
+            self.assertIsInstance(cdf._simple_extension_does_not_exsit, Column)
+
+            # For name starting with '__', still validate it eagerly
+            self.assertFalse(hasattr(cdf, "__simple_extension_does_not_exsit"))
+
+        with self.temp_env({"PYSPARK_VALIDATE_COLUMN_NAME_LEGACY": "1"}):
+            self.assertFalse(hasattr(cdf, "_simple_extension_does_not_exsit"))
+
+            self.assertFalse(hasattr(cdf, "__simple_extension_does_not_exsit"))

    def test_df_get_item(self):
        # SPARK-41779: test __getitem__
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]