This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 43529889f240 [SPARK-55179][PYTHON][CONNECT] Skip eager column name validation in `df.col_name`
43529889f240 is described below
commit 43529889f24011c3df1d308e8b673967818b7c33
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Wed Jan 28 11:28:41 2026 +0800
[SPARK-55179][PYTHON][CONNECT] Skip eager column name validation in `df.col_name`
### What changes were proposed in this pull request?
Follow-up of https://github.com/apache/spark/pull/51400: also skip eager column name validation in `df.col_name`.
### Why are the changes needed?
To be consistent with `df["col_name"]`.
### Does this PR introduce _any_ user-facing change?
Yes, a `df.bad_column` failure is now delayed until analysis or execution.
before:
```
In [1]: df = spark.range(10)
In [2]: df.abc
---------------------------------------------------------------------------
...
PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] Attribute `abc` is not supported.
In [3]: df._abc
---------------------------------------------------------------------------
...
PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] Attribute `_abc` is not supported.
In [4]: df.__abc
---------------------------------------------------------------------------
...
PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] Attribute `__abc` is not supported.
```
after:
```
In [1]: df = spark.range(10)
In [2]: df.abc
Out[2]: Column<'abc'>
In [3]: df._abc
Out[3]: Column<'_abc'>
In [4]: df.__abc
---------------------------------------------------------------------------
...
PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] Attribute `__abc` is not supported.
```
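For illustration, a minimal sketch of where the deferred failure now surfaces, assuming an active Spark Connect session bound to `spark` (the exact error class raised at analysis time is not pinned down here):
```
df = spark.range(10)

col = df.abc                  # returned immediately as Column<'abc'>, no validation
try:
    df.select(col).collect()  # the missing column is only detected here
except Exception as e:        # typically an analysis-time error from the server
    print(type(e).__name__)
```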
### How was this patch tested?
CI
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #53059 from zhengruifeng/lazy_df_dot_col2.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
.../source/migration_guide/pyspark_upgrade.rst | 3 ++-
python/pyspark/sql/connect/dataframe.py | 18 ++++++++++++----
.../sql/tests/connect/test_connect_basic.py | 24 +++++++++++++++++++++-
3 files changed, 39 insertions(+), 6 deletions(-)
diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst
index 2a6d9c55d2ff..dba2a6266939 100644
--- a/python/docs/source/migration_guide/pyspark_upgrade.rst
+++ b/python/docs/source/migration_guide/pyspark_upgrade.rst
@@ -22,6 +22,7 @@ Upgrading PySpark
Upgrading from PySpark 4.1 to 4.2
---------------------------------
* In Spark 4.2, the minimum supported version for PyArrow has been raised from 15.0.0 to 18.0.0 in PySpark.
+* In Spark 4.2, ``DataFrame.__getattr__`` on Spark Connect Python Client no longer eagerly validate the column name. To restore the legacy behavior, set ``PYSPARK_VALIDATE_COLUMN_NAME_LEGACY`` environment variable to ``1``.
* In Spark 4.2, columnar data exchange between PySpark and the JVM uses Apache Arrow by default. The configuration ``spark.sql.execution.arrow.pyspark.enabled`` now defaults to true. To restore the legacy (non-Arrow) row-based data exchange, set ``spark.sql.execution.arrow.pyspark.enabled`` to ``false``.
* In Spark 4.2, regular Python UDFs are Arrow-optimized by default. The configuration ``spark.sql.execution.pythonUDF.arrow.enabled`` now defaults to true. To restore the legacy behavior for Python UDF execution, set ``spark.sql.execution.pythonUDF.arrow.enabled`` to ``false``.
* In Spark 4.2, regular Python UDTFs are Arrow-optimized by default. The configuration ``spark.sql.execution.pythonUDTF.arrow.enabled`` now defaults to true. To restore the legacy behavior for Python UDTF execution, set ``spark.sql.execution.pythonUDTF.arrow.enabled`` to ``false``.
@@ -32,7 +33,7 @@ Upgrading from PySpark 4.0 to 4.1
* In Spark 4.1, Python 3.9 support was dropped in PySpark.
* In Spark 4.1, the minimum supported version for PyArrow has been raised from 11.0.0 to 15.0.0 in PySpark.
* In Spark 4.1, the minimum supported version for Pandas has been raised from 2.0.0 to 2.2.0 in PySpark.
-* In Spark 4.1, ``DataFrame['name']`` on Spark Connect Python Client no longer eagerly validate the column name. To restore the legacy behavior, set ``PYSPARK_VALIDATE_COLUMN_NAME_LEGACY`` environment variable to ``1``.
+* In Spark 4.1, ``DataFrame.__getitem__`` on Spark Connect Python Client no longer eagerly validate the column name. To restore the legacy behavior, set ``PYSPARK_VALIDATE_COLUMN_NAME_LEGACY`` environment variable to ``1``.
* In Spark 4.1, Arrow-optimized Python UDF supports UDT input / output instead of falling back to the regular UDF. To restore the legacy behavior, set ``spark.sql.execution.pythonUDF.arrow.legacy.fallbackOnUDT`` to ``true``.
* In Spark 4.1, unnecessary conversion to pandas instances is removed when ``spark.sql.execution.pythonUDF.arrow.enabled`` is enabled. As a result, the type coercion changes when the produced output has a schema different from the specified schema. To restore the previous behavior, enable ``spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled``.
* In Spark 4.1, unnecessary conversion to pandas instances is removed when ``spark.sql.execution.pythonUDTF.arrow.enabled`` is enabled. As a result, the type coercion changes when the produced output has a schema different from the specified schema. To restore the previous behavior, enable ``spark.sql.legacy.execution.pythonUDTF.pandas.conversion.enabled``.
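As a usage sketch of the migration toggle above (assuming an active Spark Connect session bound to `spark`), restoring the legacy eager validation only requires setting the environment variable in the client process, since it is read at attribute-access time:
```
import os

# Opt back into eager column-name validation on the Connect client.
os.environ["PYSPARK_VALIDATE_COLUMN_NAME_LEGACY"] = "1"

df = spark.range(10)
try:
    df.abc
except AttributeError as e:   # PySparkAttributeError: [ATTRIBUTE_NOT_SUPPORTED] ...
    print(e)
```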
diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py
index fa3b1a842090..0df13c1020d7 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1736,10 +1736,20 @@ class DataFrame(ParentDataFrame):
errorClass="JVM_ATTRIBUTE_NOT_SUPPORTED",
messageParameters={"attr_name": name}
)
- if name not in self.columns:
- raise PySparkAttributeError(
- errorClass="ATTRIBUTE_NOT_SUPPORTED",
messageParameters={"attr_name": name}
- )
+ # Only eagerly validate the column name when:
+ # 1, PYSPARK_VALIDATE_COLUMN_NAME_LEGACY is set 1; or
+ # 2, the name starts with '__', because it is likely a python internal
method and
+ # an AttributeError might be expected to check whether the attribute
exists.
+ # For example:
+ # pickle/cloudpickle need to check whether method '__setstate__' is
defined or not,
+ # and it internally invokes __getattr__("__setstate__").
+ # Returning a dataframe column self._col("__setstate__") in this case
will break
+ # the serialization of connect dataframe and features built atop it
(e.g. FEB).
+ if os.environ.get("PYSPARK_VALIDATE_COLUMN_NAME_LEGACY") == "1" or
name.startswith("__"):
+ if name not in self.columns:
+ raise PySparkAttributeError(
+ errorClass="ATTRIBUTE_NOT_SUPPORTED",
messageParameters={"attr_name": name}
+ )
return self._col(name)
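The comment block above captures the subtle part. As a plain-Python sketch (not PySpark code) of why dunder names must keep raising `AttributeError`, consider how pickle locates protocol hooks such as `__getnewargs_ex__` and `__setstate__` via `getattr`:
```
import pickle

class LazyBroken:
    # A __getattr__ that answers every name confuses pickle: protocol probes
    # such as __getnewargs_ex__ / __setstate__ get a non-callable placeholder.
    def __getattr__(self, name):
        return f"column<{name}>"

class LazySafe:
    # Raising AttributeError for dunder names keeps pickle's probes working.
    def __getattr__(self, name):
        if name.startswith("__"):
            raise AttributeError(name)
        return f"column<{name}>"

print(LazySafe().some_col)              # column<some_col>
pickle.loads(pickle.dumps(LazySafe()))  # round-trips fine
try:
    pickle.loads(pickle.dumps(LazyBroken()))
except Exception as e:                  # fails when a probe result is called
    print(type(e).__name__)
```
The `LazySafe` shape mirrors the `name.startswith("__")` guard added above.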
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py
index de470b8ef26f..5e2a5bf97796 100755
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -143,6 +143,16 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
        cdf2 = loads(data)
        self.assertEqual(cdf.collect(), cdf2.collect())

+    def test_serialization_II(self):
+        from pyspark.serializers import CPickleSerializer
+
+        pickle_ser = CPickleSerializer()
+
+        cdf = self.connect.range(10)
+        data = pickle_ser.dumps(cdf)
+        cdf2 = pickle_ser.loads(data)
+        self.assertEqual(cdf.collect(), cdf2.collect())
+
    def test_window_spec_serialization(self):
        from pyspark.sql.connect.window import Window
        from pyspark.serializers import CPickleSerializer
@@ -164,7 +174,19 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
        self.assertEqual(type(sdf._simple_extension), type(cdf._simple_extension))
        self.assertTrue(hasattr(cdf, "_simple_extension"))
-        self.assertFalse(hasattr(cdf, "_simple_extension_does_not_exsit"))
+
+        with self.temp_env({"PYSPARK_VALIDATE_COLUMN_NAME_LEGACY": None}):
+            self.assertTrue(hasattr(cdf, "_simple_extension_does_not_exsit"))
+            self.assertIsInstance(getattr(cdf, "_simple_extension_does_not_exsit"), Column)
+            self.assertIsInstance(cdf._simple_extension_does_not_exsit, Column)
+
+            # For name starting with '__', still validate it eagerly
+            self.assertFalse(hasattr(cdf, "__simple_extension_does_not_exsit"))
+
+        with self.temp_env({"PYSPARK_VALIDATE_COLUMN_NAME_LEGACY": "1"}):
+            self.assertFalse(hasattr(cdf, "_simple_extension_does_not_exsit"))
+
+            self.assertFalse(hasattr(cdf, "__simple_extension_does_not_exsit"))

    def test_df_get_item(self):
        # SPARK-41779: test __getitem__
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]