zhengruifeng commented on code in PR #53059:
URL: https://github.com/apache/spark/pull/53059#discussion_r2730798609


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1736,10 +1736,18 @@ def __getattr__(self, name: str) -> "Column":
                 errorClass="JVM_ATTRIBUTE_NOT_SUPPORTED", messageParameters={"attr_name": name}
             )
 
-        if name not in self.columns:
-            raise PySparkAttributeError(
-                errorClass="ATTRIBUTE_NOT_SUPPORTED", messageParameters={"attr_name": name}
-            )
+        # Only eagerly validate the column name when:
+        # 1, PYSPARK_VALIDATE_COLUMN_NAME_LEGACY is set 1; or
+        # 2, name starting with '__', because this is likely a python internal method and
+        # an AttributeError is expected, for example,
+        # pickle will internally invoke __getattr__("__setstate__"), returning a column

Review Comment:
   To resolve the pickle issue, we could add a dedicated `__setstate__` method (see 
https://github.com/apache/spark/pull/53059#discussion_r2726808232).
   
   But after an offline discussion with @gaogaotiantian, we will instead filter out 
names starting with `__`, because such a name most likely refers to a Python 
internal method, and returning a column for it might cause unexpected behavior, 
e.g. the `__setstate__` lookup that pickle performs.
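
For illustration, a minimal standalone sketch of the filtering idea, not the actual PySpark code. `Column` and `LazyFrame` are hypothetical stand-ins, and comparing the environment variable against the string `"1"` is an assumption based on the comment in the diff:

```python
import os
import pickle


class Column:
    """Hypothetical stand-in for an unresolved column reference."""

    def __init__(self, name):
        self.name = name


class LazyFrame:
    """Hypothetical stand-in for a DataFrame with lazy attribute access."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails.
        # Assumption: the legacy flag is enabled by the literal value "1".
        legacy = os.environ.get("PYSPARK_VALIDATE_COLUMN_NAME_LEGACY") == "1"
        if legacy or name.startswith("__"):
            # Read __dict__ directly: during unpickling the instance exists
            # before its state is restored, so "columns" may not be set yet.
            known = self.__dict__.get("columns", ())
            if name not in known:
                raise AttributeError(name)
        # Otherwise defer validation and hand back a lazy column reference.
        return Column(name)


if __name__ == "__main__":
    frame = LazyFrame(["a", "b"])
    # pickle probes for __setstate__; the filter turns that probe into an
    # AttributeError, so pickle falls back to restoring __dict__.
    restored = pickle.loads(pickle.dumps(frame))
    print(restored.columns)
    # With the legacy flag unset, an unknown name is not validated eagerly.
    print(type(frame.not_a_column).__name__)
```

Without the `__` filter, the `__setstate__` probe would receive a `Column`, which pickle would then try to call as a method.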
   




