Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19246#discussion_r139861001
--- Diff: python/pyspark/sql/types.py ---
@@ -410,6 +410,24 @@ def __init__(self, name, dataType, nullable=True, metadata=None):
         self.dataType = dataType
         self.nullable = nullable
         self.metadata = metadata or {}
+        self.needConversion = dataType.needConversion
+        self.toInternal = dataType.toInternal
+        self.fromInternal = dataType.fromInternal
+
+    def __getstate__(self):
+        """Return state values to be pickled."""
+        return (self.name, self.dataType, self.nullable, self.metadata)
+
+    def __setstate__(self, state):
+        """Restore state from the unpickled state values."""
+        name, dataType, nullable, metadata = state
+        self.name = name
+        self.dataType = dataType
+        self.nullable = nullable
+        self.metadata = metadata
+        self.needConversion = dataType.needConversion
--- End diff ---
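For context, a minimal standalone sketch of the pattern in the diff, using hypothetical `IntType`/`Field` classes rather than the actual PySpark ones: the data type's bound method is cached on the instance to skip a lookup per call, and `__getstate__`/`__setstate__` drop and re-attach the cached method, presumably because a bound method stored in an instance's `__dict__` is not picklable on Python 2:

```python
import pickle


class IntType(object):
    """Hypothetical stand-in for a pyspark DataType."""
    def needConversion(self):
        return False


class Field(object):
    """Hypothetical stand-in for StructField, mirroring the diff above."""
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType
        # Cache the bound method so field.needConversion() skips one
        # attribute lookup / dispatch per call.
        self.needConversion = dataType.needConversion

    def __getstate__(self):
        # Exclude the cached bound method: on Python 2 a bound method
        # kept in __dict__ cannot be pickled.
        return (self.name, self.dataType)

    def __setstate__(self, state):
        self.name, self.dataType = state
        # Re-attach the cached bound method after unpickling.
        self.needConversion = self.dataType.needConversion


f = pickle.loads(pickle.dumps(Field("a", IntType())))
print(f.needConversion())  # False, via the method bound to f.dataType
```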
My main concern is that it replaces the reference to the bound method from `StructType` with another method bound to a different instance. I don't particularly like monkey patching in Python because, IMHO, it confuses other developers, which could slow down improvement iterations from the community.
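To make the concern concrete, here is a small sketch with hypothetical `DataType`/`Container`/`TracingContainer` classes (not the actual PySpark code): once `__init__` stores the bound method as an instance attribute, attribute lookup never reaches the method defined on the class, nor any override in a subclass, and the method that actually runs is bound to a different object:

```python
class DataType(object):
    def needConversion(self):
        return False


class Container(object):
    def __init__(self, dataType):
        self.dataType = dataType
        # Instance attribute shadows Container.needConversion below.
        self.needConversion = dataType.needConversion

    def needConversion(self):
        # Never reached once __init__ has run, and neither is any
        # override defined in a subclass of Container.
        return self.dataType.needConversion()


class TracingContainer(Container):
    def needConversion(self):
        print("tracing")  # never printed
        return super(TracingContainer, self).needConversion()


c = TracingContainer(DataType())
print(c.needConversion())        # False, with no "tracing" output
print(c.needConversion.__self__) # bound to the DataType instance
```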
I just ran the Python profiler on top of the current master with this patch:
**Before**
```
============================================================
Profile of RDD<id=13>
============================================================
220158736 function calls (210148475 primitive calls) in 373.886 seconds
```
**After**
```
============================================================
Profile of RDD<id=13>
============================================================
210149857 function calls (200139596 primitive calls) in 377.577 seconds
```
It looks like the improvement is not very significant.
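For reference, this is roughly how such a profile can be collected with PySpark's built-in Python profiler (a sketch only; the job below is an assumption and not the actual workload behind the numbers above):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Enable per-RDD profiling of the Python worker side.
conf = SparkConf().set("spark.python.profile", "true")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# Any job that exercises toInternal/fromInternal, e.g. building and
# processing a DataFrame, will be profiled on the Python side.
df = spark.createDataFrame([(i, str(i)) for i in range(100000)], ["id", "s"])
df.rdd.map(lambda row: row.id).sum()

# Dump the accumulated cProfile stats, one section per profiled RDD.
sc.show_profiles()
```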
---