Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19246#discussion_r139861001
--- Diff: python/pyspark/sql/types.py ---
@@ -410,6 +410,24 @@ def __init__(self, name, dataType, nullable=True, metadata=None):
         self.dataType = dataType
         self.nullable = nullable
         self.metadata = metadata or {}
+        self.needConversion = dataType.needConversion
+        self.toInternal = dataType.toInternal
+        self.fromInternal = dataType.fromInternal
+
+    def __getstate__(self):
+        """Return state values to be pickled."""
+        return (self.name, self.dataType, self.nullable, self.metadata)
+
+    def __setstate__(self, state):
+        """Restore state from the unpickled state values."""
+        name, dataType, nullable, metadata = state
+        self.name = name
+        self.dataType = dataType
+        self.nullable = nullable
+        self.metadata = metadata
+        self.needConversion = dataType.needConversion
--- End diff ---
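For context, a minimal standalone sketch of the pattern in the diff, using hypothetical `IntType`/`Field` classes rather than the actual PySpark ones: the data type's bound method is cached on the instance to skip a lookup per call, and `__getstate__`/`__setstate__` drop and re-attach the cached method, presumably because a bound method stored in an instance's `__dict__` is not picklable on Python 2:

```python
import pickle


class IntType(object):
    """Hypothetical stand-in for a pyspark DataType."""
    def needConversion(self):
        return False


class Field(object):
    """Hypothetical stand-in for StructField, mirroring the diff above."""
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType
        # Cache the bound method so field.needConversion() skips one
        # attribute lookup / dispatch per call.
        self.needConversion = dataType.needConversion

    def __getstate__(self):
        # Exclude the cached bound method: on Python 2 a bound method
        # kept in __dict__ cannot be pickled.
        return (self.name, self.dataType)

    def __setstate__(self, state):
        self.name, self.dataType = state
        # Re-attach the cached bound method after unpickling.
        self.needConversion = self.dataType.needConversion


f = pickle.loads(pickle.dumps(Field("a", IntType())))
print(f.needConversion())  # False, via the method bound to f.dataType
```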
My main concern is that it replaces the reference to the bound method from `StructType` with another method bound to a different instance. I don't particularly like monkey patching in Python because, IMHO, it confuses other developers, which could slow down improvement iterations from the community.
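To make the concern concrete, here is a small sketch with hypothetical `DataType`/`Container`/`TracingContainer` classes (not the actual PySpark code): once `__init__` stores the bound method as an instance attribute, attribute lookup never reaches the method defined on the class, nor any override in a subclass, and the method that actually runs is bound to a different object:

```python
class DataType(object):
    def needConversion(self):
        return False


class Container(object):
    def __init__(self, dataType):
        self.dataType = dataType
        # Instance attribute shadows Container.needConversion below.
        self.needConversion = dataType.needConversion

    def needConversion(self):
        # Never reached once __init__ has run, and neither is any
        # override defined in a subclass of Container.
        return self.dataType.needConversion()


class TracingContainer(Container):
    def needConversion(self):
        print("tracing")  # never printed
        return super(TracingContainer, self).needConversion()


c = TracingContainer(DataType())
print(c.needConversion())        # False, with no "tracing" output
print(c.needConversion.__self__) # bound to the DataType instance
```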
I just ran the Python profiler on top of the current master with this patch:
**Before**
```
============================================================
Profile of RDD<id=13>
============================================================
220158736 function calls (210148475 primitive calls) in 373.886 seconds
```
**After**
```
============================================================
Profile of RDD<id=13>
============================================================
210149857 function calls (200139596 primitive calls) in 377.577 seconds
```
It looks like the improvement is not very significant.
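For reference, this is roughly how such a profile can be collected with PySpark's built-in Python profiler (a sketch only; the job below is an assumption and not the actual workload behind the numbers above):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Enable per-RDD profiling of the Python worker side.
conf = SparkConf().set("spark.python.profile", "true")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# Any job that exercises toInternal/fromInternal, e.g. building and
# processing a DataFrame, will be profiled on the Python side.
df = spark.createDataFrame([(i, str(i)) for i in range(100000)], ["id", "s"])
df.rdd.map(lambda row: row.id).sum()

# Dump the accumulated cProfile stats, one section per profiled RDD.
sc.show_profiles()
```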
---