Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2740#issuecomment-58717116
  
    I tried a small experiment to test this out:
    
    ```python
    import os
    from pyspark import SparkContext, SparkConf
    
    conf = SparkConf().set("spark.executor.memory", "2g")
    sc = SparkContext(conf=conf)
    
    mb = 1000000  # ~1 MB per record
    def inflateDataSize(x):
        # replace each input element with ~1 MB of random bytes
        return bytearray(os.urandom(1 * mb))
    
    # cache 1000 of these ~1 MB records across 10 partitions, forcing the
    # Python worker to serialize them in batches
    sc.parallelize(range(1000), 10).map(inflateDataSize).cache().count()
    ```
    
    Prior to this patch, the Python worker's memory consumption would steadily grow while it attempted to batch together 100 MB of data per task, whereas now the memory usage remains constant because smaller batches are emitted more often when the objects are large.
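
    For anyone curious about the mechanism, here's a rough sketch of the size-aware batching idea (this is not the actual code in this patch; the `adaptive_batches` name and the thresholds are made up for illustration):
    
    ```python
    import itertools
    import pickle
    
    def adaptive_batches(records, target_bytes=1 << 16):
        # Start with one record per batch; grow the batch while serialized
        # output stays well under the target, shrink as soon as it overshoots.
        it = iter(records)
        batch = 1
        while True:
            chunk = list(itertools.islice(it, batch))
            if not chunk:
                return
            data = pickle.dumps(chunk)
            yield data
            if len(data) < target_bytes // 2:
                batch = min(batch * 2, 10000)  # small records: batch more
            elif len(data) > target_bytes:
                batch = max(batch // 2, 1)     # large records: batch less
    ```
    
    With the ~1 MB records from the experiment above, the batch size stays pinned at one record, so the worker never accumulates anywhere near 100 MB of pending output before emitting a batch.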
    
    Thanks for updating the docs. This looks good to me, so I'm going to merge it into master.

