[ 
https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220408#comment-14220408
 ] 

Apache Spark commented on SPARK-4517:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3394

> Improve memory efficiency for python broadcast
> ----------------------------------------------
>
>                 Key: SPARK-4517
>                 URL: https://issues.apache.org/jira/browse/SPARK-4517
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Davies Liu
>
> Currently, a Python broadcast (TorrentBroadcast) is held in multiple copies:
> 1) 1 copy in the Python driver
> 2) 1 copy on the driver's disk (serialized and compressed)
> 3) 2 copies in the JVM driver (one unserialized, one serialized and 
> compressed)
> 4) 2 copies in the executor (one unserialized, one serialized and 
> compressed)
> 5) 1 copy in each Python worker
> With HTTPBroadcast, items 3) and 4) differ:
> 3) 1 copy in driver memory, 1 copy on disk (serialized and compressed)
> 4) 1 copy in executor memory
> If the Python broadcast is 4G, it needs 12G in the driver and (8+4x)G in the 
> executor (x is the number of Python workers, usually the number of CPUs).
> The Python broadcast is already serialized and compressed in Python, so it 
> should not be serialized and compressed again in the JVM. The JVM also does not 
> need to know its content, so the data could be kept out of the JVM.
> So we should have a dedicated broadcast implementation for Python that stores 
> the serialized and compressed data on disk, transfers it to executors in a p2p 
> fashion (similar to TorrentBroadcast), and sends it to the Python workers.
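
The memory figures quoted above can be reproduced with a short calculation.
This is only a sketch of that arithmetic, not Spark code: the function name
and its parameters are hypothetical, and it assumes (as the 12G and 8+4x
figures imply) that every in-memory copy is counted at the full broadcast
size and that the on-disk copy is not counted.

```python
def broadcast_memory_gb(size_gb, python_workers):
    """Estimate memory use for a Python broadcast under TorrentBroadcast.

    Hypothetical helper: copy counts are taken from the issue description,
    and every copy is assumed to occupy the full broadcast size.
    """
    # Driver: 1 copy in the Python driver + 2 copies in the JVM driver.
    driver = size_gb * (1 + 2)
    # Executor: 2 copies in the JVM + 1 copy per Python worker.
    executor = size_gb * (2 + python_workers)
    return driver, executor


# Example: a 4G broadcast on an executor running 8 Python workers.
driver_gb, executor_gb = broadcast_memory_gb(4, 8)
print(driver_gb, executor_gb)
```

For x Python workers the executor term is 4 * (2 + x) = 8 + 4x, which matches
the figure in the description.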



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
