[
https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222275#comment-14222275
]
Apache Spark commented on SPARK-4517:
-------------------------------------
User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3417
> Improve memory efficiency for python broadcast
> ----------------------------------------------
>
> Key: SPARK-4517
> URL: https://issues.apache.org/jira/browse/SPARK-4517
> Project: Spark
> Issue Type: Improvement
> Reporter: Davies Liu
>
> Currently, a Python broadcast (with TorrentBroadcast) is held in multiple
> copies:
> 1) 1 copy in python driver
> 2) 1 copy in disks of driver (serialized and compressed)
> 3) 2 copies in JVM driver (one is unserialized, one is serialized and
> compressed)
> 4) 2 copies in executor (one is unserialized, one is serialized and
> compressed)
> 5) one copy in each python worker.
> Some of these differ with HTTPBroadcast:
> 3) one copy in memory of driver, one copy on disk (serialized and compressed)
> 4) one copy in memory of executor
> If the Python broadcast is 4G, then it needs 12G in the driver and (8 + 4x)G in
> the executor (x is the number of Python workers, usually the number of CPUs).
> The Python broadcast is already serialized and compressed in Python, so it
> should not be serialized and compressed again in the JVM. Also, the JVM does
> not need to know its content, so the data could be kept outside the JVM.
> So we should have a specialized broadcast implementation for Python that
> stores the serialized and compressed data on disk, transfers it to executors
> peer-to-peer (similar to TorrentBroadcast), and sends it to the Python workers.
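
The memory figures quoted above can be reproduced with a small calculation. This is only a sketch of the arithmetic in the issue description, not code from Spark or from the pull request. The function name `broadcast_memory_gb` and its parameters are hypothetical, and it assumes, as the issue's own 12G figure does, that every in-memory copy is counted at the full broadcast size and that on-disk copies are excluded.

```python
def broadcast_memory_gb(size_gb, python_workers, implementation="torrent"):
    """Estimate in-memory usage (in GB) from the copy counts listed in the issue.

    Hypothetical helper for illustration only; not part of Spark.
    Assumes each copy costs the full broadcast size and ignores on-disk copies.
    """
    if implementation == "torrent":
        jvm_driver_copies = 2    # item 3: unserialized + serialized/compressed
        jvm_executor_copies = 2  # item 4: unserialized + serialized/compressed
    elif implementation == "http":
        jvm_driver_copies = 1    # item 3 (HTTP): one in memory, the other is on disk
        jvm_executor_copies = 1  # item 4 (HTTP): one copy in executor memory
    else:
        raise ValueError("unknown implementation: %r" % implementation)

    # Item 1: one copy always lives in the Python driver process.
    driver = (1 + jvm_driver_copies) * size_gb
    # Item 5: one copy per Python worker, on top of the JVM executor copies.
    executor = (jvm_executor_copies + python_workers) * size_gb
    return driver, executor


if __name__ == "__main__":
    # 4G broadcast, 8 Python workers, TorrentBroadcast: 12G driver, 8 + 4*8 = 40G executor
    print(broadcast_memory_gb(4, 8))
    # Same broadcast with the HTTPBroadcast copy counts
    print(broadcast_memory_gb(4, 8, implementation="http"))
```

With x = 8 workers the TorrentBroadcast case gives 12G in the driver and 40G in the executor, matching the 12G and (8 + 4x)G figures in the description.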