Xiao Ming Bao created SPARK-17602:

             Summary: PySpark - Performance Optimization Large Size of 
Broadcast Variable
                 Key: SPARK-17602
                 URL: https://issues.apache.org/jira/browse/SPARK-17602
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.0.0, 1.6.2, 1.5.2, 1.5.1
         Environment: Linux
            Reporter: Xiao Ming Bao
             Fix For: 2.0.0

Problem: currently, on the executor side, a broadcast variable is written to 
disk as a file, and each Python worker process reads it from local disk and 
deserializes it into a Python object before executing a task. When the 
broadcast variable is large, this read and deserialization takes a long time. 
When Python workers are NOT reused and the number of tasks is large, 
performance is very poor, because a worker must read and deserialize the 
variable again for every task. 
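
To make the cost pattern concrete, here is a minimal sketch of the behavior 
described above. It is not PySpark's actual worker code: the helper names 
write_broadcast and run_tasks_per_task_load are hypothetical, and a small 
pickled dict stands in for a large broadcast variable. It only shows that the 
number of read/deserialize passes grows with the number of tasks when workers 
are not reused.

```python
import os
import pickle
import tempfile


def write_broadcast(value):
    """Pickle `value` to a temp file (stand-in for the executor's on-disk broadcast file)."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def run_tasks_per_task_load(path, n_tasks):
    """Model a non-reused worker: every task re-reads and unpickles the file."""
    loads = 0
    results = []
    for task_id in range(n_tasks):
        with open(path, "rb") as f:
            table = pickle.load(f)  # full read + deserialization, once per task
        loads += 1
        results.append(table[task_id])
    return loads, results


path = write_broadcast({i: i * i for i in range(1000)})
loads, results = run_tasks_per_task_load(path, n_tasks=5)
os.remove(path)
```

With a real broadcast variable of hundreds of megabytes, each of those loads 
would be a full disk read plus unpickling, which is the overhead this issue 
targets.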

Outline of the solution:
 Transfer the broadcast variable to the Python daemon process via a file (or 
socket/mmap) and deserialize it into an object inside the daemon. Each Python 
worker process forked from the daemon then automatically has the deserialized 
object and can use it directly, thanks to the copy-on-write memory semantics 
of fork on Linux.
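
The idea can be sketched as follows. This is an illustration under stated 
assumptions, not the proposed patch: daemon_load and fork_worker are 
hypothetical names, a pipe is used only to return each worker's result to the 
parent, and the example requires a platform with os.fork (e.g. Linux). The 
parent plays the role of the daemon and deserializes the file exactly once; 
each forked child reads the inherited object without loading it again.

```python
import os
import pickle
import tempfile


def daemon_load(path):
    """Deserialize the broadcast file once, in the parent ("daemon") process."""
    with open(path, "rb") as f:
        return pickle.load(f)


def fork_worker(task_id, broadcast):
    """Fork a worker that uses the inherited object and sends its result back over a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child ("worker"): `broadcast` is already in memory via fork; no file read here.
        os.close(read_fd)
        try:
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(broadcast[task_id], out)
        finally:
            os._exit(0)  # leave immediately, skipping the parent's cleanup handlers
    # Parent ("daemon"): collect the worker's result and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as inp:
        result = pickle.load(inp)
    os.waitpid(pid, 0)
    return result


# Write a small stand-in for the broadcast file, then load it once in the daemon.
fd, path = tempfile.mkstemp(suffix=".pkl")
with os.fdopen(fd, "wb") as f:
    pickle.dump({i: i * i for i in range(1000)}, f)

broadcast = daemon_load(path)  # the only deserialization
os.remove(path)                # workers never touch the file

worker_results = [fork_worker(task_id, broadcast) for task_id in range(3)]
```

One caveat on the design: CPython updates reference counts when objects are 
accessed, so pages holding the shared object can still be copied in the 
child as they are touched. The deserialization cost is avoided in any case, 
but the memory sharing may be only partial.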

This message was sent by Atlassian JIRA

