Sean Owen updated SPARK-17602:
                Flags:   (was: Patch)
    Affects Version/s:     (was: 1.5.2)
                           (was: 1.5.1)
     Target Version/s:   (was: 2.0.1)
               Labels:   (was: performance)
        Fix Version/s:     (was: 2.0.0)

> PySpark - Performance Optimization Large Size of Broadcast Variable
> -------------------------------------------------------------------
>                 Key: SPARK-17602
>                 URL: https://issues.apache.org/jira/browse/SPARK-17602
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Linux
>            Reporter: Xiao Ming Bao
>         Attachments: PySpark – Performance Optimization for Large Size of 
> Broadcast variable.pdf
>   Original Estimate: 120h
>  Remaining Estimate: 120h
> Problem: currently, on the executor side the broadcast variable is written 
> to disk as a file, and each Python worker process reads it from local disk 
> and deserializes it into a Python object before executing a task. When the 
> broadcast variable is large, this read/deserialization takes a lot of time. 
> And when the Python worker is NOT reused and the number of tasks is large, 
> performance becomes very bad, since a Python worker must read and 
> deserialize the variable for every task. 
> Brief of the solution:
>  Transfer the broadcast variable to the daemon Python process via a file 
> (or socket/mmap) and deserialize it into an object in the daemon process. 
> After a worker Python process is forked by the daemon, the worker 
> automatically has the deserialized object and can use it directly, thanks 
> to Linux's copy-on-write memory semantics for forked processes.
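
A minimal sketch of the fork/copy-on-write idea described above (not the actual PySpark daemon code; the payload and variable names here are illustrative, and `os.fork` requires a POSIX system such as the Linux environment noted in the ticket):

```python
import os
import pickle

# Simulate a serialized broadcast variable as it would arrive at the executor.
payload = pickle.dumps({"weights": list(range(1000))})

# Daemon side: deserialize ONCE, before any worker is forked.
broadcast_value = pickle.loads(payload)

pid = os.fork()
if pid == 0:
    # Worker (child) side: the deserialized object is inherited via
    # copy-on-write page sharing; no per-task read or deserialization.
    assert broadcast_value["weights"][999] == 999
    os._exit(0)
else:
    # Daemon side: reap the worker.
    os.waitpid(pid, 0)
```

Because the pages holding `broadcast_value` are shared copy-on-write, forking N workers costs almost nothing extra as long as the workers only read the object; pages are duplicated only if a process writes to them.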

This message was sent by Atlassian JIRA
