[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828709#comment-15828709
 ] 

holdenk commented on SPARK-17602:
---------------------------------

Ah yes, sorry I've been pretty busy. I just had an interesting chat over the 
weekend at a conference with someone who was running into some challenges that 
could be improved by this, so let's take a look. If [~davies] has some bandwidth 
to look at the design doc, that would be a good starting point; otherwise, making 
a PR would maybe be a good next step, and then [~davies] or I could take a 
look after Spark Summit (I've got some stuff I need to get in order before 
then).

> PySpark - Performance Optimization Large Size of Broadcast Variable
> -------------------------------------------------------------------
>
>                 Key: SPARK-17602
>                 URL: https://issues.apache.org/jira/browse/SPARK-17602
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Linux
>            Reporter: Xiao Ming Bao
>         Attachments: PySpark – Performance Optimization for Large Size of 
> Broadcast variable.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Problem: currently at the executor side, the broadcast variable is written to 
> disk as a file, and each python worker process reads it from local disk and 
> de-serializes it to a python object before executing a task. When the size of 
> the broadcast variable is large, the read/de-serialization takes a lot of 
> time. And when the python worker is NOT reused and the number of tasks is 
> large, performance is very bad, since the python worker needs to 
> read/de-serialize for each task. 
> Brief of the solution:
>  transfer the broadcast variable to the daemon python process via a file (or 
> socket/mmap) and deserialize the file to an object in the daemon python 
> process. After a worker python process is forked by the daemon python 
> process, the worker automatically has the deserialized object and can use it 
> directly, because of the copy-on-write memory semantics of Linux.
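The fork/copy-on-write idea in the quoted description can be sketched outside Spark with a minimal standalone script (all names here are hypothetical, not PySpark internals): the parent process plays the role of the daemon and deserializes the broadcast file once; the forked child plays the worker and uses the object without any further read or unpickle.

```python
import os
import pickle

# Stand-in for a large broadcast variable spilled to local disk by the executor.
payload = {"weights": list(range(100_000))}
path = "/tmp/bd_demo.pkl"
with open(path, "wb") as f:
    pickle.dump(payload, f)

# "Daemon" side: read and deserialize ONCE, before any worker is forked.
with open(path, "rb") as f:
    broadcast_obj = pickle.load(f)

pid = os.fork()  # Linux-only, matching the issue's environment
if pid == 0:
    # "Worker" side: no disk read, no unpickle -- the object is already in
    # memory, shared with the parent via copy-on-write pages.
    assert broadcast_obj["weights"][42] == 42
    os._exit(0)
else:
    _, status = os.waitpid(pid, 0)
    assert os.WEXITSTATUS(status) == 0

os.remove(path)
```

Pages are only duplicated if either process writes to them, so a read-only broadcast object costs each forked worker essentially nothing beyond the parent's single deserialization.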



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
