GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/3417

    [SPARK-4548] improve performance of python broadcast

    Re-implement the Python broadcast using files:
    
    1) Serialize the Python object using cPickle and write it to disk.
    2) Create a wrapper in the JVM for the dumped file; it reads the data 
from disk during serialization.
    3) Use TorrentBroadcast or HttpBroadcast to transfer the (compressed) 
data to the executors.
    4) During deserialization, write the data back to disk.
    5) Pass the file path to the Python worker, which reads the data from 
disk and unpickles it into a Python object lazily, on first access.
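    The flow above can be sketched in plain Python (a minimal illustration, 
not the actual PySpark code; `dump_to_file` and `LazyBroadcastValue` are 
hypothetical names for steps 1 and 5):
    
    ```python
    import os
    import pickle  # cPickle on Python 2; stdlib pickle here
    import tempfile
    
    def dump_to_file(value):
        """Step 1: serialize a Python object and write it to disk."""
        fd, path = tempfile.mkstemp(suffix=".pickle")
        with os.fdopen(fd, "wb") as f:
            # protocol 2 avoids the text protocol's size overhead
            pickle.dump(value, f, protocol=2)
        return path
    
    class LazyBroadcastValue(object):
        """Step 5: unpickle the dumped file only on first access."""
        def __init__(self, path):
            self._path = path
            self._value = None
            self._loaded = False
    
        @property
        def value(self):
            if not self._loaded:
                with open(self._path, "rb") as f:
                    self._value = pickle.load(f)
                self._loaded = True
            return self._value
    
    # usage: nothing is unpickled until .value is touched
    path = dump_to_file({"weights": list(range(10))})
    bv = LazyBroadcastValue(path)
    assert bv.value["weights"][3] == 3
    os.remove(path)
    ```
    
    The lazy load is what keeps only one compressed copy resident until a 
task actually dereferences the broadcast.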
    
    It fixes the performance regression introduced in #2659 and performs 
similarly to 1.1, but supports objects larger than 2G and improves memory 
efficiency (only one compressed copy in the driver and in each executor).
                            
    Testing with a 500M broadcast and 4 tasks (excluding the benefit of 
worker reuse in 1.2):
    
                 name             |   1.1  | 1.2 with this patch | improvement
    ==============================|========|=====================|============
     python-broadcast-w-bytes     |  25.20 |        9.33         |   170.13%
     python-broadcast-w-set       |   4.13 |        4.50         |    -8.35%
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark pybroadcast

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3417.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3417
    
----
commit dde02dd57499ff3e8dd263d84961ff7005c385f0
Author: Davies Liu <[email protected]>
Date:   2014-11-22T08:17:30Z

    improve performance of python broadcast

commit 09303b8efeebc2375e246a86beafe79dc36dde6c
Author: Davies Liu <[email protected]>
Date:   2014-11-22T19:13:56Z

    read all data into memory

commit e5ee6b98f625a10c075322c9d2c1fbabe80510d6
Author: Davies Liu <[email protected]>
Date:   2014-11-22T19:31:50Z

    support large string

commit b98de1d9ea133e32bc577342078fa942799963a0
Author: Davies Liu <[email protected]>
Date:   2014-11-22T19:42:46Z

    disable gc while unpickle

----
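The last commit's optimization, pausing the garbage collector around
unpickling, can be sketched as follows (a generic illustration of the
technique, not the patch's code; `loads_without_gc` is a hypothetical name):

```python
import gc
import pickle

def loads_without_gc(data):
    """Unpickle with the cyclic GC disabled: deserializing a large object
    graph allocates many objects, and each allocation threshold crossing
    would otherwise trigger a pointless collection pass."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        return pickle.loads(data)
    finally:
        if was_enabled:
            gc.enable()

blob = pickle.dumps([{"k": i} for i in range(1000)], protocol=2)
assert loads_without_gc(blob)[42]["k"] == 42
```

The freshly unpickled objects cannot be garbage yet, so skipping
collections during the load is safe and avoids the GC's traversal cost.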


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
