GitHub user jegonzal opened a pull request:
https://github.com/apache/spark/pull/10550
Adding zipPartitions to PySpark
This work-in-progress adds support for `zipPartitions` to PySpark.
This is accomplished by modifying the PySpark `worker` (in both daemon and
non-daemon mode) to open a second socket back to the Spark process. The
second socket is used to send tuples from the second iterator in
`zipPartitions`, enabling the user-defined function to pull tuples from
both iterators at different rates without requiring a back-and-forth
protocol over the primary socket. A single-socket protocol was considered,
but it creates issues with the built-in serializers and would require much
larger changes. The second socket is always created when the worker
process launches and is simply ignored if it is not needed.
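For context, a minimal usage sketch of the proposed API. The Python
signature shown here is an assumption modeled on Scala's
`RDD.zipPartitions`; the exact API in this WIP may differ.

```python
from pyspark import SparkContext

sc = SparkContext(appName="zipPartitionsExample")

# Two RDDs with the same number of partitions.
evens = sc.parallelize(range(0, 20, 2), 4)
odds = sc.parallelize(range(1, 20, 2), 4)

def interleave(iter_a, iter_b):
    # The user-defined function receives one iterator per RDD for each
    # pair of partitions and is free to consume them at different rates;
    # this example simply interleaves the two streams.
    for a, b in zip(iter_a, iter_b):
        yield a
        yield b

# Hypothetical call mirroring Scala's rdd1.zipPartitions(rdd2)(f).
result = evens.zipPartitions(odds, interleave)
print(result.collect())
```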
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jegonzal/spark multi_iterator_pyspark
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10550.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10550
----
commit 70650ab94ae5dceca2dd6a970035d45dffdce2b1
Author: Joseph Gonzalez <[email protected]>
Date: 2016-01-02T01:40:10Z
compiling prototype
commit 61512acb2dba276b2bbd1bca5d22ff2474f6def5
Author: Joseph Gonzalez <[email protected]>
Date: 2016-01-02T03:51:40Z
addressing a bug where sockets could get created multiple times
----