GitHub user jegonzal opened a pull request:
https://github.com/apache/spark/pull/10550
Adding zipPartitions to PySpark
This work-in-progress adds support for `zipPartitions` to PySpark.
This is accomplished by modifying the PySpark `worker` (in both daemon and
non-daemon mode) to open a second socket back to the Spark process. The
second socket is used to send tuples from the second iterator in
`zipPartitions`, enabling the user-defined function to pull tuples from
both iterators at different rates without requiring a back-and-forth
protocol over the primary socket. A single-socket protocol was considered,
but it creates issues with the built-in serializers and would require much
larger changes. The second socket is always created when the worker
process launches and is simply ignored if it is not needed.
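For context, a minimal usage sketch of the proposed API. The Python
signature shown here is an assumption modeled on Scala's
`RDD.zipPartitions`; the exact API in this WIP may differ.

```python
from pyspark import SparkContext

sc = SparkContext(appName="zipPartitionsExample")

# Two RDDs with the same number of partitions.
evens = sc.parallelize(range(0, 20, 2), 4)
odds = sc.parallelize(range(1, 20, 2), 4)

def interleave(iter_a, iter_b):
    # The user-defined function receives one iterator per RDD for each
    # pair of partitions and is free to consume them at different rates;
    # this example simply interleaves the two streams.
    for a, b in zip(iter_a, iter_b):
        yield a
        yield b

# Hypothetical call mirroring Scala's rdd1.zipPartitions(rdd2)(f).
result = evens.zipPartitions(odds, interleave)
print(result.collect())
```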
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jegonzal/spark multi_iterator_pyspark
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10550.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10550
----
commit 70650ab94ae5dceca2dd6a970035d45dffdce2b1
Author: Joseph Gonzalez <[email protected]>
Date: 2016-01-02T01:40:10Z
compiling prototype
commit 61512acb2dba276b2bbd1bca5d22ff2474f6def5
Author: Joseph Gonzalez <[email protected]>
Date: 2016-01-02T03:51:40Z
addressing a bug where sockets could get created multiple times
----