GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/16263
[SPARK-18281][SQL][PySpark] Consumes the returned local iterator
immediately to prevent timeout on the socket serving the data
## What changes were proposed in this pull request?
There is a timeout failure when using `rdd.toLocalIterator()` or
`df.toLocalIterator()` on a PySpark RDD or DataFrame:

```python
df = spark.createDataFrame([[1], [2], [3]])
it = df.toLocalIterator()
row = next(it)
```
The cause of this issue is that the JVM side opens a socket to serve the data
and sets a timeout for accepting a connection. If the local iterator returned
by `toLocalIterator` is not consumed in Python promptly, the socket accept
times out and the call fails.
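The failure mode can be sketched in plain Python, independent of Spark's actual internals (all names below, such as `serve_data` and `local_iterator`, are illustrative): the server side only waits a bounded time for a client to connect, so the client must connect before it starts iterating lazily.

```python
import socket
import threading
import time

def serve_data(rows, timeout=3.0):
    """Serve `rows` to the first client over a local socket, but wait at
    most `timeout` seconds for a connection (analogous to the JVM side)."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    srv.settimeout(timeout)
    port = srv.getsockname()[1]

    def run():
        try:
            conn, _ = srv.accept()  # fails if nobody connects within `timeout`
            conn.sendall("\n".join(map(str, rows)).encode())
            conn.close()
        except socket.timeout:
            pass
        finally:
            srv.close()

    threading.Thread(target=run, daemon=True).start()
    return port

def local_iterator(port):
    """Connect to the serving socket immediately, then yield rows lazily.

    Because the connection is made here, at call time, the server's accept
    cannot time out even if the caller delays consuming the iterator."""
    cli = socket.create_connection(("127.0.0.1", port))
    reader = cli.makefile()
    return (line.strip() for line in reader)

port = serve_data([1, 2, 3])
it = local_iterator(port)  # connection established now, before any next()
time.sleep(0.1)            # a delay here no longer causes a timeout
print(next(it))            # prints "1"
```

If the client instead deferred the connection until the first `next(it)` call, any pause longer than the server's timeout between obtaining the iterator and consuming it would make `accept` fail, which is the behavior this patch addresses.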
## How was this patch tested?
Added tests into PySpark.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 fix-pyspark-localiterator
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16263.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16263
----
commit 6905a700376b2deff77ff539400951cf5e12885d
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-12-13T03:47:10Z
Consumes the returned local iterator immediately to prevent timeout on the
socket serving the data.
----