GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/16263
[SPARK-18281][SQL][PySpark] Consumes the returned local iterator
immediately to prevent timeout on the socket serving the data
## What changes were proposed in this pull request?
There is a timeout failure when using `rdd.toLocalIterator()` or
`df.toLocalIterator()` on a PySpark RDD or DataFrame:

```python
df = spark.createDataFrame([[1], [2], [3]])
it = df.toLocalIterator()
row = next(it)
```
The cause of this issue is that the JVM side opens a socket to serve the data
and sets a timeout for accepting a connection. If the local iterator returned
by `toLocalIterator` is not consumed in Python promptly, the socket accept
times out and the call fails.
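The failure mode can be sketched in plain Python, independent of Spark's actual internals (all names below, such as `serve_data` and `local_iterator`, are illustrative): the server side only waits a bounded time for a client to connect, so the client must connect before it starts iterating lazily.

```python
import socket
import threading
import time

def serve_data(rows, timeout=3.0):
    """Serve `rows` to the first client over a local socket, but wait at
    most `timeout` seconds for a connection (analogous to the JVM side)."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    srv.settimeout(timeout)
    port = srv.getsockname()[1]

    def run():
        try:
            conn, _ = srv.accept()  # fails if nobody connects within `timeout`
            conn.sendall("\n".join(map(str, rows)).encode())
            conn.close()
        except socket.timeout:
            pass
        finally:
            srv.close()

    threading.Thread(target=run, daemon=True).start()
    return port

def local_iterator(port):
    """Connect to the serving socket immediately, then yield rows lazily.

    Because the connection is made here, at call time, the server's accept
    cannot time out even if the caller delays consuming the iterator."""
    cli = socket.create_connection(("127.0.0.1", port))
    reader = cli.makefile()
    return (line.strip() for line in reader)

port = serve_data([1, 2, 3])
it = local_iterator(port)  # connection established now, before any next()
time.sleep(0.1)            # a delay here no longer causes a timeout
print(next(it))            # prints "1"
```

If the client instead deferred the connection until the first `next(it)` call, any pause longer than the server's timeout between obtaining the iterator and consuming it would make `accept` fail, which is the behavior this patch addresses.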
## How was this patch tested?
Added tests into PySpark.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 fix-pyspark-localiterator
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16263.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16263
----
commit 6905a700376b2deff77ff539400951cf5e12885d
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-12-13T03:47:10Z
Consumes the returned local iterator immediately to prevent timeout on the
socket serving the data.
----