GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/1503

    SPARK-2282: Reuse Socket for sending accumulator updates to PySpark

    Prior to this change, every PySpark task completion opened a new socket
    to the accumulator server, passed its updates through, and then quit.
    I'm not entirely sure why PySpark always sends accumulator updates, but
    regardless, this causes a very rapid buildup of ephemeral TCP connections
    that remain in the TIME_WAIT state for around a minute before being
    cleaned up.
    
    Rather than trying to get these sockets cleaned up faster, this patch
    simply reuses the connection between task completions (which is safe
    because the DAGScheduler feeds them updates in a single-threaded manner
    anyway).
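
    As a rough illustration of the idea only (a Python sketch with
    hypothetical names, not the code in this patch), the client side keeps
    a single lazily-opened connection and reuses it for every subsequent
    task completion, which is safe precisely because the updates arrive
    one at a time:

        import socket
        import struct

        class ReusableAccumulatorClient(object):
            """Hypothetical sketch: one long-lived connection to the
            accumulator server, reused for every task completion, instead
            of a new socket (and a lingering TIME_WAIT entry) per task."""

            def __init__(self, host, port):
                self.host, self.port = host, port
                self._sock = None

            def send_updates(self, payload):
                # payload: already-serialized accumulator updates (bytes).
                if self._sock is None:
                    # Open the connection lazily, once; later calls reuse it.
                    self._sock = socket.create_connection((self.host, self.port))
                self._sock.sendall(struct.pack(">i", len(payload)) + payload)

            def close(self):
                if self._sock is not None:
                    self._sock.close()
                    self._sock = None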
    
    The only tricky part here was making sure that the AccumulatorServer was
    able to shut down in a timely manner (i.e., stop polling for new data
    rather than blocking indefinitely), and this was accomplished via minor
    feats of magic.
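
    Concretely, "stop polling" can be pictured with another hypothetical
    Python sketch (again, not the actual patch): the server polls its one
    long-lived connection with a short select() timeout and re-checks a
    shutdown flag on every iteration, so shutdown takes effect within one
    poll interval instead of hanging on a blocking read:

        import select
        import socket
        import threading

        class PollingAccumulatorServer(threading.Thread):
            """Hypothetical sketch of the shutdown concern: keep reading
            updates from one long-lived connection, but re-check a shutdown
            flag on every poll so the server can exit promptly."""

            def __init__(self, host="localhost", port=0, poll_interval=0.1):
                threading.Thread.__init__(self)
                self.daemon = True
                self._poll_interval = poll_interval
                self._shutdown = threading.Event()
                self._listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                self._listener.bind((host, port))
                self._listener.listen(1)
                self.port = self._listener.getsockname()[1]

            def run(self):
                conn = None
                while not self._shutdown.is_set():
                    # Poll with a timeout so the shutdown flag is re-checked
                    # at least every poll_interval seconds, instead of
                    # blocking forever on accept()/recv().
                    sock = conn if conn is not None else self._listener
                    ready, _, _ = select.select([sock], [], [], self._poll_interval)
                    if not ready:
                        continue
                    if sock is self._listener:
                        conn, _ = self._listener.accept()
                    else:
                        data = conn.recv(65536)
                        if not data:  # peer closed the connection
                            conn.close()
                            conn = None
                        else:
                            self.handle_updates(data)
                if conn is not None:
                    conn.close()
                self._listener.close()

            def handle_updates(self, data):
                # Deserialize and merge accumulator updates here (elided).
                pass

            def shutdown(self):
                # Called from another thread; run() notices within one
                # poll_interval and stops polling for new data.
                self._shutdown.set()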
    
    I have confirmed that this patch eliminates the buildup of ephemeral
    sockets due to accumulator updates. However, I did note that a
    significant number of sockets were still being created against the
    PySpark daemon port, although my machine could not create sockets fast
    enough for that to actually fail. This may not be the last we've seen
    of this issue, though.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark accum

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1503.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1503
    
----
commit b3e12f7504150b8f355e263d740d8f4c2f788820
Author: Aaron Davidson <[email protected]>
Date:   2014-07-20T23:31:18Z

    SPARK-2282: Reuse Socket for sending accumulator updates to PySpark
    
    Prior to this change, every PySpark task completion opened a new
    socket to the accumulator server, passed its updates through, and
    then quit. I'm not entirely sure why PySpark always sends accumulator
    updates, but regardless, this causes a very rapid buildup of ephemeral
    TCP connections that remain in the TIME_WAIT state for around a minute
    before being cleaned up.
    
    Rather than trying to get these sockets cleaned up faster, this
    patch simply reuses the connection between task completions (which is
    safe because the DAGScheduler feeds them updates in a single-threaded
    manner anyway).
    
    The only tricky part here was making sure that the AccumulatorServer was
    able to shut down in a timely manner (i.e., stop polling for new data),
    and this was accomplished via minor feats of magic.

----


