[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065272#comment-14065272 ]
Ken Carlile commented on SPARK-2282:
------------------------------------

So I've tried a few different things at this point, and I see the behavior regardless of how I have the sysctls set. Using a

{code}watch -n1 "netstat -anp | grep TIME_WAIT | wc -l"{code}

command, I can watch the number of ephemeral ports in use climb (to somewhere north of 29,000) until the job crashes out, after which the number of TIME_WAIT sockets gradually decreases. If I use a 5-node cluster, I can see the count climbing, but it also decreases while the job is running, and the 20 iterations manage to complete; the maximum I have seen with 5 nodes is ~19,000 TIME_WAITs. I see the exact same behavior with the sysctls turned off, so I have to assume that the code change has worked around the issue. However, our cluster has very, very fast communication between nodes, so with larger worker counts (i.e., greater than, say, 10) we still run up against the core issue: Spark using a ton of ephemeral ports for PySpark.

> PySpark crashes if too many tasks complete quickly
> --------------------------------------------------
>
>                 Key: SPARK-2282
>                 URL: https://issues.apache.org/jira/browse/SPARK-2282
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.1, 1.0.0, 1.0.1
>            Reporter: Aaron Davidson
>            Assignee: Aaron Davidson
>             Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to
> the Accumulator server running inside the pyspark daemon. This can cause a
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination
> stage, which will cause the SparkContext to crash if too many tasks complete
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set
> (on a Linux server):
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
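The SO_REUSEADDR workaround mentioned in the description can be sketched in plain Java. This is a hypothetical stand-in illustrating the standard `java.net` API, not Spark's actual patch: the flag must be set before the socket is bound, and it allows the address/port to be rebound even while earlier connections on it are still lingering in TIME_WAIT. (The same `setReuseAddress` call exists on client `Socket` objects, which is the side where PythonAccumulatorParam's per-task connections would accumulate.)

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReuseAddrSketch {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket();
        // SO_REUSEADDR must be enabled before bind(); it permits rebinding
        // a port whose previous sockets are still in the TIME_WAIT state.
        server.setReuseAddress(true);
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        System.out.println(server.getReuseAddress()); // prints "true"
        server.close();
    }
}
```

Note that SO_REUSEADDR only relaxes bind-time checks; it does not shorten TIME_WAIT itself, so the sysctl approach and the reuse flag attack the port-exhaustion problem from different ends.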