[
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062916#comment-14062916
]
Ken Carlile edited comment on SPARK-2282 at 7/16/14 12:17 AM:
--------------------------------------------------------------
We may be running into this issue on our cluster. Does this property need to be
set on all nodes, or only on the master? I ask because we dynamically spin up
Spark clusters on a larger general-purpose compute cluster, so I'm hesitant to
start changing sysctls willy-nilly unless I absolutely have to.
Alternatively, is SO_REUSEADDR merely a setting one can change in one of the
conf files, or is it set within the Spark code itself? (I'm coming at this from
a sysadmin point of view, so the former would be much easier!)
The odd thing is that we're seeing it on 1.0.1, in which it is supposed to be
fixed...
Thanks,
Ken
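
One way to see which nodes are actually affected, before changing any sysctls,
is to count the sockets sitting in TIME_WAIT on each host. The following is a
minimal sketch, assuming the Linux /proc/net/tcp layout in which the fourth
column is the socket state in hex and 06 means TIME_WAIT. The function name
count_time_wait is hypothetical and is not part of Spark.

```python
def count_time_wait(proc_net_tcp_text):
    """Count TIME_WAIT sockets in text formatted like Linux's /proc/net/tcp.

    Assumption: the first line is a header, and the fourth whitespace-separated
    field of every other line is the hex socket state (06 = TIME_WAIT).
    """
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == "06":
            count += 1
    return count


if __name__ == "__main__":
    # Only meaningful on Linux; covers IPv4 sockets only (/proc/net/tcp6 has IPv6).
    with open("/proc/net/tcp") as f:
        print(count_time_wait(f.read()))
```

Running this on the master and on a worker while a job with many short tasks is
in flight should show where the count climbs.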
> PySpark crashes if too many tasks complete quickly
> --------------------------------------------------
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 0.9.1, 1.0.0, 1.0.1
> Reporter: Aaron Davidson
> Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to
> the Accumulator server running inside the pyspark daemon. This can cause a
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination
> stage, which will cause the SparkContext to crash if too many tasks complete
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set
> (on a Linux server):
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.
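
For readers unfamiliar with the socket option mentioned above, the following is
a minimal sketch of what setting SO_REUSEADDR looks like, written in Python with
the standard socket module. It is an illustration only and is not Spark's actual
code; the helper name make_reusable_socket is hypothetical. The option is set in
application code when the socket is created, not in a configuration file.

```python
import socket


def make_reusable_socket():
    """Create a TCP socket with SO_REUSEADDR enabled (illustrative helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind()/connect() to have any effect on address reuse.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock


if __name__ == "__main__":
    s = make_reusable_socket()
    # A nonzero value means the option is enabled on this socket.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0)
    s.close()
```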