[
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069758#comment-14069758
]
Jeremy Freeman edited comment on SPARK-2282 at 7/22/14 3:18 AM:
----------------------------------------------------------------
Hi all, I'm "the scientist". A couple of updates from more real-world testing,
and it's looking very promising!
- Set-up: 60-node cluster, running an analysis with iterative updates
(essentially a sequence of two map-reduce steps on each iteration), with the
data cached and counted before starting iterations (see the sketch after this
list)
- 250 GB data set, 4000 tasks per stage, ~6 seconds for each stage to
complete. Before the patch I reliably hit the error after about 5 iterations;
with the patch, 20+ iterations complete.
- 2.3 TB data set, 26000 tasks per stage, ~27 seconds for each stage to
complete. Before the patch, running more than one iteration always failed;
with the patch, 20+ iterations complete.
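For anyone wanting to reproduce the pattern, here's a minimal PySpark sketch
of this kind of workload; the toy computation (an iterative mean estimate),
app name, and partition count are illustrative stand-ins, not our actual
analysis:

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="iterative-stress-test")

# Cache the data and force materialization with a count before starting
# iterations, matching the set-up above; 4000 partitions means 4000 tasks
# per stage.
data = sc.parallelize(range(1000000), 4000).cache()
n = data.count()

estimate = 0.0
for i in range(20):
    # First map-reduce step: mean residual against the current estimate.
    estimate += data.map(lambda x: x - estimate).reduce(add) / n
    # Second map-reduce step: squared error under the updated estimate.
    sse = data.map(lambda x: (x - estimate) ** 2).reduce(add)
    print("iteration %d: estimate=%f sse=%f" % (i, estimate, sse))

Each reduce is an action, so every iteration completes ~8000 tasks, each of
which (before the patch) opened a fresh socket to the accumulator server.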
So it's looking really good. I can also try the other extreme (a very small
cluster) to see whether the issue manifests there. Aaron, big thanks for
helping with this; it's a big deal for our workflows, so it's really terrific
to get to the bottom of it!
-- Jeremy
> PySpark crashes if too many tasks complete quickly
> --------------------------------------------------
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 0.9.1, 1.0.0, 1.0.1
> Reporter: Aaron Davidson
> Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.1, 1.1.0
>
>
> Upon every task completion, PythonAccumulatorParam opens a new socket to
> the Accumulator server running inside the pyspark daemon. This can cause a
> buildup of ephemeral ports tied up by sockets in the TCP TIME_WAIT state,
> which will crash the SparkContext if too many tasks complete too quickly. We
> ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be worked around outside of Spark by setting these kernel
> parameters (on a Linux server):
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.
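> A rough sketch of that workaround in Python (illustrative only; the
> connect_with_reuse helper is hypothetical, and the actual socket creation
> lives on the JVM side in PythonAccumulatorParam):
>
> import socket
>
> def connect_with_reuse(host, port):
>     # SO_REUSEADDR asks the kernel to allow binding a local address that
>     # still has a connection lingering in TIME_WAIT.
>     s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>     s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>     s.connect((host, port))
>     return s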