[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

Ken Carlile (JIRA) Thu, 17 Jul 2014 08:23:45 -0700

    [ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065025#comment-14065025 ]


Ken Carlile commented on SPARK-2282:
------------------------------------

A little more info:
Nodes are running Scientific Linux 6.3 (Linux 2.6.32-279.el6.x86_64 #1 SMP Thu Jun 21 07:08:44 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux).
Spark is run against Python 2.7.6, Java 1.7.0_25, and Scala 2.10.3.

spark-env.sh:
{code}
#!/usr/bin/env bash
ulimit -n 65535
export SCALA_HOME=/usr/local/scala-2.10.3
export SPARK_WORKER_DIR=/scratch/spark/work
export JAVA_HOME=/usr/local/jdk1.7.0_25
export SPARK_LOG_DIR=~/.spark/logs/$JOB_ID/
export SPARK_EXECUTOR_MEMORY=100g
export SPARK_DRIVER_MEMORY=100g
export SPARK_WORKER_MEMORY=100g
export SPARK_LOCAL_DIRS=/scratch/spark/tmp
export PYSPARK_PYTHON=/usr/local/python-2.7.6/bin/python
export SPARK_SLAVES=/scratch/spark/tmp/slaves
{code}

spark-defaults.conf:
{code}
spark.akka.timeout=300 
spark.storage.blockManagerHeartBeatMs=30000 
spark.akka.retry.wait=30 
spark.akka.frameSize=10000
{code}

> PySpark crashes if too many tasks complete quickly
> --------------------------------------------------
>
>                 Key: SPARK-2282
>                 URL: https://issues.apache.org/jira/browse/SPARK-2282
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.1, 1.0.0, 1.0.1
>            Reporter: Aaron Davidson
>            Assignee: Aaron Davidson
>             Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a Linux server):
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.
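
To make the pattern in the description concrete, here is a minimal Python sketch of "one new socket per update" with SO_REUSEADDR set on each socket, which is the option the description proposes. This is not Spark's actual code. The server, the handler, and the names AckHandler, start_server, open_socket and send_update are all hypothetical stand-ins, and the sketch does not show whether SO_REUSEADDR by itself is enough to avoid running out of ephemeral ports.

```python
import socket
import socketserver
import threading


class AckHandler(socketserver.StreamRequestHandler):
    """Hypothetical stand-in for the accumulator server: counts one update per line."""

    def handle(self):
        for _line in self.rfile:
            with self.server.lock:
                self.server.updates += 1
            self.wfile.write(b"ok\n")


def start_server():
    """Start the stand-in server on a free loopback port."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), AckHandler)
    server.daemon_threads = True
    server.updates = 0
    server.lock = threading.Lock()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def open_socket(address):
    """Create a client socket with SO_REUSEADDR set before connecting."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.connect(address)
    return sock


def send_update(address, payload):
    """Open a fresh socket for one update, as the description says happens per task."""
    with open_socket(address) as sock, sock.makefile("rb") as reader:
        sock.sendall(payload + b"\n")
        return reader.readline()  # wait for the server's acknowledgement


if __name__ == "__main__":
    server = start_server()
    for i in range(5):
        send_update(server.server_address, b"update %d" % i)
    print(server.updates)
    server.shutdown()
    server.server_close()
```

Each call to send_update opens and closes its own connection, so every update leaves behind one closed connection. At the rate in the description (17k tasks in 15 seconds) that is roughly 1,100 new connections per second.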



--
This message was sent by Atlassian JIRA
(v6.2#6252)
