[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065025#comment-14065025 ]
Ken Carlile commented on SPARK-2282:
------------------------------------

A little more info:

Nodes are running Scientific Linux 6.3 (Linux 2.6.32-279.el6.x86_64 #1 SMP Thu Jun 21 07:08:44 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux)
Spark is run against Python 2.7.6, Java 1.7.0.25, and Scala 2.10.3.

spark-env.sh
{code}
#!/usr/bin/env bash
ulimit -n 65535
export SCALA_HOME=/usr/local/scala-2.10.3
export SPARK_WORKER_DIR=/scratch/spark/work
export JAVA_HOME=/usr/local/jdk1.7.0_25
export SPARK_LOG_DIR=~/.spark/logs/$JOB_ID/
export SPARK_EXECUTOR_MEMORY=100g
export SPARK_DRIVER_MEMORY=100g
export SPARK_WORKER_MEMORY=100g
export SPARK_LOCAL_DIRS=/scratch/spark/tmp
export PYSPARK_PYTHON=/usr/local/python-2.7.6/bin/python
export SPARK_SLAVES=/scratch/spark/tmp/slaves
{code}

spark-defaults.conf:
{code}
spark.akka.timeout=300
spark.storage.blockManagerHeartBeatMs=30000
spark.akka.retry.wait=30
spark.akka.frameSize=10000
{code}

> PySpark crashes if too many tasks complete quickly
> --------------------------------------------------
>
>                 Key: SPARK-2282
>                 URL: https://issues.apache.org/jira/browse/SPARK-2282
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.1, 1.0.0, 1.0.1
>            Reporter: Aaron Davidson
>            Assignee: Aaron Davidson
>             Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to
> the Accumulator server running inside the pyspark daemon. This can cause a
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination
> stage, which will cause the SparkContext to crash if too many tasks complete
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set
> (on a linux server);
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.
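
For a rough sense of the port pressure described in the issue, here is a minimal Python sketch. It assumes one short-lived socket per completed task, a Linux ephemeral port range of 32768-61000, and a 60-second TIME_WAIT period; none of those numbers come from the report, and the actual range on these nodes has not been checked. The helper names are hypothetical. The second function only shows how SO_REUSEADDR is set on a socket in Python; it is not Spark's code, and it makes no claim that the option alone avoids the crash.

```python
import socket

# Assumed Linux defaults, NOT taken from the report: ephemeral port range
# 32768-61000 and a 60-second TIME_WAIT period.
EPHEMERAL_PORTS = 61000 - 32768 + 1
TIME_WAIT_SECONDS = 60


def ports_held_in_time_wait(tasks, seconds, time_wait=TIME_WAIT_SECONDS):
    """Estimate how many ports are in TIME_WAIT at the end of a burst.

    Assumes one socket per task, closed at a uniform rate over `seconds`.
    Only sockets closed within the last `time_wait` seconds still hold a port.
    """
    return tasks * min(seconds, time_wait) // seconds


def make_reusable_socket():
    """Create a TCP socket with SO_REUSEADDR set before any bind or connect."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock


if __name__ == "__main__":
    # The burst from the issue description: 17k tasks in 15 seconds.
    burst = ports_held_in_time_wait(17000, 15)
    # The same rate sustained for a full minute (68k tasks in 60 seconds).
    sustained = ports_held_in_time_wait(68000, 60)
    print("assumed ephemeral ports: %d" % EPHEMERAL_PORTS)
    print("held after 15 s burst:   %d" % burst)
    print("held after 60 s at rate: %d" % sustained)

    sock = make_reusable_socket()
    flag = sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
    print("SO_REUSEADDR set: %s" % (flag != 0))
    sock.close()
```

Under these assumptions the 15-second burst alone ties up well over half of the range, and the same completion rate sustained for a minute would need more ports than the range contains, which is consistent with the crash described above.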
--
This message was sent by Atlassian JIRA
(v6.2#6252)