shane knapp created SPARK-8571:
----------------------------------
Summary: spark streaming hanging processes upon build exit
Key: SPARK-8571
URL: https://issues.apache.org/jira/browse/SPARK-8571
Project: Spark
Issue Type: Bug
Components: Build, Streaming
Environment: centos 6.6 amplab build system
Reporter: shane knapp
Priority: Minor
over the past 3 months i've been noticing that there are occasionally hanging
processes on our build system workers after various spark builds have finished.
these are all spark streaming processes.
today i noticed a 3+ hour spark build that was timed out after 200 minutes
(https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/),
and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on
amp-jenkins-worker-02. after the timeout, it left the following process (and
all of its children) hanging.
the process' CLI command was:
{{monospaced}}
[root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714
jenkins 1714 733 2.7 21342148 3642740 ? Sl 07:52 1713:41 java
-Dderby.system.durability=test -Djava.awt.headless=true
-Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp
-Dspark.driver.allowMultipleContexts=true
-Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos
-Dspark.testing=1 -Dspark.ui.enabled=false
-Dspark.ui.showConsoleProgress=false
-Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming
-ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m
org.scalatest.tools.Runner -R
/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes
/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes
-o -f
/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt
-u
/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/.
{{monospaced}}
stracing that process doesn't give us much:
{{monospaced}}
[root@amp-jenkins-worker-02 ~]# strace -p 1714
Process 1714 attached - interrupt to quit
futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL
{{monospaced}}
stracing its children gives us a *little* bit more... some loop like this:
{{monospaced}}
<snip>
futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273},
ffffffff) = -1 ETIMEDOUT (Connection timed out)
{{monospaced}}
and others loop on ptrace_attach (no such process) or restart_syscall
(resuming interrupted call).
even though this behavior has been solidly pinned to jobs timing out (which
ends w/an aborted, not failed, build), i've seen it happen for failed builds as
well. if i see any hanging processes from failed (not aborted) builds, i will
investigate them and update this bug as well.
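to make spotting these leftovers less manual, here is a minimal sketch (not something currently running on the build system) of a check that could be run on a worker after a build exits. it assumes python 3 and a procps-style `ps`; the function names `find_leftover_runners` and `list_processes` are made up for illustration. the workspace prefix and the `org.scalatest.tools.Runner` class name are taken from the command line shown above.

```python
import subprocess

# Class name and workspace prefix are taken from the hung process shown above.
RUNNER_CLASS = "org.scalatest.tools.Runner"
WORKSPACE_ROOT = "/home/jenkins/workspace/"


def find_leftover_runners(ps_output, workspace_root=WORKSPACE_ROOT):
    """Return (pid, elapsed, command) for each scalatest runner under workspace_root.

    ps_output is expected to have three whitespace-separated columns per line:
    pid, elapsed time, and the full command line.
    """
    leftovers = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # blank or truncated line
        pid, elapsed, command = fields
        if not pid.isdigit():
            continue  # skip a header row if one is present
        if RUNNER_CLASS in command and workspace_root in command:
            leftovers.append((int(pid), elapsed, command))
    return leftovers


def list_processes():
    """Capture pid, elapsed time and full command line for every process.

    Assumes a procps-style ps: -e selects all processes, -ww disables
    line truncation, and the trailing '=' on each -o column drops the header.
    """
    result = subprocess.run(
        ["ps", "-e", "-ww", "-o", "pid=", "-o", "etime=", "-o", "args="],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

calling `find_leftover_runners(list_processes())` while no build is running on the worker should return an empty list; anything it does return is a candidate hanging process worth looking at. note this only matches on the command line, so it would also match a runner belonging to a build that is still legitimately in progress.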