Lei Xu created MESOS-4297:
-----------------------------
Summary: Executor does not shut down on framework teardown.
Key: MESOS-4297
URL: https://issues.apache.org/jira/browse/MESOS-4297
Project: Mesos
Issue Type: Bug
Components: framework
Affects Versions: 0.25.0
Environment: Marathon 0.11.0
Mesos 0.25.0
Spark 1.5.2
Reporter: Lei Xu
Priority: Critical
We found a problem when tearing down a Spark framework on Mesos: the executors
did not exit and were still running.
{code}
root 48548 48539 2 2015 ? 04:28:11 /home/q/java/default/bin/java
-cp
/home/q/mesos/data/slaves/4d0f0fc7-99f4-4a9a-b5d5-6c25affcb4f1-S127/frameworks/20151228-163100-504125962-5050-31081-0016/executors/3/runs/ca324f08-5be9-4457-a2a7-56f2605d6027/spark-1.5.2-bin-2.2.0/conf/:/home/q/mesos/data/slaves/4d0f0fc7-99f4-4a9a-b5d5-6c25affcb4f1-S127/frameworks/20151228-163100-504125962-5050-31081-0016/executors/3/runs/ca324f08-5be9-4457-a2a7-56f2605d6027/spark-1.5.2-bin-2.2.0/lib/spark-assembly-1.5.2-hadoop2.2.0.jar
-Xms8192m -Xmx8192m org.apache.spark.executor.CoarseGrainedExecutorBackend
--driver-url
akka.tcp://[email protected]:47938/user/CoarseGrainedScheduler
--executor-id 4d0f0fc7-99f4-4a9a-b5d5-6c25affcb4f1-S127/3 --hostname
l-qosslave26.ops.cn2.qunar.com --cores 2 --app-id
20151228-163100-504125962-5050-31081-0016
root 48644 48348 0 2015 ? 00:00:00 sh -c cd spark-1*;
./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend
--driver-url
akka.tcp://[email protected]:47938/user/CoarseGrainedScheduler
--executor-id 4d0f0fc7-99f4-4a9a-b5d5-6c25affcb4f1-S127/5 --hostname
l-qosslave26.ops.cn2.qunar.com --cores 2 --app-id
20151228-163100-504125962-5050-31081-0016
root 48645 48644 2 2015 ? 04:28:45 /home/q/java/default/bin/java
-cp
/home/q/mesos/data/slaves/4d0f0fc7-99f4-4a9a-b5d5-6c25affcb4f1-S127/frameworks/20151228-163100-504125962-5050-31081-0016/executors/5/runs/851073c4-d225-426b-b1b5-3d294eb76f8e/spark-1.5.2-bin-2.2.0/conf/:/home/q/mesos/data/slaves/4d0f0fc7-99f4-4a9a-b5d5-6c25affcb4f1-S127/frameworks/20151228-163100-504125962-5050-31081-0016/executors/5/runs/851073c4-d225-426b-b1b5-3d294eb76f8e/spark-1.5.2-bin-2.2.0/lib/spark-assembly-1.5.2-hadoop2.2.0.jar
-Xms8192m -Xmx8192m org.apache.spark.executor.CoarseGrainedExecutorBackend
--driver-url
akka.tcp://[email protected]:47938/user/CoarseGrainedScheduler
--executor-id 4d0f0fc7-99f4-4a9a-b5d5-6c25affcb4f1-S127/5 --hostname
l-qosslave26.ops.cn2.qunar.com --cores 2 --app-id
20151228-163100-504125962-5050-31081-0016
{code}
This framework {{20151228-163100-504125962-5050-31081-0016}} was already torn
down a few days ago and can no longer be found on the "Frameworks" page of the
web UI. But on the slave page, I found that it is still registered with the
slave node and still runs some executors.
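One way to confirm this mismatch is to compare the master's and the slave's state output. The following is a minimal sketch, not Mesos code. It assumes both state documents (for example from the {{state.json}} endpoints) have already been fetched and parsed into dicts, and that each has a top-level {{frameworks}} list whose entries carry an {{id}} and, on the slave, an {{executors}} list. The function name and the sample data are hypothetical.

```python
def find_orphaned_frameworks(master_state, slave_state):
    """Return {framework_id: [executor_id, ...]} for frameworks that the
    slave still lists but the master no longer reports as registered."""
    # Assumption: registered frameworks appear under the "frameworks" key.
    known_ids = {fw["id"] for fw in master_state.get("frameworks", [])}
    orphans = {}
    for fw in slave_state.get("frameworks", []):
        if fw["id"] not in known_ids:
            orphans[fw["id"]] = [ex["id"] for ex in fw.get("executors", [])]
    return orphans


# Hand-written sample data shaped like the assumed state documents.
sample_master = {"frameworks": []}
sample_slave = {
    "frameworks": [
        {
            "id": "20151228-163100-504125962-5050-31081-0016",
            "executors": [{"id": "3"}, {"id": "5"}],
        }
    ]
}

orphaned = find_orphaned_frameworks(sample_master, sample_slave)
for framework_id, executor_ids in orphaned.items():
    print(framework_id, "still has executors:", ", ".join(executor_ids))
```

A non-empty result means the slave still tracks a framework that the master has already forgotten, which matches what the slave page shows here.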
When I tried to use the REST API to kill the framework again, it returned {{No
framework found with specified ID}}.
Finally, I killed the Spark task and the Mesos executor. No new task was
started by the framework, but the framework is still present on this slave and
does not exit.
{code}
Frameworks
ID User Name Active Tasks CPUs (Used / Allocated) Mem (Used / Allocated)
…5050-31081-0016 root wireless-m_invocation_kylin 0 / 0.6 / 192 MB
Executors
ID Name Source Active Tasks Queued Tasks CPUs (Used / Allocated) Mem (Used / Allocated)
5 Command Executor (Task: 5) (Command: sh -c 'cd spark-1*;...') 5 0 0 / 0.1 / 32 MB Sandbox
4 Command Executor (Task: 4) (Command: sh -c 'cd spark-1*;...') 4 0 0 / 0.1 / 32 MB Sandbox
3 Command Executor (Task: 3) (Command: sh -c 'cd spark-1*;...') 3 0 0 / 0.1 / 32 MB Sandbox
2 Command Executor (Task: 2) (Command: sh -c 'cd spark-1*;...') 2 0 0 / 0.1 / 32 MB Sandbox
1 Command Executor (Task: 1) (Command: sh -c 'cd spark-1*;...') 1 0 0 / 0.1 / 32 MB Sandbox
0 Command Executor (Task: 0) (Command: sh -c 'cd spark-1*;...') 0 0 0 / 0.1 / 32 MB Sandbox
{code}