A. Dukhovniy created MESOS-7752:
-----------------------------------
Summary: Command executor still active after terminal task state
update.
Key: MESOS-7752
URL: https://issues.apache.org/jira/browse/MESOS-7752
Project: Mesos
Issue Type: Bug
Affects Versions: 1.3.0
Reporter: A. Dukhovniy
Here is a rather simple scenario to reproduce this error:
* Framework starts a task with taskId = _task1_
* Framework kills _task1_ *successfully* and *acknowledges* TASK_KILLED
* Framework starts another task with the same taskId _task1_ but receives
"_TASK_FAILED (Attempted to run multiple tasks using a "command" executor)_"
*Note*: the scenario is racy, so the failure occurs only occasionally.
*Here is a full log* that shows the life cycle of the task id
_app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c_:
{code:java}
# Starting...
WARN [10:51:14 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
10:51:14.476085 14666 master.cpp:3352] Authorizing framework principal
'principal' to launch task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
WARN [10:51:14 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
10:51:14.510136 14666 master.cpp:4426] Launching task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 (marathon) at
[email protected]:61567 with resources...
WARN [10:51:14 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
10:51:14.513908 14697 slave.cpp:2118] Queued task
'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
for executor
'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
10:51:15.011696 14671 master.cpp:6222] Forwarding status update TASK_RUNNING
(UUID: ed2d0475-9d83-4e09-9f54-5b4d323e4558) for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
10:51:15.036391 14671 master.cpp:5092] Processing ACKNOWLEDGE call
ed2d0475-9d83-4e09-9f54-5b4d323e4558 for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 (marathon) at
[email protected]:61567 on agent
76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0
{code}
{code:java}
# Killing...
DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] WARN [10:51:15
KillAction$] Killing known task
[app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c]
of instance instance
[app-restart-resident-app-with-five-instances.marathon-8882bd16-5fdd-11e7-a00e-0242aceef95c]
WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
10:51:15.196702 14697 slave.cpp:3816] Handling status update TASK_KILLED (UUID:
f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 from
executor(1)@172.16.10.121:35184
WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
10:51:15.197676 14697 slave.cpp:4166] Sending acknowledgement for status update
TASK_KILLED (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 to
executor(1)@172.16.10.121:35184
WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
10:51:15.198299 14671 master.cpp:6154] Status update TASK_KILLED (UUID:
f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 from agent
76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0 at slave(1)@172.16.10.121:32788
(172.16.10.121)
DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] INFO [10:51:15
MarathonScheduler] Received status update for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c:
TASK_KILLED (Command terminated with signal Terminated)
WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
10:51:15.216081 14671 master.cpp:5092] Processing ACKNOWLEDGE call
f7e9d0bc-726c-43aa-9ddc-3b082a68642e for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 (marathon) at
[email protected]:61567 on agent
76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0
WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
10:51:15.216107 14671 master.cpp:8396] Removing task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
with resources...
WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
10:51:15.216667 14697 status_update_manager.cpp:395] Received status update
acknowledgement (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
10:51:15.216722 14697 status_update_manager.cpp:832] Checkpointing ACK for
status update TASK_KILLED (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
{code}
{code:java}
# and starting again:
WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
10:51:15.247561 14671 master.cpp:3352] Authorizing framework principal
'principal' to launch task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
10:51:15.252348 14697 slave.cpp:1625] Got assigned task
'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
for framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
10:51:15.252707 14697 slave.cpp:1785] Launching task
'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
for framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
10:51:15.253159 14697 slave.cpp:2140] Queued task
'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
for executor
'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 at
executor(1)@172.16.10.121:35184
DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] INFO [10:51:15
MarathonScheduler] Received status update for task
app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c:
TASK_FAILED (Attempted to run multiple tasks using a "command" executor)
{code}
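One detail worth noting in the excerpts above: the relaunched task is queued for an executor at executor(1)@172.16.10.121:35184, the same endpoint that delivered TASK_KILLED a few milliseconds earlier. A minimal Python sketch for checking this in a captured log follows. The sample lines are abbreviated copies of the agent log lines above, and the helper name executor_endpoints is illustrative, not part of Mesos or Marathon.
{code:python}
import re

# Abbreviated copies of three agent log lines from the excerpts above
# (task and framework IDs shortened to "..." for readability).
LOG = """\
10:51:15.196702 14697 slave.cpp:3816] Handling status update TASK_KILLED (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task ... from executor(1)@172.16.10.121:35184
10:51:15.197676 14697 slave.cpp:4166] Sending acknowledgement for status update TASK_KILLED (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task ... to executor(1)@172.16.10.121:35184
10:51:15.253159 14697 slave.cpp:2140] Queued task '...' for executor '...' of framework ... at executor(1)@172.16.10.121:35184
"""

# Matches endpoints such as executor(1)@172.16.10.121:35184.
ENDPOINT = re.compile(r"executor\(\d+\)@[\d.]+:\d+")


def executor_endpoints(log_text):
    """Return (timestamp, endpoint) pairs for lines that name an executor endpoint."""
    pairs = []
    for line in log_text.splitlines():
        match = ENDPOINT.search(line)
        if match:
            pairs.append((line.split()[0], match.group(0)))
    return pairs


pairs = executor_endpoints(LOG)
for timestamp, endpoint in pairs:
    print(timestamp, endpoint)

# A single distinct endpoint means the relaunch was routed to the old executor.
same_executor = len({endpoint for _, endpoint in pairs}) == 1
print("same executor for kill and relaunch:", same_executor)
{code}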
*TL;DR*: the framework receives and acknowledges the _TASK_KILLED_ status but
fails to restart the task; the relaunch is rejected with _"Attempted to run
multiple tasks using a "command" executor"_
Though reusing task Ids is discouraged
{code:java}
/**
* A framework-generated ID to distinguish a task. The ID must remain
* unique while the task is active. A framework can reuse an ID _only_
* if the previous task with the same ID has reached a terminal state
* (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.). However,
* reusing task IDs is strongly discouraged (MESOS-2198).
*/{code}
it is acceptable after receiving a terminal task status, which is what happened above.
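The rule quoted above can be modelled as a small bookkeeping check. The sketch below is an illustration only: the class TaskIdRegistry and its methods are hypothetical, and the set of terminal states is limited to the ones named in the comment plus TASK_FAILED and TASK_ERROR, so it may be incomplete for a given Mesos version.
{code:python}
# Terminal states named in the comment above, plus TASK_FAILED and TASK_ERROR.
# Assumption: this set is illustrative and may be incomplete for a given Mesos version.
TERMINAL_STATES = {
    "TASK_FINISHED",
    "TASK_FAILED",
    "TASK_KILLED",
    "TASK_LOST",
    "TASK_ERROR",
}


class TaskIdRegistry:
    """Hypothetical framework-side record of the last known state per task ID."""

    def __init__(self):
        self._last_state = {}

    def record(self, task_id, state):
        """Remember the most recent status update received for task_id."""
        self._last_state[task_id] = state

    def may_launch(self, task_id):
        """An ID may be used if it is unknown or its previous task is terminal."""
        state = self._last_state.get(task_id)
        return state is None or state in TERMINAL_STATES


registry = TaskIdRegistry()
print(registry.may_launch("task1"))  # ID never used before
registry.record("task1", "TASK_RUNNING")
print(registry.may_launch("task1"))  # previous task still active
registry.record("task1", "TASK_KILLED")
print(registry.may_launch("task1"))  # previous task reached a terminal state
{code}
By this rule the relaunch in the log is legitimate: TASK_KILLED was received and acknowledged before the second launch was attempted.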
*Possible cause*:
I assume that occasionally the executor has not yet been cleaned up and is
reused when the task is restarted. The relaunch then fails here:
https://github.com/apache/mesos/blob/35dd2b600b8af0204d03c4ee5348a1a6672b136c/src/launcher/executor.cpp#L512
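To make the suspected sequence concrete, here is a toy Python model of it. It is not Mesos code: the classes ToyAgent and ToyCommandExecutor, their methods, and the explicit cleanup step are assumptions chosen only to mirror the hypothesis above, namely that the agent still holds the old executor when the relaunch arrives and the command executor refuses a second task.
{code:python}
class ToyCommandExecutor:
    """Toy stand-in for a command executor: accepts exactly one task in its lifetime."""

    def __init__(self):
        self.launched = False

    def launch(self, task_id):
        if self.launched:
            return 'TASK_FAILED (Attempted to run multiple tasks using a "command" executor)'
        self.launched = True
        return "TASK_RUNNING"


class ToyAgent:
    """Toy agent that keeps one executor per executor ID until it is cleaned up."""

    def __init__(self):
        self.executors = {}

    def launch(self, task_id):
        # In the log above the executor ID equals the task ID, so a relaunch with
        # the same ID finds the old executor if it is still registered.
        executor = self.executors.setdefault(task_id, ToyCommandExecutor())
        return executor.launch(task_id)

    def kill(self, task_id):
        # The terminal update is sent right away; executor removal happens later.
        return "TASK_KILLED"

    def cleanup(self, task_id):
        # Assumed asynchronous step: the executor has exited and is removed.
        self.executors.pop(task_id, None)


def restart(cleanup_before_relaunch):
    """Start, kill, and relaunch task1, with or without cleanup in between."""
    agent = ToyAgent()
    agent.launch("task1")
    agent.kill("task1")
    if cleanup_before_relaunch:
        agent.cleanup("task1")
    return agent.launch("task1")


print(restart(cleanup_before_relaunch=True))
print(restart(cleanup_before_relaunch=False))
{code}
In this model the outcome depends only on whether cleanup runs before the second launch, which would match the occasional nature of the failure.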
/cc [~tillt]
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)