[
https://issues.apache.org/jira/browse/MESOS-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231541#comment-17231541
]
Andrei Sekretenko commented on MESOS-10194:
-------------------------------------------
[~Jerome Soussens] Two terminal status updates should not be a problem.
Calling `recoverResources()` for a terminal status update is guarded by a check
that the task is transitioning to a terminal or unreachable state from a
non-terminal reachable state:
https://github.com/apache/mesos/blob/9c20fc4d93710e2314d8974ef37d56a33a5ad884/src/master/master.cpp#L11203
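To make the condition concrete, here is a minimal sketch of that kind of guard. It is not the actual code from master.cpp: the `TaskState` enum is a simplified stand-in, and `shouldRecoverResources()` is a hypothetical helper name used only for illustration.

```cpp
// Simplified stand-in for the task states involved; not the real mesos::TaskState.
enum class TaskState { RUNNING, KILLING, KILLED, FINISHED, UNREACHABLE };

bool isTerminal(TaskState s) {
  return s == TaskState::KILLED || s == TaskState::FINISHED;
}

bool isUnreachable(TaskState s) { return s == TaskState::UNREACHABLE; }

// Hypothetical helper: true only for a transition from a non-terminal,
// reachable state into a terminal or unreachable one.
bool shouldRecoverResources(TaskState from, TaskState to) {
  const bool wasLive = !isTerminal(from) && !isUnreachable(from);
  const bool nowGone = isTerminal(to) || isUnreachable(to);
  return wasLive && nowGone;
}
```

Under this model, a second terminal update (for example TASK_KILLED followed by TASK_KILLED) fails the first half of the check, so resources are not recovered twice.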
However, there is a very suspicious line in one of your logs:
{noformat}
I1102 11:40:04.218107 6496 master.cpp:11092] Updating the state of task
083d7291-b8d7-418a-b3ed-233e06184040 of framework
b98761e9-2e84-4971-b678-13b6619b18e1 (latest state: TASK_KILLED, status update
state: TASK_KILLING)
{noformat}
This does not look like the immediate cause of the crash that follows
(the role does not match), but it is concerning.
This line means that, from the master's point of view, this task transitioned
*from* a terminal *into* a non-terminal state.
Given that each transition from a reachable non-terminal state into terminal or
unreachable should be accompanied by a call to `recoverResources()`, this means
that the master probably performs double untracking of task resources in some cases.
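The suspected sequence can be illustrated with a toy model. This is only a sketch of the hypothesis, not Mesos code: `countRecoveries()` is a hypothetical function, the enum is reduced to three states, and it assumes the stored task state is overwritten by every incoming update.

```cpp
#include <vector>

// Reduced stand-in for the task states; not the real mesos::TaskState.
enum class TaskState { RUNNING, KILLING, KILLED };

bool isTerminal(TaskState s) { return s == TaskState::KILLED; }

// Toy model: counts how many times resources would be recovered for a task
// that starts in RUNNING and then receives the given status updates in order.
int countRecoveries(const std::vector<TaskState>& updates) {
  TaskState stored = TaskState::RUNNING;
  int recoveries = 0;
  for (TaskState update : updates) {
    // Guard: recover only on a non-terminal -> terminal transition.
    if (!isTerminal(stored) && isTerminal(update)) {
      ++recoveries;
    }
    // Assumption: the stored state is overwritten unconditionally, so a
    // late TASK_KILLING can move the task back out of a terminal state.
    stored = update;
  }
  return recoveries;
}
```

In this model, the sequence TASK_KILLED, TASK_KILLING, TASK_KILLED passes the guard twice, whereas TASK_KILLED followed directly by TASK_KILLED passes it only once.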
I'm not sure why exactly a TASK_KILLING update occurs for a task that, from the
master's point of view, is already in the TASK_KILLED state.
This might be caused by some agent bug, but I think that when the agent
reregisters, this might occur in the absence of any bugs on the agent side.
Note that the master will store a task in a TASK_KILLED state until the
TASK_KILLED status update is acknowledged by the scheduler; this means that
killing tasks and tearing the scheduler down simultaneously probably
exacerbates the issue (i.e. increases the probability that the TASK_KILLING
that "follows" TASK_KILLED arrives before the task is actually removed from the
master).
> Mesos master failure "Check failed: 'get_(role)' Must be SOME"
> --------------------------------------------------------------
>
> Key: MESOS-10194
> URL: https://issues.apache.org/jira/browse/MESOS-10194
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.10.0, 1.11.0
> Reporter: Jerome Soussens
> Assignee: Andrei Sekretenko
> Priority: Critical
> Attachments: log_mesos_crash_role_13102020.txt,
> mesos_scalars_at_slaveId_crash.log
>
>
>
> *Impact* : mesos-master crash with log :
> {code:java}
> hierarchical.cpp:460] Check failed: 'get_(role)' Must be SOME
> {code}
> *Possible scenario :*
> A framework, using a specific role, is stopped. More or less at the same
> time, some remaining task status for this framework comes to the master from
> the executor. But the role is no longer listed.
>