[ https://issues.apache.org/jira/browse/MESOS-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231541#comment-17231541 ]
Andrei Sekretenko edited comment on MESOS-10194 at 11/13/20, 2:55 PM:
----------------------------------------------------------------------

[~Jerome Soussens] Two terminal status updates should not be a problem. Calling `recoverResources()` for a terminal status update is guarded by a check that the task is transitioning to a terminal or unreachable state from a non-terminal reachable state:
https://github.com/apache/mesos/blob/9c20fc4d93710e2314d8974ef37d56a33a5ad884/src/master/master.cpp#L11203

However, there is a very suspicious line in one of your logs:
{noformat}
I1102 11:40:04.218107 6496 master.cpp:11092] Updating the state of task 083d7291-b8d7-418a-b3ed-233e06184040 of framework b98761e9-2e84-4971-b678-13b6619b18e1 (latest state: TASK_KILLED, status update state: TASK_KILLING)
{noformat}
This does not look like an immediate cause of the crash that occurs after that (the role does not match), but it looks concerning. This line means that, from the master's point of view, this task transitioned *from* a terminal *into* a non-terminal state. Given that each transition from a reachable non-terminal state into a terminal or unreachable state should be accompanied by a call to `recoverResources()`, this means that the master probably performs double untracking of task resources in some cases.

I'm not sure why exactly a TASK_KILLING update occurs for a task that, from the master's point of view, is already in the TASK_KILLED state. This might be caused by some agent bug, but I think that when the agent reregisters, this might occur in the absence of any bugs on the agent side.

Note that the master will store a task in the TASK_KILLED state until the TASK_KILLED status update is acknowledged by the scheduler or the framework is torn down.


was (Author: asekretenko):
[~Jerome Soussens] Two terminal status updates should not be a problem.
Calling `recoverResources()` for a terminal status update is guarded by a check that the task is transitioning to a terminal or unreachable state from a non-terminal reachable state:
https://github.com/apache/mesos/blob/9c20fc4d93710e2314d8974ef37d56a33a5ad884/src/master/master.cpp#L11203

However, there is a very suspicious line in one of your logs:
{noformat}
I1102 11:40:04.218107 6496 master.cpp:11092] Updating the state of task 083d7291-b8d7-418a-b3ed-233e06184040 of framework b98761e9-2e84-4971-b678-13b6619b18e1 (latest state: TASK_KILLED, status update state: TASK_KILLING)
{noformat}
This does not look like an immediate cause of the crash that occurs after that (the role does not match), but it looks concerning. This line means that, from the master's point of view, this task transitioned *from* a terminal *into* a non-terminal state. Given that each transition from a reachable non-terminal state into a terminal or unreachable state should be accompanied by a call to `recoverResources()`, this means that the master probably performs double untracking of task resources in some cases.

I'm not sure why exactly a TASK_KILLING update occurs for a task that, from the master's point of view, is already in the TASK_KILLED state. This might be caused by some agent bug, but I think that when the agent reregisters, this might occur in the absence of any bugs on the agent side.

Note that the master will store a task in the TASK_KILLED state until the TASK_KILLED status update is acknowledged by the scheduler; this means that killing tasks and tearing the scheduler down simultaneously probably exacerbates the issue (i.e. increases the probability that the TASK_KILLING that "follows" TASK_KILLED arrives before the task is actually removed from the master).

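To make the suspected double untracking concrete, here is a toy Python model of the guard described in the comment. It is only a sketch and not the Mesos C++ code: `ToyMaster`, `update_task`, `replay` and the `recover_calls` counter are hypothetical names, and the model assumes that the stored task state is overwritten by every incoming update, which is what the log line above suggests.

```python
from enum import Enum


class TaskState(Enum):
    TASK_RUNNING = "TASK_RUNNING"
    TASK_KILLING = "TASK_KILLING"
    TASK_KILLED = "TASK_KILLED"
    TASK_UNREACHABLE = "TASK_UNREACHABLE"


# Simplified classification; real Mesos has more task states than this.
TERMINAL = {TaskState.TASK_KILLED}
UNREACHABLE = {TaskState.TASK_UNREACHABLE}


def is_reachable_non_terminal(state):
    return state not in TERMINAL and state not in UNREACHABLE


class ToyMaster:
    """Hypothetical stand-in for the master's per-task bookkeeping."""

    def __init__(self):
        self.state = TaskState.TASK_RUNNING
        self.recover_calls = 0  # counts simulated recoverResources() calls

    def update_task(self, new_state):
        # The guard from the comment: recover resources only when leaving a
        # reachable non-terminal state for a terminal or unreachable one.
        if is_reachable_non_terminal(self.state) and not is_reachable_non_terminal(new_state):
            self.recover_calls += 1
        # Assumption: nothing prevents a terminal -> non-terminal overwrite.
        self.state = new_state


def replay(updates):
    """Apply a sequence of status updates and return the recover count."""
    master = ToyMaster()
    for update in updates:
        master.update_task(update)
    return master.recover_calls
```

In this model, two consecutive TASK_KILLED updates recover resources once, because the guard rejects the second one. The sequence TASK_KILLED, TASK_KILLING, TASK_KILLED recovers twice, because the intermediate TASK_KILLING puts the task back into a non-terminal state and re-arms the guard.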
> Mesos master failure "Check failed: 'get_(role)' Must be SOME"
> --------------------------------------------------------------
>
>                 Key: MESOS-10194
>                 URL: https://issues.apache.org/jira/browse/MESOS-10194
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.10.0, 1.11.0
>            Reporter: Jerome Soussens
>            Assignee: Andrei Sekretenko
>            Priority: Critical
>         Attachments: log_mesos_crash_role_13102020.txt, mesos_scalars_at_slaveId_crash.log
>
>
> *Impact*: mesos-master crashes with the log:
> {code:java}
> hierarchical.cpp:460] Check failed: 'get_(role)' Must be SOME
> {code}
> *Possible scenario:*
> A framework using a specific role is stopped. At more or less the same time, some remaining task status updates for this framework arrive at the master from the executor, but the role is no longer listed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)