[jira] [Commented] (MESOS-10194) Mesos master failure "Check failed: 'get_(role)' Must be SOME"

Andrei Sekretenko (Jira) Fri, 13 Nov 2020 06:50:23 -0800


    [ 
https://issues.apache.org/jira/browse/MESOS-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231541#comment-17231541
 ]


Andrei Sekretenko commented on MESOS-10194:
-------------------------------------------

[~Jerome Soussens] Two terminal status updates should not be a problem. 
Calling `recoverResources()` for a terminal status update is guarded by a check 
that the task is transitioning to a terminal or unreachable state from a 
non-terminal reachable state:
https://github.com/apache/mesos/blob/9c20fc4d93710e2314d8974ef37d56a33a5ad884/src/master/master.cpp#L11203
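
To illustrate why a repeated terminal update is harmless, here is a minimal 
sketch of such a guard. It is not the actual code in master.cpp: the reduced 
set of states and the names `isTerminal`, `isTerminalOrUnreachable` and 
`shouldRecoverResources` are hypothetical and exist only for this example.

```cpp
// Simplified model of the guard; NOT the actual Mesos implementation.
// Only a subset of the task states is modelled here.
enum class TaskState {
  TASK_RUNNING,
  TASK_KILLING,
  TASK_KILLED,
  TASK_FINISHED,
  TASK_FAILED,
  TASK_UNREACHABLE
};

// Hypothetical helper: terminal states in this reduced model.
constexpr bool isTerminal(TaskState s) {
  return s == TaskState::TASK_KILLED ||
         s == TaskState::TASK_FINISHED ||
         s == TaskState::TASK_FAILED;
}

// Hypothetical helper: states in which resources are no longer tracked.
constexpr bool isTerminalOrUnreachable(TaskState s) {
  return isTerminal(s) || s == TaskState::TASK_UNREACHABLE;
}

// True only when the task leaves a non-terminal reachable state for a
// terminal or unreachable one, i.e. when resources should be recovered.
constexpr bool shouldRecoverResources(TaskState latest, TaskState update) {
  return !isTerminalOrUnreachable(latest) && isTerminalOrUnreachable(update);
}

// First terminal update: resources are recovered.
static_assert(
    shouldRecoverResources(TaskState::TASK_RUNNING, TaskState::TASK_KILLED),
    "first terminal update recovers resources");

// Second terminal update for the same task: the guard rejects it.
static_assert(
    !shouldRecoverResources(TaskState::TASK_KILLED, TaskState::TASK_KILLED),
    "repeated terminal update does not recover resources again");
```

Under this model, a second terminal status update never passes the guard, 
because the stored state is already terminal.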

However, there is a very suspicious line in one of your logs:
{noformat}
I1102 11:40:04.218107  6496 master.cpp:11092] Updating the state of task 
083d7291-b8d7-418a-b3ed-233e06184040 of framework 
b98761e9-2e84-4971-b678-13b6619b18e1 (latest state: TASK_KILLED, status update 
state: TASK_KILLING)
{noformat}
This does not look like an immediate cause of the crash that occurs after that 
(the role does not match), but looks concerning.

This line means that, from the master's point of view, this task transitioned 
*from* a terminal *into* a non-terminal state.
Given that each transition from a reachable non-terminal state into a terminal 
or unreachable one should be accompanied by a call to `recoverResources()`, 
this means that the master probably untracks the task's resources twice in 
some cases.
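
The suspected sequence can be replayed in a small model. This is only a sketch 
under one explicit assumption: that the stored state is overwritten with 
whatever state the incoming update carries. The function `countRecoverCalls` 
and the three-state enum are hypothetical; the code needs C++14 or later.

```cpp
#include <cstddef>

// Reduced model with only the three states seen in the log line.
enum class TaskState { TASK_RUNNING, TASK_KILLING, TASK_KILLED };

constexpr bool isTerminal(TaskState s) {
  return s == TaskState::TASK_KILLED;
}

// Replays a sequence of status updates and counts how often the
// "non-terminal -> terminal" guard would fire.
constexpr int countRecoverCalls(
    TaskState initial, const TaskState* updates, std::size_t n) {
  int calls = 0;
  TaskState latest = initial;
  for (std::size_t i = 0; i < n; ++i) {
    if (!isTerminal(latest) && isTerminal(updates[i])) {
      ++calls;  // Stands in for a call to recoverResources().
    }
    // ASSUMPTION of this sketch: the stored state is overwritten
    // unconditionally, even when going from terminal to non-terminal.
    latest = updates[i];
  }
  return calls;
}

// Sequence suggested by the log: KILLED, then a late KILLING, then KILLED.
constexpr TaskState kObserved[] = {
    TaskState::TASK_KILLED,
    TaskState::TASK_KILLING,
    TaskState::TASK_KILLED};

static_assert(
    countRecoverCalls(TaskState::TASK_RUNNING, kObserved, 3) == 2,
    "resources are untracked twice in this model");
```

In this model the guard fires twice for a single task, once per TASK_KILLED 
update, because the late TASK_KILLING update resets the stored state to a 
non-terminal one in between.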

I'm not sure why exactly a TASK_KILLING update occurs for a task that, from the 
master's point of view, is already in the TASK_KILLED state. 
This might be caused by some agent bug, but I think that when the agent 
reregisters, this might occur in the absence of any bugs on the agent side.

Note that the master will store a task in a TASK_KILLED state until the 
TASK_KILLED status update is acknowledged by the scheduler; this means that 
killing tasks and tearing the scheduler down simultaneously probably 
exacerbates the issue (i.e. increases the probability that the TASK_KILLING 
that "follows" TASK_KILLED arrives before the task is actually removed from the 
master).


> Mesos master failure "Check failed: 'get_(role)' Must be SOME"
> --------------------------------------------------------------
>
>                 Key: MESOS-10194
>                 URL: https://issues.apache.org/jira/browse/MESOS-10194
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.10.0, 1.11.0
>            Reporter: Jerome Soussens
>            Assignee: Andrei Sekretenko
>            Priority: Critical
>         Attachments: log_mesos_crash_role_13102020.txt, 
> mesos_scalars_at_slaveId_crash.log
>
>
>  
> *Impact* : mesos-master crash with log :
> {code:java}
> hierarchical.cpp:460] Check failed: 'get_(role)' Must be SOME
> {code}
> *Possible scenario :*
> A framework, using a specific role, is stopped. More or less at the same 
> time, some remaining task status for this framework comes to the master from 
> the executor. But the role is no longer listed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
