Yan Xu commented on MESOS-10011:

In our environment we only use old-style RESERVE/CREATE persistent volumes, and 
a plausible case is that the scheduler fails to acknowledge the operation 
feedback, so the checkpointed update still exists with the original agent ID in 

After the agent loses its state, because 
 lives outside the 
 state, the unacked operations don't get cleaned up and now have the stale 
agent ID in them.
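
To illustrate the failure mode, here is a minimal sketch of a master-side lookup that tolerates an update carrying an agent ID the master no longer knows. This is not the Mesos code and not a proposed patch: `ToyMaster`, `Agent`, `OperationUpdate`, and `handleUpdate` are hypothetical names invented for the example. It only shows the difference between asserting that the agent exists and dropping the update when it does not.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the Mesos types.
struct Agent {
  std::string id;
};

struct OperationUpdate {
  std::string operationUuid;
  std::string agentId;  // Agent ID recorded when the update was checkpointed.
};

class ToyMaster {
public:
  void registerAgent(const std::string& id) { agents_[id] = Agent{id}; }

  // Returns true if the update refers to a known agent and was processed.
  // An update with an unknown (for example, stale) agent ID is logged and
  // dropped instead of tripping a fatal check.
  bool handleUpdate(const OperationUpdate& update) {
    auto it = agents_.find(update.agentId);
    if (it == agents_.end()) {
      std::cerr << "Dropping update for operation " << update.operationUuid
                << ": unknown agent " << update.agentId << std::endl;
      return false;
    }
    // A real master would update or remove the operation here.
    return true;
  }

private:
  std::map<std::string, Agent> agents_;
};
```

In this toy model, an update checkpointed under the old agent ID is rejected after the agent re-registers under a new ID, while an update carrying the new ID is accepted.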

> Operation feedback with stale agent ID crashes the master
> ---------------------------------------------------------
>                 Key: MESOS-10011
>                 URL: https://issues.apache.org/jira/browse/MESOS-10011
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.9.0
>            Reporter: Yan Xu
>            Priority: Critical
> We have observed the following in our environment.
> {noformat}
> F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr f664c4a9-d1ca-4cd0-88e4-0a6acf20e629-S218
> *** Check failure stack trace: ***
>     @     0x7fd36ca9cf4d  google::LogMessage::Fail()
>     @     0x7fd36ca9f13d  google::LogMessage::SendToLog()
>     @     0x7fd36ca9ca87  google::LogMessage::Flush()
>     @     0x7fd36ca9fbc9  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fd36b5ae3bc  mesos::internal::master::Master::removeOperation()
>     @     0x7fd36b5b3446  mesos::internal::master::Master::updateOperationStatus()
> {noformat}
> This follows registration of an agent that has changed its agent ID due to 
> losing its local state.
> The check failure code is in 
> [Master::removeOperation|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/master/master.cpp#L12451].
> The masters would enter a crash loop unless the operation checkpoint state 
> (i.e., {{resources_and_operations.state}}) on the offending agent is deleted.
> Even though we try to minimize the cases where an agent would lose its 
> state, it can still happen when the {{latest}} symlink is removed either by 
> an operator or automatically [in certain 
> cases|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/slave.cpp#L7719-L7725].
