Yan Xu created MESOS-10011:

             Summary: Operation feedback with stale agent ID crashes the master
                 Key: MESOS-10011
                 URL: https://issues.apache.org/jira/browse/MESOS-10011
             Project: Mesos
          Issue Type: Bug
          Components: agent, master
    Affects Versions: 1.9.0
            Reporter: Yan Xu

We have observed the following in our environment.
F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr 
*** Check failure stack trace: ***
    @     0x7fd36ca9cf4d  google::LogMessage::Fail()
    @     0x7fd36ca9f13d  google::LogMessage::SendToLog()
    @     0x7fd36ca9ca87  google::LogMessage::Flush()
    @     0x7fd36ca9fbc9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fd36b5ae3bc  mesos::internal::master::Master::removeOperation()
    @     0x7fd36b5b3446  
This follows registration of an agent that has changed its agent ID due to 
losing its local state.
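
To illustrate the failure mode, here is a minimal, self-contained sketch. It is not Mesos code: `ToyMaster`, `Operation`, and the agent IDs are hypothetical stand-ins. It only assumes what the trace above suggests, namely that removing an operation looks up the agent by the ID carried in the feedback, and that the lookup finds nothing once the agent has re-registered under a new ID. The sketch reports the stale ID to the caller instead of asserting, which is one possible way to tolerate such feedback and not necessarily the right fix for the master.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for an operation tracked by the master.
struct Operation {
  std::string uuid;
  std::string agentId;  // Agent ID recorded when the operation was created.
};

// Hypothetical toy model of the master's agent bookkeeping.
class ToyMaster {
public:
  // Registers an agent with an empty set of operations.
  void registerAgent(const std::string& agentId) { agents_[agentId]; }

  // Tracks an operation; returns false if the agent is unknown.
  bool addOperation(const Operation& op) {
    auto it = agents_.find(op.agentId);
    if (it == agents_.end()) {
      return false;
    }
    it->second.insert(op.uuid);
    return true;
  }

  // Removes an operation. An unconditional check that the agent exists
  // would abort the process here; this sketch logs and returns false.
  bool removeOperation(const Operation& op) {
    auto it = agents_.find(op.agentId);
    if (it == agents_.end()) {
      std::cerr << "Ignoring operation " << op.uuid
                << " for unknown agent " << op.agentId << std::endl;
      return false;
    }
    return it->second.erase(op.uuid) > 0;
  }

private:
  // Agent ID -> UUIDs of the operations tracked for that agent.
  std::map<std::string, std::set<std::string>> agents_;
};
```

In this toy model, feedback that names an agent ID the master has never registered makes `removeOperation` return `false` rather than terminate the process.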

The check failure code is in 

The masters would enter a crash loop unless the operation checkpoint state 
(i.e., {{resources_and_operations.state}}) on the offending agent is deleted.

Even though we try to minimize the cases where an agent would lose its state, 
it can still happen when the {{latest}} symlink is removed either by an 
operator or automatically [in certain 

This message was sent by Atlassian Jira
