[
https://issues.apache.org/jira/browse/MESOS-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949656#comment-16949656
]
Yan Xu commented on MESOS-10011:
--------------------------------
{{removeOperation}} is probably called from
[here|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/master/master.cpp#L9245]
because these operations don't have IDs.
{noformat}
I0918 23:15:32.563908 37981 slave.cpp:6285] Forwarding status update of
operation with no ID (operation_uuid: d2a369e9-ec7c-4be6-9bdb-8ab1961aa773) for
framework 9ead69cb-63b1-4986-968a-ecd99b7ba95d-2469
{noformat}
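To make the suspected path concrete, here is a minimal, self-contained C++ sketch. It is not the Mesos code: {{Operation}}, {{Agent}}, {{Master}}, {{Outcome}} and {{handleTerminalUpdate}} are hypothetical stand-ins. It assumes that a terminal update for an operation without an ID is not held for a framework acknowledgement and so is removed right away. The unknown-agent branch is where the real master hits the {{slave != nullptr}} check; returning a status there is only one defensive possibility, not the actual fix.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical, heavily simplified stand-ins for the master's bookkeeping.
// None of these types mirror the real Mesos classes.
struct Operation {
  std::string uuid;
  std::string agentId;            // agent ID recorded with the operation
  std::optional<std::string> id;  // set only if the framework supplied an ID
};

struct Agent {
  std::unordered_map<std::string, Operation> operations;  // keyed by UUID
};

enum class Outcome { AwaitingAck, Removed, UnknownAgent };

struct Master {
  std::unordered_map<std::string, Agent> registered;  // keyed by agent ID

  // Assumption: removal needs the agent that owns the operation. With a
  // stale agent ID the lookup fails; the crash in this ticket corresponds
  // to asserting at that point instead of reporting UnknownAgent.
  Outcome removeOperation(const Operation& op) {
    auto it = registered.find(op.agentId);
    if (it == registered.end()) {
      return Outcome::UnknownAgent;
    }
    const std::string uuid = op.uuid;  // copy before erasing
    it->second.operations.erase(uuid);
    return Outcome::Removed;
  }

  // Assumption: a terminal update for an operation with an ID stays
  // tracked until acknowledged; one without an ID is removed immediately.
  Outcome handleTerminalUpdate(const Operation& op) {
    if (op.id.has_value()) {
      return Outcome::AwaitingAck;
    }
    return removeOperation(op);
  }
};
```

In this model, an update that carries the agent's old ID after re-registration under a new ID yields {{Outcome::UnknownAgent}} rather than aborting the process.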
> Operation feedback with stale agent ID crashes the master
> ---------------------------------------------------------
>
> Key: MESOS-10011
> URL: https://issues.apache.org/jira/browse/MESOS-10011
> Project: Mesos
> Issue Type: Bug
> Components: agent, master
> Affects Versions: 1.9.0
> Reporter: Yan Xu
> Priority: Critical
>
> We have observed the following in our environment.
> {noformat}
> F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr
> f664c4a9-d1ca-4cd0-88e4-0a6acf20e629-S218
> *** Check failure stack trace: ***
> @ 0x7fd36ca9cf4d google::LogMessage::Fail()
> @ 0x7fd36ca9f13d google::LogMessage::SendToLog()
> @ 0x7fd36ca9ca87 google::LogMessage::Flush()
> @ 0x7fd36ca9fbc9 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fd36b5ae3bc mesos::internal::master::Master::removeOperation()
> @ 0x7fd36b5b3446 mesos::internal::master::Master::updateOperationStatus()
> {noformat}
> This follows registration of an agent that has changed its agent ID due to
> losing its local state.
> The check failure code is in
> [Master::removeOperation|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/master/master.cpp#L12451].
> The masters would enter a crash loop unless the operation checkpoint state
> (i.e., {{resources_and_operations.state}}) on the offending agent is deleted.
> Even though we try to minimize the cases where an agent would lose its
> state, it can still happen when the {{latest}} symlink is removed either by
> an operator or automatically [in certain
> cases|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/slave.cpp#L7719-L7725].
--
This message was sent by Atlassian Jira
(v8.3.4#803005)