[ https://issues.apache.org/jira/browse/MESOS-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373400#comment-16373400 ]
Chun-Hung Hsiao commented on MESOS-8601:
----------------------------------------

The framework is added after the {{Slave}} struct is created because it uses the task list in the structure to build its own task list. Seems we need to find a way to notify subscribers without getting {{FrameworkInfo}} from {{getFramework()}}.

> Master crashes during slave reregistration after failover.
> ----------------------------------------------------------
>
>                 Key: MESOS-8601
>                 URL: https://issues.apache.org/jira/browse/MESOS-8601
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.5.0
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Blocker
>              Labels: master
>
> The following happened after a master failover.
> During slave reregistration, new tasks were added and the new leading master
> notified all of its subscribers, and triggered the following check failure:
> {noformat}
> F0222 15:53:44.440387 2805 master.cpp:11190] Check failed: 'framework' Must
> be non NULL
> *** Check failure stack trace: ***
> @ 0x7f1357be521d google::LogMessage::Fail()
> @ 0x7f1357be704d google::LogMessage::SendToLog()
> @ 0x7f1357be4e0c google::LogMessage::Flush()
> @ 0x7f1357be7949 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f1356c80e2d google::CheckNotNull<>()
> @ 0x7f1356ce2666 mesos::internal::master::Master::Subscribers::send()
> @ 0x7f1356cece83 mesos::internal::master::Slave::addTask()
> @ 0x7f1356cf3206 mesos::internal::master::Slave::Slave()
> @ 0x7f1356cf5b90 mesos::internal::master::Master::__reregisterSlave()
> @ 0x7f1356d02cf8 mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f1357b43761 process::ProcessBase::consume()
> @ 0x7f1357b5248c process::ProcessManager::resume()
> @ 0x7f1357b579f6
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7f1354e6c230 (unknown)
> @ 0x7f135468ae25 start_thread
> @ 0x7f13543b834d __clone
> {noformat}
> This was because the master tried to get the framework info when sending the
> notification:
> https://github.com/apache/mesos/blob/1.5.x/src/master/master.cpp#L11190
> But it added the framework after that:
> https://github.com/apache/mesos/blob/1.5.x/src/master/master.cpp#L6963

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
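
To make the ordering problem in the comment concrete, here is a minimal sketch of one possible direction: handing the framework info to the notification call instead of looking it up. All types and names below ({{Master}}, {{notifyByLookup}}, {{notifyWithInfo}}, the {{events}} vector) are simplified, hypothetical stand-ins, not the real Mesos classes, and this is not necessarily the fix that was adopted.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-ins; NOT the real Mesos types.
struct FrameworkInfo { std::string name; };
struct Task { std::string id; std::string frameworkId; };

struct Master {
  std::map<std::string, FrameworkInfo> frameworks;
  std::vector<std::string> events;  // stands in for subscriber notifications

  // Returns nullptr when the framework has not been added yet.
  const FrameworkInfo* getFramework(const std::string& id) const {
    auto it = frameworks.find(id);
    return it == frameworks.end() ? nullptr : &it->second;
  }

  // Order-dependent variant: only works once the framework is registered.
  // The crash in the report corresponds to a fatal check at this point;
  // this sketch returns false instead of aborting.
  bool notifyByLookup(const Task& task) {
    const FrameworkInfo* framework = getFramework(task.frameworkId);
    if (framework == nullptr) {
      return false;
    }
    events.push_back(task.id + "@" + framework->name);
    return true;
  }

  // Order-independent variant: the caller supplies the framework info,
  // so no lookup is needed while the slave's task list is being rebuilt.
  void notifyWithInfo(const Task& task, const FrameworkInfo& info) {
    events.push_back(task.id + "@" + info.name);
  }
};
```

In this sketch, {{notifyByLookup}} fails for a task whose framework is not yet in the map, which mirrors the failing sequence in the stack trace, while {{notifyWithInfo}} works regardless of when the framework is added.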