[
https://issues.apache.org/jira/browse/MESOS-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589070#comment-13589070
]
Vinod Kone commented on MESOS-365:
----------------------------------
OK. So this is what happened.
1) master1 was the leading master while the slave with slaveId1 was running.
2) slave was restarted (for an upgrade).
3) Before master1 realized that the slave had exited, it sent a launch task
message.
4) The new slave came back up, but before it could register with the master, it
received the launch task message (intended for the old slave) because the pid
didn't change. It proceeded to launch the executor/task despite not having a
slave id!
5) Slave then registered with master1 and got slaveId2.
6) Master1 then crashed (due to an unrelated bug) and a new master (master2)
got elected.
7) Slave re-registered with master2 with slaveId2 and the task it launched.
8) At this point master2 has a task (in its 'frameworks' struct) whose
task.slave_id() is slaveId1, but it knows nothing about slaveId1 (its 'slaves'
map contains only slaveId2).
9) The framework now issues a killTask() and the master fails the CHECK
(above) because task.slave_id(), which is slaveId1, is not in 'slaves'.
Clearly, quite a few stars aligned to make this happen! Yay for distributed
systems! Probable fixes forthcoming.
> Master check failure.
> ---------------------
>
> Key: MESOS-365
> URL: https://issues.apache.org/jira/browse/MESOS-365
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Priority: Critical
>
> In a test cluster under scale testing, during a roll of the masters, one of
> the newly elected masters failed with this:
> I0227 23:50:48.406574 1584 master.cpp:822] Asked to kill task
> 1362008747374-wickman-seizure-4-933a8193-96b1-411f-9392-3e4bd2cda6f0 of
> framework 201103282247-0000000019-0000
> F0227 23:50:48.406697 1584 master.cpp:830] Check failed: slave != NULL
> *** Check failure stack trace: ***
> @ 0x7fb439418e6d google::LogMessage::Fail()
> @ 0x7fb43941ead7 google::LogMessage::SendToLog()
> @ 0x7fb43941a71c google::LogMessage::Flush()
> @ 0x7fb43941a986 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fb43908b176 mesos::internal::master::Master::killTask()
> @ 0x7fb4390c4645 ProtobufProcess<>::handler2<>()
> @ 0x7fb439090b27 std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7fb4390c5b6b ProtobufProcess<>::visit()
> @ 0x7fb4392e2624 process::MessageEvent::visit()
> @ 0x7fb4392d68cd process::ProcessManager::resume()
> @ 0x7fb4392d7118 process::schedule()
> @ 0x7fb4389f573d start_thread
> @ 0x7fb4373d9f6d clone
> Looks like this CHECK is too aggressive, as it's possible for a newly rolled
> master to not have all of the slaves registered yet?