[ https://issues.apache.org/jira/browse/MESOS-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589070#comment-13589070 ]

Vinod Kone commented on MESOS-365:
----------------------------------

OK. So this is what happened.

1) master1 was the leading master while the slave with slaveId1 was running.

2) slave was restarted (for an upgrade).

3) Before master1 realized that the slave had exited, it sent a launch task 
message.

4) The new slave came back up, but before it could register with the master, it 
received the launch task message (intended for the old slave) because the pid 
didn't change. It proceeded to launch the executor/task despite not having a 
slave id!

5) Slave then registered with master1 and got slaveId2.

6) Master1 then crashed (due to an un-related bug) and a new master (master2) 
got elected.

7) Slave re-registered with master2 with slaveId2 and the task it launched.

8) At this point master2 has a task (in its 'frameworks' struct) whose 
task.slave_id() is slaveId1, but it knows nothing about slaveId1 (its 'slaves' 
map contains only slaveId2).

9) The framework now issues a killTask() and the master fails the CHECK (above) 
because task.slave_id(), which is slaveId1, is not in 'slaves'.

Clearly, quite a few stars had to align to make this happen! Yay for 
distributed systems! Probable fixes forthcoming.

                
> Master check failure.
> ---------------------
>
>                 Key: MESOS-365
>                 URL: https://issues.apache.org/jira/browse/MESOS-365
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Critical
>
> In a test cluster under scale testing, during a roll of the masters, one of 
> the newly elected masters failed with this:
> I0227 23:50:48.406574  1584 master.cpp:822] Asked to kill task 
> 1362008747374-wickman-seizure-4-933a8193-96b1-411f-9392-3e4bd2cda6f0 of 
> framework 201103282247-0000000019-0000
> F0227 23:50:48.406697  1584 master.cpp:830] Check failed: slave != NULL 
> *** Check failure stack trace: ***
>     @     0x7fb439418e6d  google::LogMessage::Fail()
>     @     0x7fb43941ead7  google::LogMessage::SendToLog()
>     @     0x7fb43941a71c  google::LogMessage::Flush()
>     @     0x7fb43941a986  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fb43908b176  mesos::internal::master::Master::killTask()
>     @     0x7fb4390c4645  ProtobufProcess<>::handler2<>()
>     @     0x7fb439090b27  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7fb4390c5b6b  ProtobufProcess<>::visit()
>     @     0x7fb4392e2624  process::MessageEvent::visit()
>     @     0x7fb4392d68cd  process::ProcessManager::resume()
>     @     0x7fb4392d7118  process::schedule()
>     @     0x7fb4389f573d  start_thread
>     @     0x7fb4373d9f6d  clone
> Looks like this CHECK is too aggressive, as it's possible for a newly rolled 
> master to not have all of the slaves re-registered yet?
