[jira] [Commented] (MESOS-365) Master check failure.

Benjamin Hindman (JIRA) Thu, 28 Feb 2013 15:25:14 -0800

    [ https://issues.apache.org/jira/browse/MESOS-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590038#comment-13590038 ]


Benjamin Hindman commented on MESOS-365:
----------------------------------------

--> The slave should reject launch task requests when the slave id in the task 
does not match its own id. Note that this also covers the case where the slave 
has not been assigned an id yet.

YES!
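
A minimal sketch of the check agreed on above, assuming simplified stand-in 
types rather than the actual Mesos classes. Task, LaunchDecision and 
checkLaunch are hypothetical names, and an empty string stands for a slave 
that has not been assigned an id yet:

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for a launch request; not the Mesos type.
struct Task {
  std::string taskId;
  std::string slaveId;  // Slave id carried in the launch request.
};

enum class LaunchDecision { ACCEPT, REJECT };

// Decide whether a slave should launch `task`.
// Assumption of this sketch: `ownId` is empty until the slave gets an id.
LaunchDecision checkLaunch(const std::string& ownId, const Task& task) {
  // Reject if the slave has no id yet, or if the ids differ.
  if (ownId.empty() || ownId != task.slaveId) {
    return LaunchDecision::REJECT;
  }
  return LaunchDecision::ACCEPT;
}
```

Checking for the empty id first means a request that itself carries an empty 
slave id is still rejected by a slave that has no id.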


--> The above rejection actually entails sending a TASK_LOST to the 
master/framework. Since the master might be down when this happens, this has to 
be reliable. This should be solved by the StatusUpdateManager, which is being 
implemented as part of slave restart.

Eh, I'm not so sure this has to be reliable. If we can guarantee idempotency of 
whatever we send the first time, then we don't need to persist this.
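
One way to read the idempotency argument: if the receiver treats a repeated 
TASK_LOST for the same task as a no-op, the sender can resend without 
persisting anything. A rough sketch under that assumption; Tracker and 
markLost are hypothetical names, not Mesos code:

```cpp
#include <cassert>
#include <map>
#include <string>

enum class TaskState { RUNNING, LOST };

// Hypothetical receiver-side bookkeeping for task states.
struct Tracker {
  std::map<std::string, TaskState> tasks;
  int lostCount = 0;  // Number of distinct tasks marked lost.

  // Apply a TASK_LOST update. Returns true only if the state changed,
  // so applying the same update twice has the same effect as applying it once.
  bool markLost(const std::string& taskId) {
    auto it = tasks.find(taskId);
    if (it != tasks.end() && it->second == TaskState::LOST) {
      return false;  // Duplicate update: nothing to do.
    }
    tasks[taskId] = TaskState::LOST;
    ++lostCount;
    return true;
  }
};
```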


--> There should also be a check in the master to ensure that the tasks a slave 
re-registers with have the correct slave id.

Yes! And I think this can even just be a CHECK.
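
For illustration, the master-side condition could be a simple predicate over 
the tasks a slave re-registers with. ReregisteredTask and tasksMatchSlave are 
hypothetical names; in the master the result would presumably be wrapped in a 
glog CHECK, which this sketch leaves out:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of a task reported at re-registration.
struct ReregisteredTask {
  std::string taskId;
  std::string slaveId;
};

// True if every task carries the id of the slave that re-registered.
bool tasksMatchSlave(const std::string& slaveId,
                     const std::vector<ReregisteredTask>& tasks) {
  for (const ReregisteredTask& task : tasks) {
    if (task.slaveId != slaveId) {
      return false;
    }
  }
  return true;
}
```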


--> Kinda related to this, the executor should get the slave id from the 
environment (similar to how it gets the framework id) when it starts up, instead 
of getting it via the registered message (the current solution). I think Andy 
already filed a ticket for this. This avoids executors sending status updates 
with an uninitialized slave id in them.

Should the executor actually be able to send status updates before it gets 
registered?
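
For reference, reading the slave id from the environment at executor startup 
might look like the sketch below. The thread does not name the variable, so 
the caller passes it in; slaveIdFromEnv is a hypothetical helper, and only 
std::getenv is a real library call:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Read a slave id from the environment variable `name`.
// Returns an empty string if the variable is not set.
std::string slaveIdFromEnv(const char* name) {
  const char* value = std::getenv(name);
  return value != nullptr ? std::string(value) : std::string();
}
```

An executor that gets an empty string back would know it has no slave id yet 
and could hold off on sending status updates.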
                
> Master check failure.
> ---------------------
>
>                 Key: MESOS-365
>                 URL: https://issues.apache.org/jira/browse/MESOS-365
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Vinod Kone
>            Priority: Critical
>
> In a test cluster under scale testing, during a roll of the masters, one of 
> the newly elected masters failed with this:
> I0227 23:50:48.406574  1584 master.cpp:822] Asked to kill task 
> 1362008747374-wickman-seizure-4-933a8193-96b1-411f-9392-3e4bd2cda6f0 of 
> framework 201103282247-0000000019-0000
> F0227 23:50:48.406697  1584 master.cpp:830] Check failed: slave != NULL 
> *** Check failure stack trace: ***
>     @     0x7fb439418e6d  google::LogMessage::Fail()
>     @     0x7fb43941ead7  google::LogMessage::SendToLog()
>     @     0x7fb43941a71c  google::LogMessage::Flush()
>     @     0x7fb43941a986  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fb43908b176  mesos::internal::master::Master::killTask()
>     @     0x7fb4390c4645  ProtobufProcess<>::handler2<>()
>     @     0x7fb439090b27  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7fb4390c5b6b  ProtobufProcess<>::visit()
>     @     0x7fb4392e2624  process::MessageEvent::visit()
>     @     0x7fb4392d68cd  process::ProcessManager::resume()
>     @     0x7fb4392d7118  process::schedule()
>     @     0x7fb4389f573d  start_thread
>     @     0x7fb4373d9f6d  clone
> Looks like this CHECK is too aggressive, as it's possible for a newly rolled 
> master to not have all of the slaves registered yet?
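
A sketch of the less aggressive alternative hinted at in the report: treat an 
unknown slave as an expected condition after a master roll instead of a fatal 
check. This is not the fix that was adopted, only an illustration; Slave, 
KillResult and this killTask signature are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-in for the master's view of a slave.
struct Slave {
  std::string id;
};

enum class KillResult { FORWARDED, SLAVE_UNKNOWN };

// Handle a kill request for a task running on `slaveId`.
// A freshly elected master may not know the slave yet, so a missing entry
// is reported to the caller rather than aborting the process.
KillResult killTask(const std::map<std::string, Slave>& slaves,
                    const std::string& slaveId) {
  auto it = slaves.find(slaveId);
  if (it == slaves.end()) {
    return KillResult::SLAVE_UNKNOWN;  // Slave has not (re-)registered yet.
  }
  return KillResult::FORWARDED;  // A real master would message the slave here.
}
```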

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
