[
https://issues.apache.org/jira/browse/MESOS-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590038#comment-13590038
]
Benjamin Hindman commented on MESOS-365:
----------------------------------------
--> The slave should reject launch task requests when the slave id in the task
does not match its own id. Note that this also covers the case where the slave
has not yet been assigned an id.
YES!
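A minimal sketch of that check, using simplified stand-in types rather than
the real Mesos protobufs (an empty id doubles as "not registered yet"):

    #include <string>

    struct SlaveID { std::string value; };
    struct TaskInfo { std::string task_id; SlaveID slave_id; };

    // Returns true when the launch request must be rejected: either the
    // slave has no id yet (value is empty before registration succeeds)
    // or the task was intended for a different slave.
    bool shouldRejectTask(const SlaveID& myId, const TaskInfo& task)
    {
      return myId.value.empty() || task.slave_id.value != myId.value;
    }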
--> The above rejection actually entails sending a TASK_LOST to the
master/framework. Since the master might be down when this happens, this has
to be reliable. This should be solved by the StatusUpdateManager, which is
being implemented as part of slave restart.
Eh, I'm not so sure this has to be reliable. If we can guarantee idempotency of
whatever we send the first time, then we don't need to persist this.
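One way to get that idempotency, sketched with a simplified stand-in for the
StatusUpdate protobuf (an illustration of the idea, not the
StatusUpdateManager): derive the update's uuid deterministically from the
task id, so a resend is byte-for-byte the same message and the receiver can
safely drop duplicates.

    #include <string>

    struct StatusUpdate {
      std::string task_id;
      std::string state;  // e.g. "TASK_LOST"
      std::string uuid;   // receivers deduplicate on this
    };

    // Build the TASK_LOST update with a uuid derived from the task id:
    // retrying after a master failover produces the exact same message,
    // so nothing needs to be persisted on the slave side.
    StatusUpdate makeLostUpdate(const std::string& taskId)
    {
      StatusUpdate update;
      update.task_id = taskId;
      update.state = "TASK_LOST";
      update.uuid = "lost-" + taskId;
      return update;
    }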
--> There should also be a check in the master to ensure that the tasks a slave
re-registers with have the correct slave id.
Yes! And I think this can even just be a CHECK.
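As a CHECK it could look roughly like this, using glog's CHECK_EQ (which
Mesos already uses for its assertions); the types here are simplified
stand-ins:

    #include <string>
    #include <vector>
    #include <glog/logging.h>

    struct Task { std::string task_id; std::string slave_id; };

    // On re-registration, assert the invariant rather than handle a
    // mismatch: a task carrying a foreign slave id at this point
    // indicates a bug, not a runtime condition to recover from.
    void validateReregistration(const std::string& slaveId,
                                const std::vector<Task>& tasks)
    {
      for (size_t i = 0; i < tasks.size(); i++) {
        CHECK_EQ(tasks[i].slave_id, slaveId)
          << "Slave re-registered with task " << tasks[i].task_id
          << " belonging to a different slave";
      }
    }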
--> Kinda related to this, the executor should get the slave id from the
environment (similar to how it gets the framework id) when it starts up,
instead of getting it via the registered message (the current solution). I
think Andy already filed a ticket for this. This avoids executors sending
status updates with uninitialized slave ids in them.
Should the executor actually be able to send status updates before it gets
registered?
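For the environment half of this, a sketch of what the executor could do at
startup; the variable name MESOS_SLAVE_ID is an assumption here:

    #include <cstdlib>
    #include <stdexcept>
    #include <string>

    // Read the slave id once at executor startup, before any message
    // exchange, so no status update can ever carry an empty slave id.
    std::string slaveIdFromEnvironment()
    {
      const char* value = std::getenv("MESOS_SLAVE_ID");
      if (value == nullptr || *value == '\0') {
        // Fail fast: an executor launched outside a slave is misconfigured.
        throw std::runtime_error("MESOS_SLAVE_ID is not set");
      }
      return std::string(value);
    }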
> Master check failure.
> ---------------------
>
> Key: MESOS-365
> URL: https://issues.apache.org/jira/browse/MESOS-365
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Assignee: Vinod Kone
> Priority: Critical
>
> In a test cluster under scale testing, during a roll of the masters, one of
> the newly elected masters failed with this:
> I0227 23:50:48.406574 1584 master.cpp:822] Asked to kill task
> 1362008747374-wickman-seizure-4-933a8193-96b1-411f-9392-3e4bd2cda6f0 of
> framework 201103282247-0000000019-0000
> F0227 23:50:48.406697 1584 master.cpp:830] Check failed: slave != NULL
> *** Check failure stack trace: ***
> @ 0x7fb439418e6d google::LogMessage::Fail()
> @ 0x7fb43941ead7 google::LogMessage::SendToLog()
> @ 0x7fb43941a71c google::LogMessage::Flush()
> @ 0x7fb43941a986 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fb43908b176 mesos::internal::master::Master::killTask()
> @ 0x7fb4390c4645 ProtobufProcess<>::handler2<>()
> @ 0x7fb439090b27 std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7fb4390c5b6b ProtobufProcess<>::visit()
> @ 0x7fb4392e2624 process::MessageEvent::visit()
> @ 0x7fb4392d68cd process::ProcessManager::resume()
> @ 0x7fb4392d7118 process::schedule()
> @ 0x7fb4389f573d start_thread
> @ 0x7fb4373d9f6d clone
> Looks like this CHECK is too aggressive, as it's possible for a newly rolled
> master to not have all of the slaves registered yet?
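For reference, a sketch of the softer handling the report suggests: treat an
unknown slave in killTask as an expected post-failover condition and answer
with TASK_LOST rather than CHECK-failing (simplified names and lookup, not an
actual patch):

    #include <iostream>
    #include <map>
    #include <string>

    struct Slave { std::string id; };

    std::map<std::string, Slave> slaves;  // slaves known to this master

    // A newly elected master may legitimately not know the slave yet,
    // so an unknown slave is reported, not asserted on.
    void killTask(const std::string& slaveId, const std::string& taskId)
    {
      if (slaves.count(slaveId) == 0) {
        std::cout << "Unknown slave " << slaveId << " for kill of task "
                  << taskId << "; replying TASK_LOST" << std::endl;
        return;  // real flow: send a TASK_LOST update to the framework
      }
      // ... forward the kill request to the slave ...
    }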