Thanks Alex, it is clear to me now :-)

Regards,
Qian Zhang

From: Alex Rukletsov <a...@mesosphere.com>
To: dev <dev@mesos.apache.org>
Date: 09/19/2015 21:12
Subject: Re: Why do we need slave_id in Kill message

Inlined.

On Sat, Sep 19, 2015 at 2:23 PM, Qian AZ Zhang <zhang...@cn.ibm.com> wrote:

> Thanks Alex for your answer!
>
> I checked the code of Master::_reconcileTasks(). It looks like, when the
> task is unknown to the master and the framework specifies the SlaveID of
> the task, this method checks whether the specified slave is already
> registered; if so, it sends TASK_LOST to the framework immediately.
> However, if the framework does not specify the SlaveID, the master can
> send TASK_LOST to the framework only once there are no transitioning
> slaves left (this can take minutes). So specifying the SlaveID can
> actually speed up the reconciliation; this should be the point, right?
>

Correct.

> But I have a comment: in this design, we actually require the framework to
> be cooperative, i.e., the framework needs to ensure the SlaveID it
> specifies is correct. Otherwise, if it specifies the wrong one for some
> reason, the master will still send TASK_LOST to the framework immediately,
> which may not be the correct behavior, since the task may still be running
> on the correct slave, which has not reregistered yet.
>

I think about it this way. If a framework sends erroneous input, it may get
an arbitrary answer. If the pair <SlaveID1, TaskID> does not exist, the
master may react with TASK_LOST even if there exists a pair
<SlaveID2, TaskID>. Moreover, I would advocate that the master _must_ react
with TASK_LOST in this case, to enable the framework to detect corrupted
state as early as possible. Generally, we "protect against Murphy and not
against Machiavelli" and assume frameworks collaborate.

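The reconciliation behavior discussed above can be sketched roughly as follows. This is a simplified Python model and not the actual Master::_reconcileTasks() code: the function name `reconcile_unknown_task` and the `registered` and `transitioning` sets are hypothetical, and treating a specified but unregistered slave the same as an unspecified one is an assumption made for illustration.

```python
from typing import Optional, Set

TASK_LOST = "TASK_LOST"


def reconcile_unknown_task(slave_id: Optional[str],
                           registered: Set[str],
                           transitioning: Set[str]) -> Optional[str]:
    """Decide how to answer a reconcile request for a task the master
    does not know. Returns TASK_LOST if the master can answer right
    away, or None if it has to wait (hypothetical model).
    """
    # Fast path: the framework named a slave that has already
    # (re)registered, so the master can answer immediately.
    if slave_id is not None and slave_id in registered:
        return TASK_LOST

    # Otherwise the task might live on a slave that has not
    # reregistered yet, so the master cannot answer for now.
    if transitioning:
        return None

    # No transitioning slaves are left: the task is really unknown.
    return TASK_LOST


# A wrong SlaveID gets an immediate TASK_LOST even though the task may
# still be running on slave "S2", which has not reregistered yet.
print(reconcile_unknown_task("S1", {"S1"}, {"S2"}))
```

In this model the last call also illustrates the "erroneous input" case: the framework receives TASK_LOST at once and can detect that its own state is inconsistent.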
>
> Regards,
> Qian Zhang
>
> From: Alex Rukletsov <a...@mesosphere.com>
> To: dev <dev@mesos.apache.org>
> Date: 09/16/2015 00:25
> Subject: Re: Why do we need slave_id in Kill message
> ------------------------------
>
> The last comment (the one you cite) comes from a person whose questions
> and answers you'd better verify : ). Having said that, let me try to answer
> your initial question.
>
> The master does not always know the SlaveID for each task. Imagine a master
> failover. In the registry there is just the list of slaves that were
> connected before the previous master crashed. The mapping TaskID -> SlaveID
> is restored during slave re-registration. If a framework specifies the
> SlaveID of the task it wants to reconcile or kill, the master can check if
> the corresponding slave has already reregistered and if so, execute the
> request immediately. If the SlaveID is unknown, the master cannot really
> execute the request until all slaves reregister.
>
> Regarding the Reconcile message: check this
> commit: f95fa119044c9a11c8473ab088e948e7e1c1334d. It looks like we should
> update the reconciliation doc [1]. Jan Schlicht, is it something you have
> cycles for?
>
> [1] https://mesos.apache.org/documentation/latest/reconciliation/
>
> On Tue, Sep 15, 2015 at 4:35 PM, Qian AZ Zhang <zhang...@cn.ibm.com> wrote:
>
> > Thanks Alex. I checked the comments in MESOS-1127, and based on the
> > last comment (see below), it seems the question is still open ...
> >
> > For me it looks like we can deduce SlaveID from TaskID modulo we have
> > to wait for transitionary slaves. If this is the case, providing just
> > TaskID in Kill and Reconcile requests simplifies framework design and
> > allows us to get rid of validating requests in master against {{SlaveID}}
> > mismatch. Does this make sense?
> >
> > BTW, I checked the "Reconcile" message (see below) in scheduler.proto and
> > found that a field "statuses" is mentioned in its comments. However, I do
> > not see such a field in the "Reconcile" message, so I think the comments
> > might not be correct. Should it actually be the "tasks" field?
> >
> > // Allows the scheduler to query the status for non-terminal tasks.
> > // This causes the master to send back the latest task status for
> > // each task in 'tasks', if possible. Tasks that are no longer known
> > // will result in a TASK_LOST update. *If 'statuses' is empty*, then
> > // the master will send the latest status for each task currently
> > // known.
> > message Reconcile {
> >   // TODO(vinod): Support arbitrary queries than just state of tasks.
> >   message Task {
> >     required TaskID task_id = 1;
> >     optional AgentID agent_id = 2;
> >   }
> >
> >   repeated Task tasks = 1;
> > }
> >
> > Regards,
> > Qian Zhang
> >
> > From: Alex Rukletsov <a...@mesosphere.com>
> > To: dev <dev@mesos.apache.org>
> > Date: 09/15/2015 20:52
> > Subject: Re: Why do we need slave_id in Kill message
> > ------------------------------
> >
> > I asked the same question some time ago and got a good explanation from
> > Ben Mahler. Take a look at the last comments in MESOS-1127
> > <https://issues.apache.org/jira/browse/MESOS-1127> and maybe even the
> > comments in review requests.
> >
> > Since the same question comes up (at least) for the second time, maybe it
> > makes sense to persist the answer somewhere (a comment in the protobuf).
> >
> > On Tue, Sep 15, 2015 at 11:55 AM, Klaus Ma <kl...@cguru.net> wrote:
> >
> > > I think this slave_id is used for status sync up/double check. The
> > > master checks whether the specified slave_id is equal to the task's
> > > slave id; if they are not equal, the master logs a message and ignores
> > > the kill request.
> > >
> > > On 2015-09-15 17:46, Qian AZ Zhang wrote:
> > >
> > >> Hi,
> > >>
> > >> In the Kill message (scheduler.proto), I found there is a slave_id field:
> > >>
> > >> message Kill {
> > >>   required TaskID task_id = 1;
> > >>   optional SlaveID slave_id = 2;
> > >> }
> > >>
> > >> I am just wondering in which case the framework needs to specify this
> > >> field when it kills a task. I think the master should know the slave id
> > >> of each task, so can we just use the info in the master?
> > >>
> > >> Regards,
> > >> Qian Zhang
> > >
> > > --
> > > Klaus Ma (马达), PMP® | http://www.cguru.net
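
The slave_id check that Klaus describes for the Kill message could look roughly like the sketch below. It is a hypothetical Python model, not Mesos code: the function `handle_kill`, the `task_to_slave` mapping and the returned strings are names invented for illustration.

```python
from typing import Dict, Optional


def handle_kill(task_to_slave: Dict[str, str],
                task_id: str,
                slave_id: Optional[str] = None) -> str:
    """Model of the master handling a Kill request (hypothetical)."""
    actual = task_to_slave.get(task_id)

    # The master has no mapping for this task, for example right after
    # a failover while slaves are still reregistering.
    if actual is None:
        return "unknown-task"

    # The framework supplied a slave_id that does not match the
    # master's view: log and ignore, as described in the thread.
    if slave_id is not None and slave_id != actual:
        return "ignored"

    # Either no slave_id was given or it matches: forward the kill.
    return "kill"


print(handle_kill({"task-1": "S1"}, "task-1", "S2"))
```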