Thanks Alex for your answer! I checked the code of Master::_reconcileTasks(). It looks like when the task is unknown to the master and the framework specifies the SlaveID of the task, this method checks whether the specified slave is already registered and, if so, sends TASK_LOST to the framework immediately. However, if the framework does not specify the SlaveID, the master can send TASK_LOST to the framework only once there are no transitioning slaves left (this can take minutes). So specifying the SlaveID can actually speed up the reconciliation. That is the point, right?
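
To make that reading concrete, here is a minimal Python sketch of the decision for a task the master does not know. It is not the actual Master::_reconcileTasks() code: the function name, the string slave IDs, and the two sets are hypothetical, and the handling of a SlaveID that is specified but not registered (treated here like the unspecified case) is an assumption.

```python
from typing import Optional, Set

TASK_LOST = "TASK_LOST"


def reconcile_unknown_task(
    slave_id: Optional[str],
    registered: Set[str],
    transitioning: Set[str],
) -> Optional[str]:
    """Decide what to answer for a task the master does not know.

    Returns TASK_LOST when an answer can be sent right away, or None
    when the master has to wait for slaves to finish reregistering.
    """
    # The framework named a slave that is already registered: the master
    # has that slave's tasks, so the unknown task can be reported lost now.
    if slave_id is not None and slave_id in registered:
        return TASK_LOST
    # Assumption: in every other case the master stays silent while any
    # slave is still transitioning, since the task might live there.
    if transitioning:
        return None
    # No slave is transitioning any more, so the task is really gone.
    return TASK_LOST


registered = {"S1"}
transitioning = {"S2"}
print(reconcile_unknown_task("S1", registered, transitioning))  # TASK_LOST
print(reconcile_unknown_task(None, registered, transitioning))  # None
print(reconcile_unknown_task(None, registered, set()))          # TASK_LOST
```

The first call is the fast path: the answer does not depend on S2 finishing its reregistration.
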
But I have a comment: this design requires the framework to be cooperative, i.e., the framework needs to ensure that the SlaveID it specifies is correct. If it specifies the wrong SlaveID for some reason, the master will still send TASK_LOST to the framework immediately, which may not be the correct behavior, since the task may still be running on the correct slave, which has not reregistered yet.

Regards,
Qian Zhang

From: Alex Rukletsov <a...@mesosphere.com>
To: dev <dev@mesos.apache.org>
Date: 09/16/2015 00:25
Subject: Re: Why do we need slave_id in Kill message

The last comment (the one you cite) comes from a person whose questions and answers you'd better verify : ). Having said that, let me try to answer your initial question.

The master does not always know the SlaveID for each task. Imagine a master failover. In the registry there is just the list of connected slaves before the previous master crashed. The mapping TaskID -> SlaveID is restored during slave re-registration. If a framework specifies the SlaveID of the task it wants to reconcile or kill, the master can check if the corresponding slave has already reregistered and if so, execute the request immediately. If the SlaveID is unknown, the master cannot really execute the request until all slaves reregister.

Regarding Reconcile message: check this commit: f95fa119044c9a11c8473ab088e948e7e1c1334d. It looks like we should update reconciliation doc [1]. Jan Schlicht, is it something you have cycles for?

[1] https://mesos.apache.org/documentation/latest/reconciliation/

On Tue, Sep 15, 2015 at 4:35 PM, Qian AZ Zhang <zhang...@cn.ibm.com> wrote:

> Thanks Alex. I checked the comments in MESOS-1127, and based on the
> last comment (see below), it seems the question is still open ...
>
> For me it looks like we can deduce SlaveID from TaskID modulo we have
> to wait for transitionary slaves.
> If this is the case, providing just TaskID in Kill and Reconcile
> requests simplifies framework design and allows us to get rid of
> validating requests in master against {{SlaveID}} mismatch. Does this
> make sense?
>
>
> BTW, I checked the "Reconcile" message (see below) in scheduler.proto, and
> found a field "statuses" is mentioned in its comments, however, I do not
> see such field in the "Reconcile" message, so I think the comments might
> not be correct, actually it should be "tasks" field?
>
>   // Allows the scheduler to query the status for non-terminal tasks.
>   // This causes the master to send back the latest task status for
>   // each task in 'tasks', if possible. Tasks that are no longer known
>   // will result in a TASK_LOST update. *If 'statuses' is empty*, then
>   // the master will send the latest status for each task currently
>   // known.
>   message Reconcile {
>     // TODO(vinod): Support arbitrary queries than just state of tasks.
>     message Task {
>       required TaskID task_id = 1;
>       optional AgentID agent_id = 2;
>     }
>
>     repeated Task tasks = 1;
>   }
>
>
> Regards,
> Qian Zhang
>
> From: Alex Rukletsov <a...@mesosphere.com>
> To: dev <dev@mesos.apache.org>
> Date: 09/15/2015 20:52
> Subject: Re: Why do we need slave_id in Kill message
> ------------------------------
>
> I asked the same question some time ago and got a good explanation from Ben
> Mahler. Take a look at last comments in MESOS-1127
> <https://issues.apache.org/jira/browse/MESOS-1127> and maybe even comments
> in review requests.
>
> Since the same question comes (at least) for the second time, maybe it
> makes sense to persist the answer somewhere (a comment in the protobuf).
>
> On Tue, Sep 15, 2015 at 11:55 AM, Klaus Ma <kl...@cguru.net> wrote:
>
> > I think this slave_id is used for status sync up/double check. In master,
> > it'll check whether the special slave_id is equal to task's slave id; if
> > not equal, master log message and ignore kill request.
> >
> >
> > On 2015-09-15 17:46, Qian AZ Zhang wrote:
> >
> >> Hi,
> >>
> >> In Kill message (scheduler.proto), I found there is a slave_id field:
> >>
> >>   message Kill {
> >>     required TaskID task_id = 1;
> >>     optional SlaveID slave_id = 2;
> >>   }
> >>
> >> I am just wondering in which case framework needs to specify this field
> >> when it kills a task, I think master should know the slave id of each
> >> task, can we just use the info in master?
> >>
> >>
> >> Regards,
> >> Qian Zhang
> >>
> >
> > --
> > Klaus Ma (马达), PMP® | http://www.cguru.net
> >