Re: Why do we need slave_id in Kill message

Qian AZ Zhang Sat, 19 Sep 2015 05:25:14 -0700

Thanks Alex for your answer!

I checked the code of Master::_reconcileTasks(). It looks like, when the
task is unknown to the master and the framework specifies the SlaveID of
the task, this method checks whether the specified slave is already
registered and, if so, sends TASK_LOST to the framework immediately.
However, if the framework does not specify the SlaveID, the master can send
TASK_LOST only once there are no transitioning slaves left (this can take
minutes). So specifying the SlaveID can actually speed up reconciliation.
That should be the point, right?
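
To make the decision described above concrete, here is a minimal sketch in
Python. It is a simplified model of my reading of the logic, not the actual
Master::_reconcileTasks() code. The function name, the set arguments and the
last "slave unknown altogether" branch are my own assumptions for
illustration.

```python
from typing import Optional, Set


def reconcile_unknown_task(
    slave_id: Optional[str],
    registered: Set[str],
    transitioning: Set[str],
) -> Optional[str]:
    """Reply for a task the master does not know (hypothetical model).

    Returns "TASK_LOST" if the master can answer right away, or None if
    it has to stay silent and wait for slaves to reregister.
    """
    if slave_id is not None:
        if slave_id in registered:
            # The slave is back and did not report the task: answer now.
            return "TASK_LOST"
        if slave_id in transitioning:
            # The slave may still reregister and bring the task with it.
            return None
        # Assumption: a slave the master has never heard of cannot
        # bring the task back, so the task is reported as lost.
        return "TASK_LOST"
    # No SlaveID given: any transitioning slave might own the task.
    if transitioning:
        return None
    return "TASK_LOST"
```

With a SlaveID that is already registered, the answer does not depend on how
many other slaves are still transitioning. Without it, the master has to wait
for all of them.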


But I have a comment: this design actually requires the framework to be
cooperative, i.e., the framework needs to ensure that the SlaveID it
specifies is correct. If it specifies the wrong SlaveID for some reason, the
master will still send TASK_LOST to the framework immediately, which may not
be the correct behavior, since the task may still be running on the correct
slave, which has not reregistered yet.
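
A small, hypothetical Python scenario showing the concern (the function and
variable names are made up for illustration, and this is not Mesos code):
task t1 really runs on slave s2, which has not reregistered yet, while s1 is
already registered.

```python
from typing import Dict, Optional, Set


def reply_for(
    task_id: str,
    claimed_slave: str,
    registered: Set[str],
    known_tasks: Dict[str, str],
) -> Optional[str]:
    """Master-side reply for a reconcile request (hypothetical model)."""
    if task_id in known_tasks:
        # Known task: return its latest state.
        return known_tasks[task_id]
    # Unknown task: the master relies on the SlaveID from the framework.
    return "TASK_LOST" if claimed_slave in registered else None


registered = {"s1"}               # s2 has not reregistered yet
known_tasks: Dict[str, str] = {}  # s1 reported no tasks

# Correct SlaveID: the master stays silent and waits for s2.
correct = reply_for("t1", "s2", registered, known_tasks)  # None

# Wrong SlaveID: the master answers TASK_LOST although t1 may be running.
wrong = reply_for("t1", "s1", registered, known_tasks)  # "TASK_LOST"
```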


Regards,
Qian Zhang



From:   Alex Rukletsov <a...@mesosphere.com>
To:     dev <dev@mesos.apache.org>
Date:   09/16/2015 00:25
Subject:        Re: Why do we need slave_id in Kill message



The last comment (the one you cite) comes from a person, whose questions
and answers you'd better verify : ). Having said that, let me try to answer
your initial question.

The master does not always know the SlaveID for each task. Imagine a master
failover. In the registry there is just a list of the slaves connected before
the previous master crashed. The mapping TaskID -> SlaveID is restored
during slave re-registration. If a framework specifies the SlaveID of the
task it wants to reconcile or kill, the master can check if the
corresponding slave has already reregistered and if so, execute the request
immediately. If the SlaveID is unknown, the master cannot really execute
the request until all slaves reregister.
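
A rough sketch of that failover situation in Python, as a toy model only. The
class and method names are hypothetical and do not correspond to the real
master code.

```python
from typing import Dict, List, Optional, Set


class RecoveringMaster:
    """Toy model of a master after failover (hypothetical, not Mesos code)."""

    def __init__(self, slaves_in_registry: Set[str]) -> None:
        # After failover only the slave IDs are known, not their tasks.
        self.pending: Set[str] = set(slaves_in_registry)
        self.task_to_slave: Dict[str, str] = {}

    def reregister(self, slave_id: str, tasks: List[str]) -> None:
        # A reregistering slave reports its tasks, restoring the mapping.
        self.pending.discard(slave_id)
        for task_id in tasks:
            self.task_to_slave[task_id] = slave_id

    def can_resolve_now(self, task_id: str, slave_id: Optional[str]) -> bool:
        """Can a kill or reconcile request be handled without waiting?"""
        if task_id in self.task_to_slave:
            # The mapping for this task has already been restored.
            return True
        if slave_id is not None:
            # Only the named slave matters: has it reregistered?
            return slave_id not in self.pending
        # No SlaveID: every pending slave could still own the task.
        return not self.pending
```

For example, with slaves s1 and s2 in the registry, a request for an unknown
task that names s1 can be handled as soon as s1 reregisters, while the same
request without a SlaveID has to wait for s2 as well.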

Regarding Reconcile message: check this
commit: f95fa119044c9a11c8473ab088e948e7e1c1334d. It looks like we should
update reconciliation doc [1]. Jan Schlicht, is it something you have
cycles for?

[1] https://mesos.apache.org/documentation/latest/reconciliation/

On Tue, Sep 15, 2015 at 4:35 PM, Qian AZ Zhang <zhang...@cn.ibm.com> wrote:

> Thanks Alex. I checked the comments in MESOS-1127, and based on the
> last comment (see below), it seems the question is still open ...
>
> > For me it looks like we can deduce SlaveID from TaskID modulo we have
> > to wait for transitionary slaves. If this is the case, providing just
> > TaskID in Kill and Reconcile requests simplifies framework design and
> > allows us to get rid of validating requests in master against {{SlaveID}}
> > mismatch. Does this make sense?
>
>
> BTW, I checked the "Reconcile" message (see below) in scheduler.proto and
> found that a field "statuses" is mentioned in its comments. However, I do
> not see such a field in the "Reconcile" message, so I think the comments
> might not be correct. It should actually be the "tasks" field?
>   // Allows the scheduler to query the status for non-terminal tasks.
>   // This causes the master to send back the latest task status for
>   // each task in 'tasks', if possible. Tasks that are no longer known
>   // will result in a TASK_LOST update. *If 'statuses' is empty*, then
>   // the master will send the latest status for each task currently
>   // known.
>   message Reconcile {
>     // TODO(vinod): Support arbitrary queries than just state of tasks.
>     message Task {
>       required TaskID task_id = 1;
>       optional AgentID agent_id = 2;
>     }
>
>     repeated Task tasks = 1;
>   }
>
>
> Regards,
> Qian Zhang
>
>
> From: Alex Rukletsov <a...@mesosphere.com>
> To: dev <dev@mesos.apache.org>
> Date: 09/15/2015 20:52
> Subject: Re: Why do we need slave_id in Kill message
> ------------------------------
>
>
>
> I asked the same question some time ago and got a good explanation from
> Ben Mahler. Take a look at the last comments in MESOS-1127
> <https://issues.apache.org/jira/browse/MESOS-1127> and maybe even the
> comments in the review requests.
>
> Since the same question has come up (at least) for the second time, maybe
> it makes sense to persist the answer somewhere (a comment in the protobuf).
>
> On Tue, Sep 15, 2015 at 11:55 AM, Klaus Ma <kl...@cguru.net> wrote:
>
> > I think this slave_id is used for status sync-up/double-checking. The
> > master checks whether the specified slave_id is equal to the task's
> > slave ID; if not, the master logs a message and ignores the kill request.
> >
> >
> > On 2015-09-15 17:46, Qian AZ Zhang wrote:
> >
> >> Hi,
> >>
> >> In Kill message (scheduler.proto), I found there is a slave_id field:
> >>    message Kill {
> >>      required TaskID task_id = 1;
> >>      optional SlaveID slave_id = 2;
> >>    }
> >>
> >> I am just wondering in which case the framework needs to specify this
> >> field when it kills a task. I think the master should know the slave ID
> >> of each task, so can we just use the info in the master?
> >>
> >>
> >> Regards,
> >> Qian Zhang
> >>
> >
> > --
> > Klaus Ma (马达), PMP® | http://www.cguru.net
> >
> >
>
>
