Thanks Alex, it is clear to me now :-)

Regards,
Qian Zhang



From:   Alex Rukletsov <a...@mesosphere.com>
To:     dev <dev@mesos.apache.org>
Date:   09/19/2015 21:12
Subject:        Re: Why do we need slave_id in Kill message



Inlined.

On Sat, Sep 19, 2015 at 2:23 PM, Qian AZ Zhang <zhang...@cn.ibm.com> wrote:

> Thanks Alex for your answer!
>
> I checked the code of Master::_reconcileTasks(). It looks like when the
> task is unknown to the master and the framework specifies the SlaveID of
> the task, this method checks whether the specified slave is already
> registered and, if so, sends TASK_LOST to the framework immediately.
> However, if the framework does not specify the SlaveID, the master can
> send TASK_LOST only once there are no transitioning slaves left (this
> can take minutes). So specifying the SlaveID can actually speed up
> reconciliation; this should be the point, right?
>
Correct.

>
>
> But I have a comment: this design actually requires the framework to be
> cooperative, i.e., the framework needs to ensure that the SlaveID it
> specifies is correct. If it specifies the wrong one for some reason, the
> master will still send TASK_LOST to the framework immediately, which may
> not be the correct behavior, since the task may still be running on the
> correct slave, which has not reregistered yet.
>
I think about it this way. If a framework sends erroneous input, it may get
an arbitrary answer. If the pair <SlaveID1, TaskID> does not exist, the master
may react with TASK_LOST even if there exists a pair <SlaveID2, TaskID>.
Moreover, I would advocate that the master _must_ react with TASK_LOST in
this case, to enable the framework to detect corrupted state as early as
possible.

Generally, we "protect against Murphy and not against Machiavelli" and
assume frameworks collaborate.
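
To make the cases discussed above concrete, here is a minimal Python sketch
of the decision the master takes for a single task in an explicit
reconciliation request. It is an illustration of the behaviour described in
this thread, not the actual Mesos code: MasterState, reconcile_task and the
string results are hypothetical names, and pending tasks are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class MasterState:
    """Hypothetical, simplified view of what a master knows after failover."""
    # task_id -> slave_id, rebuilt as slaves re-register.
    tasks: Dict[str, str] = field(default_factory=dict)
    # Slaves that have already (re)registered with this master.
    registered_slaves: Set[str] = field(default_factory=set)
    # Slaves known from the registry that have not re-registered yet.
    transitioning_slaves: Set[str] = field(default_factory=set)


def reconcile_task(master: MasterState,
                   task_id: str,
                   slave_id: Optional[str] = None) -> Optional[str]:
    """Return the update to send now, or None if the master stays silent."""
    if task_id in master.tasks:
        # Task is known: answer with its latest state.
        return "LATEST_STATE"

    if slave_id is not None:
        if slave_id in master.registered_slaves:
            # The slave is registered but does not have the task.
            return "TASK_LOST"
        if slave_id in master.transitioning_slaves:
            # The slave may still re-register and report the task.
            return None
        # The slave is unknown to the master.
        return "TASK_LOST"

    # No SlaveID given: any transitioning slave could still own the task,
    # so the master has to wait until none are left.
    if master.transitioning_slaves:
        return None
    return "TASK_LOST"
```

With "s1" registered and "s2" still transitioning, a request for an unknown
task with SlaveID "s1" is answered with TASK_LOST right away, while the same
request with SlaveID "s2" or with no SlaveID at all gets no answer yet. That
is also why a wrong <SlaveID1, TaskID> pair yields an immediate TASK_LOST
even if the task is really running on "s2".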

>
>
>
> Regards,
> Qian Zhang
>
>
> From: Alex Rukletsov <a...@mesosphere.com>
> To: dev <dev@mesos.apache.org>
> Date: 09/16/2015 00:25
> Subject: Re: Why do we need slave_id in Kill message
> ------------------------------
>
>
>
> The last comment (the one you cite) comes from a person whose questions
> and answers you'd better verify : ). Having said that, let me try to
> answer your initial question.
>
> The master does not always know the SlaveID for each task. Imagine a
> master failover. In the registry there is just the list of slaves that
> were connected before the previous master crashed. The mapping
> TaskID -> SlaveID is restored during slave re-registration. If a
> framework specifies the SlaveID of the task it wants to reconcile or
> kill, the master can check whether the corresponding slave has already
> reregistered and, if so, execute the request immediately. If the
> SlaveID is unknown, the master cannot really execute the request until
> all slaves reregister.
>
> Regarding the Reconcile message: check this
> commit: f95fa119044c9a11c8473ab088e948e7e1c1334d. It looks like we should
> update the reconciliation doc [1]. Jan Schlicht, is it something you have
> cycles for?
>
> [1] https://mesos.apache.org/documentation/latest/reconciliation/
>
>
> On Tue, Sep 15, 2015 at 4:35 PM, Qian AZ Zhang <zhang...@cn.ibm.com>
> wrote:
>
> > Thanks Alex. I checked the comments in MESOS-1127, and based on the
> > last comment (see below), it seems the question is still open ...
> > > For me it looks like we can deduce SlaveID from TaskID modulo we have
> > > to wait for transitionary slaves. If this is the case, providing just
> > > TaskID in Kill and Reconcile requests simplifies framework design and
> > > allows us to get rid of validating requests in master against
> > > {{SlaveID}} mismatch. Does this make sense?
> >
> >
> > BTW, I checked the "Reconcile" message (see below) in scheduler.proto
> > and found that a field "statuses" is mentioned in its comments. However,
> > I do not see such a field in the "Reconcile" message, so I think the
> > comments might not be correct; should it actually be the "tasks" field?
> >   // Allows the scheduler to query the status for non-terminal tasks.
> >   // This causes the master to send back the latest task status for
> >   // each task in 'tasks', if possible. Tasks that are no longer known
> >   // will result in a TASK_LOST update. *If 'statuses' is empty*, then
> >   // the master will send the latest status for each task currently
> >   // known.
> >   message Reconcile {
> >    // TODO(vinod): Support arbitrary queries than just state of tasks.
> >     message Task {
> >       required TaskID task_id = 1;
> >       optional AgentID agent_id = 2;
> >     }
> >
> >     repeated Task tasks = 1;
> >   }
> >
> >
> > Regards,
> > Qian Zhang
> >
> >
> > From: Alex Rukletsov <a...@mesosphere.com>
> > To: dev <dev@mesos.apache.org>
> > Date: 09/15/2015 20:52
> > Subject: Re: Why do we need slave_id in Kill message
> > ------------------------------
> >
> >
> >
> > I asked the same question some time ago and got a good explanation from
> > Ben Mahler. Take a look at the last comments in MESOS-1127
> > <https://issues.apache.org/jira/browse/MESOS-1127> and maybe even the
> > comments in the review requests.
> >
> > Since the same question has come up (at least) for the second time, maybe
> > it makes sense to persist the answer somewhere (a comment in the
> > protobuf).
> >
> > On Tue, Sep 15, 2015 at 11:55 AM, Klaus Ma <kl...@cguru.net> wrote:
> >
> > > I think this slave_id is used for a status sync-up/double check. The
> > > master checks whether the specified slave_id is equal to the task's
> > > slave id; if not, the master logs a message and ignores the kill
> > > request.
> > >
> > >
> > > On 2015-09-15 17:46, Qian AZ Zhang wrote:
> > >
> > >> Hi,
> > >>
> > >> In the Kill message (scheduler.proto), I found there is a slave_id field:
> > >>    message Kill {
> > >>      required TaskID task_id = 1;
> > >>      optional SlaveID slave_id = 2;
> > >>    }
> > >>
> > >> I am just wondering in which case a framework needs to specify this
> > >> field when it kills a task. I think the master should know the slave id
> > >> of each task; can we just use the info in the master?
> > >>
> > >>
> > >> Regards,
> > >> Qian Zhang
> > >>
> > >
> > > --
> > > Klaus Ma (马达), PMP® | http://www.cguru.net
> > >
> > >
> >
> >
>
>
