Re: Why do we need slave_id in Kill message

Benjamin Mahler Wed, 23 Sep 2015 10:06:13 -0700

This decision also makes it easier for us to explore relaxing the globally
unique requirement on TaskID, instead requiring that <SlaveID, TaskID> be
unique, which makes some problems easier to solve. For example:
https://issues.apache.org/jira/browse/MESOS-3070. Also, it could allow us
to further scale the master if we were to split the master's workload onto
multiple actors / processes down the road. Just some food for thought.
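
To make the relaxed requirement concrete, here is a minimal Python sketch (not Mesos code; the master is written in C++ and structured differently). It assumes a hypothetical TaskIndex class that keys task state by the <SlaveID, TaskID> pair, so the same TaskID can exist on two slaves without colliding:

```python
from typing import Dict, Optional, Tuple

# Hypothetical illustration only: names and structure are assumptions,
# not taken from the Mesos code base.
TaskKey = Tuple[str, str]  # (slave_id, task_id)


class TaskIndex:
    """Tracks task state keyed by (slave_id, task_id) instead of task_id alone."""

    def __init__(self) -> None:
        self._states: Dict[TaskKey, str] = {}

    def add(self, slave_id: str, task_id: str, state: str) -> None:
        key = (slave_id, task_id)
        # Uniqueness is enforced per pair, not globally per task_id.
        if key in self._states:
            raise ValueError(f"duplicate task {task_id!r} on slave {slave_id!r}")
        self._states[key] = state

    def state(self, slave_id: str, task_id: str) -> Optional[str]:
        # Returns None when the pair is unknown.
        return self._states.get((slave_id, task_id))
```

With such a key, a lookup always needs the slave id, which is one more reason for Kill and Reconcile to carry it.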


On Sun, Sep 20, 2015 at 1:15 AM, Qian AZ Zhang <zhang...@cn.ibm.com> wrote:

> Thanks Alex, it is clear to me now :-)
>
> Regards,
> Qian Zhang
>
> From: Alex Rukletsov <a...@mesosphere.com>
> To: dev <dev@mesos.apache.org>
> Date: 09/19/2015 21:12
> Subject: Re: Why do we need slave_id in Kill message
> ------------------------------
>
>
>
> Inlined.
>
> On Sat, Sep 19, 2015 at 2:23 PM, Qian AZ Zhang <zhang...@cn.ibm.com>
> wrote:
>
> > Thanks Alex for your answer!
> >
> > I checked the code of Master::_reconcileTasks(). It looks like, in the
> > case that the task is unknown to the master and the framework specifies
> > the SlaveID of the task, this method checks whether the specified slave
> > is already registered and, if so, sends TASK_LOST to the framework
> > immediately. However, if the framework does not specify the SlaveID, the
> > master can send TASK_LOST to the framework only once there are no
> > transitioning slaves left (this can take minutes). So specifying the
> > SlaveID can actually speed up the reconciliation; that is the point,
> > right?
> >
> Correct.
>
> >
> >
> > But I have a comment: this design actually requires the framework to be
> > cooperative, i.e., the framework needs to ensure that the SlaveID it
> > specifies is correct. If it specifies the wrong one for some reason, the
> > master will still send TASK_LOST to the framework immediately, which may
> > not be the correct behavior, since the task may still be running on the
> > correct slave, which has not reregistered yet.
> >
> I think about it this way. If a framework sends erroneous input, it may get
> an arbitrary answer. If the pair <SlaveID1, TaskID> does not exist, the master
> may react with TASK_LOST even if there exists a pair <SlaveID2, TaskID>.
> Moreover, I would advocate that the master _must_ react with TASK_LOST in
> this case, to enable the framework to detect corrupted state as early as
> possible.
>
> Generally, we "protect against Murphy and not against Machiavelli" and
> assume frameworks collaborate.
>
> >
> >
> >
> > Regards,
> > Qian Zhang
> >
> > From: Alex Rukletsov <a...@mesosphere.com>
> > To: dev <dev@mesos.apache.org>
> > Date: 09/16/2015 00:25
> > Subject: Re: Why do we need slave_id in Kill message
> > ------------------------------
> >
> >
> >
> > The last comment (the one you cite) comes from a person whose questions
> > and answers you'd better verify : ). Having said that, let me try to
> > answer your initial question.
> >
> > The master does not always know the SlaveID for each task. Imagine a
> > master failover. In the registry there is just the list of slaves that
> > were connected before the previous master crashed. The mapping
> > TaskID -> SlaveID is restored during slave re-registration. If a
> > framework specifies the SlaveID of the task it wants to reconcile or
> > kill, the master can check whether the corresponding slave has already
> > reregistered and, if so, execute the request immediately. If the SlaveID
> > is unknown, the master cannot really execute the request until all
> > slaves reregister.
> >
> > Regarding the Reconcile message: check commit
> > f95fa119044c9a11c8473ab088e948e7e1c1334d. It looks like we should
> > update the reconciliation doc [1]. Jan Schlicht, is it something you have
> > cycles for?
> >
> > [1] https://mesos.apache.org/documentation/latest/reconciliation/
> >
> >
> > On Tue, Sep 15, 2015 at 4:35 PM, Qian AZ Zhang <zhang...@cn.ibm.com>
> > wrote:
> >
> > > Thanks Alex. I checked the comments in MESOS-1127, and based on the
> > > last comment (see below), it seems the question is still open ...
> > > > For me it looks like we can deduce SlaveID from TaskID modulo we have
> > > > to wait for transitionary slaves. If this is the case, providing just
> > > > TaskID in Kill and Reconcile requests simplifies framework design and
> > > > allows us to get rid of validating requests in master against
> > > > {{SlaveID}} mismatch. Does this make sense?
> > >
> > >
> > > BTW, I checked the "Reconcile" message (see below) in scheduler.proto
> > > and found that its comments mention a field "statuses". However, I do
> > > not see such a field in the "Reconcile" message, so I think the
> > > comments might not be correct; shouldn't it actually be the "tasks"
> > > field?
> > >   // Allows the scheduler to query the status for non-terminal tasks.
> > >   // This causes the master to send back the latest task status for
> > >   // each task in 'tasks', if possible. Tasks that are no longer known
> > >   // will result in a TASK_LOST update. *If 'statuses' is empty*, then
> > >   // the master will send the latest status for each task currently
> > >   // known.
> > >   message Reconcile {
> > >     // TODO(vinod): Support arbitrary queries than just state of tasks.
> > >     message Task {
> > >       required TaskID task_id = 1;
> > >       optional AgentID agent_id = 2;
> > >     }
> > >
> > >     repeated Task tasks = 1;
> > >   }
> > >
> > >
> > > Regards,
> > > Qian Zhang
> > >
> > > From: Alex Rukletsov <a...@mesosphere.com>
> > > To: dev <dev@mesos.apache.org>
> > > Date: 09/15/2015 20:52
> > > Subject: Re: Why do we need slave_id in Kill message
> > > ------------------------------
> > >
> > >
> > >
> > > I asked the same question some time ago and got a good explanation from
> > > Ben Mahler. Take a look at the last comments in MESOS-1127
> > > <https://issues.apache.org/jira/browse/MESOS-1127> and maybe even the
> > > comments in the review requests.
> > >
> > > Since the same question has come up (at least) for the second time,
> > > maybe it makes sense to persist the answer somewhere (a comment in the
> > > protobuf).
> > >
> > > On Tue, Sep 15, 2015 at 11:55 AM, Klaus Ma <kl...@cguru.net> wrote:
> > >
> > > > I think this slave_id is used for a status sync-up / double check. The
> > > > master checks whether the specified slave_id is equal to the task's
> > > > slave id; if not, the master logs a message and ignores the kill
> > > > request.
> > > >
> > > >
> > > > On 2015-09-15 17:46, Qian AZ Zhang wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> In the Kill message (scheduler.proto), I found there is a slave_id
> > > >> field:
> > > >>    message Kill {
> > > >>      required TaskID task_id = 1;
> > > >>      optional SlaveID slave_id = 2;
> > > >>    }
> > > >>
> > > >> I am just wondering in which case a framework needs to specify this
> > > >> field when it kills a task. I think the master should know the slave
> > > >> id of each task; can we just use the info in the master?
> > > >>
> > > >>
> > > >> Regards,
> > > >> Qian Zhang
> > > >>
> > > >
> > > > --
> > > > Klaus Ma (马达), PMP® | http://www.cguru.net
> > > >
> > > >
> > >
> > >
> >
> >
>
>
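
For readers skimming the thread, the reconciliation behavior described above can be summarized in a short Python sketch. This is not the actual Master::_reconcileTasks() code; the function name, parameters, and the use of None for "no answer yet" are assumptions made for illustration. It models only the cases discussed in this thread and does not model a mismatch between the given SlaveID and a known task:

```python
from typing import Dict, Optional, Set


def reconcile(
    task_id: str,
    slave_id: Optional[str],
    known_tasks: Dict[str, str],
    registered_slaves: Set[str],
    transitioning_slaves: Set[str],
) -> Optional[str]:
    """Return the status to send to the framework, or None to stay silent for now.

    Hypothetical model: known_tasks maps task_id to its latest state, and
    transitioning_slaves are slaves that have not reregistered after a
    master failover.
    """
    # Known task: answer with the latest state.
    if task_id in known_tasks:
        return known_tasks[task_id]

    # Unknown task, but the framework named a slave that has already
    # (re)registered: the task cannot be hiding there, so answer right away.
    if slave_id is not None and slave_id in registered_slaves:
        return "TASK_LOST"

    # Otherwise the task might live on a slave that has not reregistered yet.
    # Only once no slaves are transitioning is TASK_LOST safe to send.
    if not transitioning_slaves:
        return "TASK_LOST"
    return None
```

The second branch is the speed-up confirmed in the thread: with a SlaveID, the answer does not have to wait for every transitioning slave.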
