On Mon, Sep 15, 2014 at 3:11 PM, Niklas Nielsen <[email protected]>
wrote:

> Thanks for your input Ben! (Comments inlined)
>
> On 15 September 2014 12:35, Benjamin Mahler <[email protected]>
> wrote:
>
> > To ensure that the architecture of Mesos remains a scalable one, we want
> > to persist state in the leaves of the system as much as possible. This is
> > why the master has never persisted tasks, task states, or status updates.
> > Note that, at the current time, status updates can contain arbitrarily
> > large amounts of data (example impact of this:
> > https://issues.apache.org/jira/browse/MESOS-1746).
> >
> > Adam, I think solution (2) you listed above of including a terminal
> > indicator in the _private_ message between slave and master would easily
> > allow us to release the resources at the correct time. We would still
> > hold the correct task state in the master, and would maintain the status
> > update stream invariants for frameworks (guaranteed ordered delivery).
> > This would be simpler to implement with my recent change here, because
> > you no longer have to remove the task to release the resources:
> > https://reviews.apache.org/r/25568/
>
>
> "We should still hold the correct task state"; meaning the actual state of
> the task on the slave?
>

Correct, as in, just maintaining the existing task state semantics in the
master (for correctness of reconciliation / status update ordering).


> Then the auxiliary field should represent the "last known" status, which may
> not necessarily be terminal.
> For example, a staging update followed by a running update while the
> framework is disconnected will still show as staging - or am I missing
> something?
>

It would still show as staging; the running update would never be sent to
the master, since the slave is not receiving acknowledgements. But if the
task had reached a terminal state, the slave would set the auxiliary field
in the StatusUpdateMessage and the master would know to release the
resources.

+ backwards compatibility
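
To make the idea a bit more concrete, here is a minimal, self-contained
sketch of the master-side decision. This is illustrative only: the
latest_state field name, the handleUpdate() helper, and the enum below are
made up for the example, not the actual protobuf or master code.

#include <iostream>

// States a task can be in (loosely mirroring Mesos' TaskState).
enum class TaskState { STAGING, RUNNING, FINISHED, FAILED, KILLED, LOST };

// True for states after which a task can never run again.
bool isTerminal(TaskState state) {
  return state == TaskState::FINISHED || state == TaskState::FAILED ||
         state == TaskState::KILLED || state == TaskState::LOST;
}

// What the forwarded message would carry: the oldest unacknowledged update
// (preserving ordering for the framework) plus the latest state known to
// the slave (letting the master release resources early). Field names are
// assumptions for the sake of the sketch.
struct StatusUpdateMessage {
  TaskState update_state;   // State of the oldest unacked update.
  TaskState latest_state;   // Latest state known to the slave.
};

// Master side: keep presenting 'update_state' (e.g. STAGING) for
// reconciliation/ordering, but recover resources once 'latest_state' is
// terminal, even though the framework has not acknowledged anything yet.
void handleUpdate(const StatusUpdateMessage& message) {
  if (isTerminal(message.latest_state)) {
    std::cout << "Terminal latest state: release the task's resources\n";
  } else {
    std::cout << "Non-terminal latest state: keep holding the resources\n";
  }
}

int main() {
  // Niklas' scenario: RUNNING and FINISHED happened on the slave while the
  // framework was disconnected; only the STAGING update is in flight.
  handleUpdate({TaskState::STAGING, TaskState::FINISHED});
  return 0;
}

The point being: the task keeps showing the state of the oldest
unacknowledged update (here STAGING), so reconciliation and update ordering
for the framework are unchanged, while the resources can be released as soon
as the slave reports a terminal latest state.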


>
>
> >
> >
> > Longer term, adding pipelining of status updates would be a nice
> > improvement (similar to what you listed in (1) above). But as Vinod said,
> > it will require care to ensure we maintain the stream invariants for
> > frameworks (i.e. probably need to send multiple updates in 1 message).
> >
> > How does this sound?
> >
>
> Sounds great - we would love to help out with (2). Would you be up for
> shepherding such a change?
>

Yep!

The changes needed should be based on
https://reviews.apache.org/r/25568/, since it changes how resources are
released in the master.
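
As a rough illustration of why that review matters here (the names and
structure below are made up, not the code under review): once releasing
resources is decoupled from removing the task, the master can hand resources
back on a terminal indicator while keeping the task around until the
framework acknowledges the terminal update.

#include <iostream>
#include <map>
#include <string>

// Hypothetical sketch of master-side bookkeeping where releasing a task's
// resources is decoupled from removing the task.
struct Task {
  std::string id;
  double cpus = 0.0;
  bool resourcesReleased = false;  // Set once, when a terminal state is seen.
};

class Master {
public:
  void addTask(const Task& task) { tasks_[task.id] = task; }

  // Called when a forwarded update indicates the task reached a terminal
  // state on the slave, even if the framework has not acknowledged it yet.
  void releaseResources(const std::string& taskId) {
    Task& task = tasks_.at(taskId);
    if (!task.resourcesReleased) {
      task.resourcesReleased = true;
      availableCpus_ += task.cpus;  // Stand-in for allocator recovery.
    }
    // Note: the task is *not* erased here; it stays visible (with its last
    // known state) until the framework acknowledges the terminal update or
    // is torn down.
  }

  double availableCpus() const { return availableCpus_; }

private:
  std::map<std::string, Task> tasks_;
  double availableCpus_ = 0.0;
};

int main() {
  Master master;
  master.addTask({"task-10", 1.0});
  master.releaseResources("task-10");           // Terminal state observed.
  master.releaseResources("task-10");           // Idempotent on resend.
  std::cout << master.availableCpus() << "\n";  // Prints 1.
  return 0;
}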


>
>
> >
> > On Thu, Sep 11, 2014 at 12:02 PM, Adam Bordelon <[email protected]>
> > wrote:
> >
> > > Definitely relevant. If the master could be trusted to persist all the
> > > task status updates, then they could be queued up at the master instead
> > > of the slave once the master has acknowledged receipt. Then the master
> > > could have the most up-to-date task state and could recover the
> > > resources as soon as it receives a terminal update, even if the
> > > framework is disconnected and unable to receive/ack the status updates.
> > > Then, once the framework reconnects, the master will be responsible for
> > > sending its queued status updates. We will still need a queue on the
> > > slave side, but only for updates that the master has not persisted and
> > > ack'ed, primarily during the scenario when the slave is disconnected
> > > from the master.
> > >
> > > On Thu, Sep 11, 2014 at 10:17 AM, Vinod Kone <[email protected]> wrote:
> > >
> > >> The semantics of these changes would have an impact on the upcoming
> > >> task reconciliation.
> > >>
> > >> @BenM: Can you chime in here on how this fits into the task
> > >> reconciliation work that you've been leading?
> > >>
> > >> On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon <[email protected]>
> > >> wrote:
> > >>
> > >> > I agree with Niklas that if the executor has sent a terminal status
> > >> > update to the slave, then the task is done and the master should be
> > >> > able to recover those resources. Only sending the oldest status
> > >> > update to the master, especially in the case of framework failover,
> > >> > prevents these resources from being recovered in a timely manner. I
> > >> > see a couple of options for getting around this, each with their own
> > >> > disadvantages.
> > >> > 1) Send the entire status update stream to the master. Once the
> > >> > master sees the terminal status update, it will removeTask and
> > >> > recover the resources. Future resends of the update will be
> > >> > forwarded to the scheduler, but the master will ignore (with warning
> > >> > and invalid_update++ metrics) the subsequent updates as far as its
> > >> > own state for the removed task is concerned. Disadvantage:
> > >> > Potentially sends a lot of status update messages until the
> > >> > scheduler reregisters and acknowledges the updates. Disadvantage 2:
> > >> > Updates could be sent to the scheduler out of order if some updates
> > >> > are dropped between the slave and master.
> > >> > 2) Send only the oldest status update to the master, but with an
> > >> > annotation of the final/terminal state of the task, if any. That way
> > >> > the master can call removeTask to update its internal state for the
> > >> > task (and update the UI) and recover the resources for the task.
> > >> > While the scheduler is still down, the oldest update will continue
> > >> > to be resent and forwarded, but the master will ignore the update
> > >> > (with a warning as above) as far as its own internal state is
> > >> > concerned. When the scheduler reregisters, the update stream will be
> > >> > forwarded and acknowledged one-at-a-time as before, guaranteeing
> > >> > status update ordering to the scheduler. Disadvantage: Seems a bit
> > >> > hacky to tack a terminal state onto a running update. Disadvantage 2:
> > >> > The state endpoint won't show all the status updates until the
> > >> > entire stream actually gets forwarded and acknowledged.
> > >> > Thoughts?
> > >> >
> > >> >
> > >> > On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone <[email protected]>
> > >> > wrote:
> > >> >
> > >> > > The main reason is to keep the status update manager simple. Also,
> > >> > > it is very easy to enforce the order of updates to the
> > >> > > master/framework in this model. If we allow multiple updates for a
> > >> > > task to be in flight, it's really hard (impossible?) to ensure that
> > >> > > we are not delivering out-of-order updates even in edge cases
> > >> > > (failover, network partitions, etc.).
> > >> > >
> > >> > > On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen <[email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > > Hey Vinod - thanks for chiming in!
> > >> > > >
> > >> > > > Is there a particular reason for only having one status update in
> > >> > > > flight? Or, to put it another way, isn't that behavior too strict,
> > >> > > > given that the master state could present the most recent known
> > >> > > > state if the status update manager tried to send more than the
> > >> > > > front of the stream?
> > >> > > > With very long failover timeouts, just waiting for those to
> > >> > > > disappear seems a bit tedious and hogs the cluster.
> > >> > > >
> > >> > > > Niklas
> > >> > > >
> > >> > > > On 10 September 2014 17:18, Vinod Kone <[email protected]>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > What you observed is expected because of the way the slave
> > >> > > > > (specifically, the status update manager) operates.
> > >> > > > >
> > >> > > > > The status update manager only sends the next update for a task
> > >> > > > > if a previous update (if it exists) has been acked.
> > >> > > > >
> > >> > > > > In your case, since TASK_RUNNING was not acked by the framework,
> > >> > > > > the master doesn't know about the TASK_FINISHED update that is
> > >> > > > > queued up by the status update manager.
> > >> > > > >
> > >> > > > > If the framework never comes back, i.e., the failover timeout
> > >> > > > > elapses, the master shuts down the framework, which releases
> > >> > > > > those resources.
> > >> > > > >
> > >> > > > > On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen <[email protected]>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Here is the log of a mesos-local instance where I reproduced
> > >> > > > > > it: https://gist.github.com/nqn/f7ee20601199d70787c0 (here
> > >> > > > > > tasks 10 to 19 are stuck in the running state).
> > >> > > > > > There is a lot of output, so here is a filtered log for task
> > >> > > > > > 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> > >> > > > > >
> > >> > > > > > At first glance, it looks like the task can't be found when
> > >> > > > > > trying to forward the finished update, because the running
> > >> > > > > > update never got acknowledged before the framework
> > >> > > > > > disconnected. I may be missing something here.
> > >> > > > > >
> > >> > > > > > Niklas
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On 10 September 2014 16:09, Niklas Nielsen <[email protected]>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Hi guys,
> > >> > > > > > >
> > >> > > > > > > We have run into a problem that causes tasks which complete
> > >> > > > > > > while a framework is disconnected and has a failover time to
> > >> > > > > > > remain in a running state, even though the tasks actually
> > >> > > > > > > finish.
> > >> > > > > > >
> > >> > > > > > > Here is a test framework with which we have been able to
> > >> > > > > > > reproduce the issue:
> > >> > > > > > > https://gist.github.com/nqn/9b9b1de9123a6e836f54
> > >> > > > > > > It launches many short-lived tasks (1 second sleep), and
> > >> > > > > > > when killing the framework instance, the master reports the
> > >> > > > > > > tasks as running even after several minutes:
> > >> > > > > > >
> > >> > > > > > > http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> > >> > > > > > >
> > >> > > > > > > When clicking on one of the slaves where, for example, task
> > >> > > > > > > 49 runs, the slave knows that it completed:
> > >> > > > > > >
> > >> > > > > > > http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> > >> > > > > > >
> > >> > > > > > > The tasks only finish when the framework connects again
> > >> > > > > > > (which it may never do). This is on Mesos 0.20.0, but it
> > >> > > > > > > also applies to HEAD (as of today).
> > >> > > > > > > Do you guys have any insights into what may be going on
> > >> > > > > > > here? Is this by design or a bug?
> > >> > > > > > >
> > >> > > > > > > Thanks,
> > >> > > > > > > Niklas
