Thanks for your input, Ben! (Comments inlined)

On 15 September 2014 12:35, Benjamin Mahler <[email protected]> wrote:
> To ensure that the architecture of mesos remains a scalable one, we want
> to persist state in the leaves of the system as much as possible. This is
> why the master has never persisted tasks, task states, or status updates.
> Note that status updates can contain arbitrarily large amounts of data at
> the current time (example impact of this:
> https://issues.apache.org/jira/browse/MESOS-1746).
>
> Adam, I think solution (2) you listed above of including a terminal
> indicator in the _private_ message between slave and master would easily
> allow us to release the resources at the correct time. We would still
> hold the correct task state in the master, and would maintain the status
> update stream invariants for frameworks (guaranteed ordered delivery).
> This would be simpler to implement with my recent change here, because
> you no longer have to remove the task to release the resources:
> https://reviews.apache.org/r/25568/

"We would still hold the correct task state" - meaning the actual state of
the task on the slave? Then the auxiliary field should represent the "last
known" status, which may not necessarily be terminal. For example, a
staging update followed by a running update while the framework is
disconnected will still show as staging - or am I missing something?
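To make sure we're talking about the same thing, here is a rough sketch of
how I picture the private message and the master-side handling (all type,
field and function names below are made up for illustration - this is not
the actual Mesos protocol or API):

#include <iostream>
#include <string>

// Made-up names throughout; illustration only.
enum TaskState
{
  TASK_STAGING,
  TASK_RUNNING,
  TASK_FINISHED,
  TASK_FAILED,
  TASK_KILLED,
  TASK_LOST
};

bool isTerminal(TaskState state)
{
  return state == TASK_FINISHED || state == TASK_FAILED ||
         state == TASK_KILLED || state == TASK_LOST;
}

// Private slave -> master message: the oldest unacked update, plus the
// last state the slave knows about ("last known", not necessarily
// terminal).
struct StatusUpdateMessage
{
  std::string taskId;
  TaskState state;        // Oldest unacked update; what the framework sees next.
  TaskState latestState;  // e.g. TASK_RUNNING while TASK_STAGING is still unacked.
};

// Hypothetical master-side helpers (stand-ins, not real Mesos functions).
void updateTaskState(const std::string& taskId, TaskState state)
{
  std::cout << "task " << taskId << " -> state " << state << std::endl;
}

void releaseResources(const std::string& taskId)
{
  std::cout << "releasing resources of task " << taskId << std::endl;
}

void forwardToFramework(const StatusUpdateMessage& update)
{
  std::cout << "forwarding update for task " << update.taskId << std::endl;
}

// Master-side handling (sketch).
void onStatusUpdate(const StatusUpdateMessage& update)
{
  // Reflect the freshest known state in the master's view of the task.
  updateTaskState(update.taskId, update.latestState);

  // Release resources as soon as the slave has seen a terminal update,
  // without removing the task or breaking the ordered update stream.
  if (isTerminal(update.latestState)) {
    releaseResources(update.taskId);
  }

  // The framework still only ever sees the oldest unacked update.
  forwardToFramework(update);
}

int main()
{
  // My example above: TASK_STAGING still unacked, but the slave has
  // already seen TASK_RUNNING - latestState is non-terminal, so no
  // resources are released yet.
  onStatusUpdate({"task-10", TASK_STAGING, TASK_RUNNING});

  // Later the task finishes; resources can be released even though the
  // framework is still catching up on TASK_STAGING.
  onStatusUpdate({"task-10", TASK_STAGING, TASK_FINISHED});
  return 0;
}

With a "last known" field like that, the master's view stays fresh and
resources are released on a terminal state, while the framework-facing
stream is left untouched.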
> Longer term, adding pipelining of status updates would be a nice
> improvement (similar to what you listed in (1) above). But as Vinod said,
> it will require care to ensure we maintain the stream invariants for
> frameworks (i.e. probably need to send multiple updates in 1 message).
>
> How does this sound?

Sounds great - we would love to help out with (2). Would you be up for
shepherding such a change?

> On Thu, Sep 11, 2014 at 12:02 PM, Adam Bordelon <[email protected]> wrote:
>
> > Definitely relevant. If the master could be trusted to persist all the
> > task status updates, then they could be queued up at the master instead
> > of the slave once the master has acknowledged its receipt. Then the
> > master could have the most up-to-date task state and can recover the
> > resources as soon as it receives a terminal update, even if the
> > framework is disconnected and unable to receive/ack the status updates.
> > Then, once the framework reconnects, the master will be responsible for
> > sending its queued status updates. We will still need a queue on the
> > slave side, but only for updates that the master has not persisted and
> > ack'ed, primarily during the scenario when the slave is disconnected
> > from the master.
> >
> > On Thu, Sep 11, 2014 at 10:17 AM, Vinod Kone <[email protected]> wrote:
> >
> > > The semantics of these changes would have an impact on the upcoming
> > > task reconciliation.
> > >
> > > @BenM: Can you chime in here on how this fits into the task
> > > reconciliation work that you've been leading?
> > >
> > > On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon <[email protected]> wrote:
> > >
> > > > I agree with Niklas that if the executor has sent a terminal status
> > > > update to the slave, then the task is done and the master should be
> > > > able to recover those resources. Only sending the oldest status
> > > > update to the master, especially in the case of framework failover,
> > > > prevents these resources from being recovered in a timely manner. I
> > > > see a couple of options for getting around this, each with their own
> > > > disadvantages.
> > > >
> > > > 1) Send the entire status update stream to the master. Once the
> > > > master sees the terminal status update, it will removeTask and
> > > > recover the resources. Future resends of the update will be
> > > > forwarded to the scheduler, but the master will ignore the
> > > > subsequent updates (with a warning and invalid_update++ metrics) as
> > > > far as its own state for the removed task is concerned.
> > > > Disadvantage: potentially sends a lot of status update messages
> > > > until the scheduler reregisters and acknowledges the updates.
> > > > Disadvantage 2: updates could be sent to the scheduler out of order
> > > > if some updates are dropped between the slave and master.
> > > >
> > > > 2) Send only the oldest status update to the master, but with an
> > > > annotation of the final/terminal state of the task, if any. That
> > > > way the master can call removeTask to update its internal state for
> > > > the task (and update the UI) and recover the resources for the
> > > > task. While the scheduler is still down, the oldest update will
> > > > continue to be resent and forwarded, but the master will ignore the
> > > > update (with a warning as above) as far as its own internal state
> > > > is concerned. When the scheduler reregisters, the update stream
> > > > will be forwarded and acknowledged one at a time as before,
> > > > guaranteeing status update ordering to the scheduler. Disadvantage:
> > > > it seems a bit hacky to tack a terminal state onto a running
> > > > update. Disadvantage 2: the state endpoint won't show all the
> > > > status updates until the entire stream actually gets forwarded and
> > > > acknowledged.
> > > >
> > > > Thoughts?
> > > >
> > > > On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone <[email protected]> wrote:
> > > >
> > > > > The main reason is to keep the status update manager simple.
> > > > > Also, it is very easy to enforce the order of updates to the
> > > > > master/framework in this model. If we allow multiple updates for
> > > > > a task to be in flight, it's really hard (impossible?) to ensure
> > > > > that we are not delivering out-of-order updates even in edge
> > > > > cases (failover, network partitions, etc.).
> > > > >
> > > > > On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen <[email protected]> wrote:
> > > > >
> > > > > > Hey Vinod - thanks for chiming in!
> > > > > >
> > > > > > Is there a particular reason for only having one status update
> > > > > > in flight? Or, to put it another way, isn't that behavior too
> > > > > > strict, given that the master state could present the most
> > > > > > recent known state if the status update manager sent more than
> > > > > > just the front of the stream? With very long failover timeouts,
> > > > > > just waiting for those to elapse seems a bit tedious and hogs
> > > > > > the cluster.
> > > > > >
> > > > > > Niklas
> > > > > >
> > > > > > On 10 September 2014 17:18, Vinod Kone <[email protected]> wrote:
> > > > > >
> > > > > > > What you observed is expected because of the way the slave
> > > > > > > (specifically, the status update manager) operates.
> > > > > > >
> > > > > > > The status update manager only sends the next update for a
> > > > > > > task once the previous update (if one exists) has been acked.
> > > > > > >
> > > > > > > In your case, since TASK_RUNNING was not acked by the
> > > > > > > framework, the master doesn't know about the TASK_FINISHED
> > > > > > > update that is queued up by the status update manager.
> > > > > > >
> > > > > > > If the framework never comes back, i.e., the failover timeout
> > > > > > > elapses, the master shuts down the framework, which releases
> > > > > > > those resources.
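For my own understanding, the one-in-flight behavior Vinod describes boils
down to something like the following - a heavily simplified sketch, not
the actual status update manager code:

#include <deque>
#include <iostream>
#include <string>

struct StatusUpdate
{
  std::string taskId;
  std::string state;  // e.g. "TASK_RUNNING", "TASK_FINISHED".
};

class StatusUpdateStream
{
public:
  void enqueue(const StatusUpdate& update)
  {
    pending.push_back(update);

    // Only forward if nothing else is in flight. Otherwise the update
    // (e.g. TASK_FINISHED) just sits in the queue, and the master never
    // hears about it until everything before it has been acked.
    if (pending.size() == 1) {
      forward(pending.front());
    }
  }

  void acknowledge()
  {
    if (pending.empty()) {
      return;
    }
    pending.pop_front();
    if (!pending.empty()) {
      forward(pending.front());  // Now the next update may go out.
    }
  }

private:
  void forward(const StatusUpdate& update)
  {
    // Stand-in for sending the update to the master.
    std::cout << "forwarding " << update.taskId << " " << update.state
              << std::endl;
  }

  std::deque<StatusUpdate> pending;  // Front = oldest unacked update.
};

int main()
{
  StatusUpdateStream stream;
  stream.enqueue({"task-10", "TASK_RUNNING"});   // Forwarded.
  stream.enqueue({"task-10", "TASK_FINISHED"});  // Queued, NOT forwarded.

  // Framework disconnected => TASK_RUNNING is never acked, so
  // TASK_FINISHED stays queued and the master keeps showing the task
  // as running.
  return 0;
}

Which matches what we saw: with the framework gone, TASK_RUNNING is never
acked and TASK_FINISHED never leaves the slave.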
> > > > > > > On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen <[email protected]> wrote:
> > > > > > >
> > > > > > > > Here is the log of a mesos-local instance where I
> > > > > > > > reproduced it:
> > > > > > > > https://gist.github.com/nqn/f7ee20601199d70787c0
> > > > > > > > (Here tasks 10 to 19 are stuck in the running state.)
> > > > > > > >
> > > > > > > > There is a lot of output, so here is a filtered log for
> > > > > > > > task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> > > > > > > >
> > > > > > > > At first glance, it looks like the task can't be found when
> > > > > > > > trying to forward the finished update because the running
> > > > > > > > update never got acknowledged before the framework
> > > > > > > > disconnected. I may be missing something here.
> > > > > > > >
> > > > > > > > Niklas
> > > > > > > >
> > > > > > > > On 10 September 2014 16:09, Niklas Nielsen <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi guys,
> > > > > > > > >
> > > > > > > > > We have run into a problem that causes tasks which
> > > > > > > > > complete while a framework is disconnected (with a
> > > > > > > > > failover timeout set) to remain in a running state even
> > > > > > > > > though they actually finished.
> > > > > > > > >
> > > > > > > > > Here is a test framework we have been able to reproduce
> > > > > > > > > the issue with:
> > > > > > > > > https://gist.github.com/nqn/9b9b1de9123a6e836f54
> > > > > > > > > It launches many short-lived tasks (1 second sleep), and
> > > > > > > > > when the framework instance is killed, the master reports
> > > > > > > > > the tasks as running even after several minutes:
> > > > > > > > > http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> > > > > > > > >
> > > > > > > > > When clicking on one of the slaves where, for example,
> > > > > > > > > task 49 runs, the slave knows that it completed:
> > > > > > > > > http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> > > > > > > > >
> > > > > > > > > The tasks only finish when the framework connects again
> > > > > > > > > (which it may never do). This is on Mesos 0.20.0, but it
> > > > > > > > > also applies to HEAD (as of today). Do you guys have any
> > > > > > > > > insights into what may be going on here? Is this by
> > > > > > > > > design or a bug?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Niklas
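PS: For anyone who wants to reproduce this without digging through the
gists, the test framework boils down to roughly the following (a
from-memory C++ sketch against the 0.20-era scheduler API - the gist above
is the authoritative version; the master address and resource sizes are
placeholders). Start it against a master, then kill -9 it while tasks are
in flight and watch the master keep reporting finished tasks as running:

#include <mesos/scheduler.hpp>

#include <string>
#include <vector>

using namespace mesos;

class SleepScheduler : public Scheduler
{
public:
  void resourceOffers(SchedulerDriver* driver,
                      const std::vector<Offer>& offers) override
  {
    for (const Offer& offer : offers) {
      TaskInfo task;
      task.set_name("sleep-" + std::to_string(launched));
      task.mutable_task_id()->set_value(std::to_string(launched++));
      task.mutable_slave_id()->MergeFrom(offer.slave_id());
      task.mutable_command()->set_value("sleep 1");  // Short-lived task.

      // Placeholder resource sizes.
      Resource* cpus = task.add_resources();
      cpus->set_name("cpus");
      cpus->set_type(Value::SCALAR);
      cpus->mutable_scalar()->set_value(0.1);

      Resource* mem = task.add_resources();
      mem->set_name("mem");
      mem->set_type(Value::SCALAR);
      mem->mutable_scalar()->set_value(32);

      driver->launchTasks(offer.id(), {task});
    }
  }

  // The remaining callbacks are no-ops for this repro.
  void registered(SchedulerDriver*, const FrameworkID&, const MasterInfo&) override {}
  void reregistered(SchedulerDriver*, const MasterInfo&) override {}
  void disconnected(SchedulerDriver*) override {}
  void offerRescinded(SchedulerDriver*, const OfferID&) override {}
  void statusUpdate(SchedulerDriver*, const TaskStatus&) override {}
  void frameworkMessage(SchedulerDriver*, const ExecutorID&, const SlaveID&, const std::string&) override {}
  void slaveLost(SchedulerDriver*, const SlaveID&) override {}
  void executorLost(SchedulerDriver*, const ExecutorID&, const SlaveID&, int) override {}
  void error(SchedulerDriver*, const std::string&) override {}

private:
  int launched = 0;
};

int main()
{
  FrameworkInfo framework;
  framework.set_user("");  // Let Mesos fill in the current user.
  framework.set_name("short-task-repro");
  framework.set_failover_timeout(3600);  // Long failover timeout (seconds).

  SleepScheduler scheduler;
  MesosSchedulerDriver driver(&scheduler, framework, "localhost:5050");
  return driver.run() == DRIVER_STOPPED ? 0 : 1;
}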
