Thanks for your input, Ben! (Comments inlined)

On 15 September 2014 12:35, Benjamin Mahler <[email protected]> wrote:
> To ensure that the architecture of mesos remains a scalable one, we want
> to persist state in the leaves of the system as much as possible. This is
> why the master has never persisted tasks, task states, or status updates.
> Note that status updates can contain arbitrarily large amounts of data at
> the current time (example impact of this:
> https://issues.apache.org/jira/browse/MESOS-1746).
>
> Adam, I think solution (2) you listed above of including a terminal
> indicator in the _private_ message between slave and master would easily
> allow us to release the resources at the correct time. We would still
> hold the correct task state in the master, and would maintain the status
> update stream invariants for frameworks (guaranteed ordered delivery).
> This would be simpler to implement with my recent change here, because
> you no longer have to remove the task to release the resources:
> https://reviews.apache.org/r/25568/

"We would still hold the correct task state" - meaning the actual state of
the task on the slave? Then the auxiliary field should represent the "last
known" status, which may not necessarily be terminal. For example, a
staging update followed by a running update while the framework is
disconnected will still show as staging - or am I missing something?
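To make sure we're talking about the same thing, here is a rough sketch of
how I picture the private message and the master-side handling (all type,
field and function names below are made up for illustration - this is not
the actual Mesos protocol or API):

#include <iostream>
#include <string>

// Made-up names throughout; illustration only.
enum TaskState
{
  TASK_STAGING,
  TASK_RUNNING,
  TASK_FINISHED,
  TASK_FAILED,
  TASK_KILLED,
  TASK_LOST
};

bool isTerminal(TaskState state)
{
  return state == TASK_FINISHED || state == TASK_FAILED ||
         state == TASK_KILLED || state == TASK_LOST;
}

// Private slave -> master message: the oldest unacked update, plus the
// last state the slave knows about ("last known", not necessarily
// terminal).
struct StatusUpdateMessage
{
  std::string taskId;
  TaskState state;        // Oldest unacked update; what the framework sees next.
  TaskState latestState;  // e.g. TASK_RUNNING while TASK_STAGING is still unacked.
};

// Hypothetical master-side helpers (stand-ins, not real Mesos functions).
void updateTaskState(const std::string& taskId, TaskState state)
{
  std::cout << "task " << taskId << " -> state " << state << std::endl;
}

void releaseResources(const std::string& taskId)
{
  std::cout << "releasing resources of task " << taskId << std::endl;
}

void forwardToFramework(const StatusUpdateMessage& update)
{
  std::cout << "forwarding update for task " << update.taskId << std::endl;
}

// Master-side handling (sketch).
void onStatusUpdate(const StatusUpdateMessage& update)
{
  // Reflect the freshest known state in the master's view of the task.
  updateTaskState(update.taskId, update.latestState);

  // Release resources as soon as the slave has seen a terminal update,
  // without removing the task or breaking the ordered update stream.
  if (isTerminal(update.latestState)) {
    releaseResources(update.taskId);
  }

  // The framework still only ever sees the oldest unacked update.
  forwardToFramework(update);
}

int main()
{
  // My example above: TASK_STAGING still unacked, but the slave has
  // already seen TASK_RUNNING - latestState is non-terminal, so no
  // resources are released yet.
  onStatusUpdate({"task-10", TASK_STAGING, TASK_RUNNING});

  // Later the task finishes; resources can be released even though the
  // framework is still catching up on TASK_STAGING.
  onStatusUpdate({"task-10", TASK_STAGING, TASK_FINISHED});
  return 0;
}

With a "last known" field like that, the master's view stays fresh and
resources are released on a terminal state, while the framework-facing
stream is left untouched.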
> Longer term, adding pipelining of status updates would be a nice
> improvement (similar to what you listed in (1) above). But as Vinod said,
> it will require care to ensure we maintain the stream invariants for
> frameworks (i.e. probably need to send multiple updates in 1 message).
>
> How does this sound?

Sounds great - we would love to help out with (2). Would you be up for
shepherding such a change?

> On Thu, Sep 11, 2014 at 12:02 PM, Adam Bordelon <[email protected]> wrote:
>
> > Definitely relevant. If the master could be trusted to persist all the
> > task status updates, then they could be queued up at the master instead
> > of the slave once the master has acknowledged its receipt. Then the
> > master could have the most up-to-date task state and can recover the
> > resources as soon as it receives a terminal update, even if the
> > framework is disconnected and unable to receive/ack the status updates.
> > Then, once the framework reconnects, the master will be responsible for
> > sending its queued status updates. We will still need a queue on the
> > slave side, but only for updates that the master has not persisted and
> > ack'ed, primarily during the scenario when the slave is disconnected
> > from the master.
> >
> > On Thu, Sep 11, 2014 at 10:17 AM, Vinod Kone <[email protected]> wrote:
> >
> > > The semantics of these changes would have an impact on the upcoming
> > > task reconciliation.
> > >
> > > @BenM: Can you chime in here on how this fits into the task
> > > reconciliation work that you've been leading?
> > >
> > > On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon <[email protected]> wrote:
> > >
> > > > I agree with Niklas that if the executor has sent a terminal status
> > > > update to the slave, then the task is done and the master should be
> > > > able to recover those resources. Only sending the oldest status
> > > > update to the master, especially in the case of framework failover,
> > > > prevents these resources from being recovered in a timely manner. I
> > > > see a couple of options for getting around this, each with their own
> > > > disadvantages.
> > > >
> > > > 1) Send the entire status update stream to the master. Once the
> > > > master sees the terminal status update, it will removeTask and
> > > > recover the resources. Future resends of the update will be
> > > > forwarded to the scheduler, but the master will ignore the
> > > > subsequent updates (with a warning and invalid_update++ metrics) as
> > > > far as its own state for the removed task is concerned.
> > > > Disadvantage: potentially sends a lot of status update messages
> > > > until the scheduler reregisters and acknowledges the updates.
> > > > Disadvantage 2: updates could be sent to the scheduler out of order
> > > > if some updates are dropped between the slave and master.
> > > >
> > > > 2) Send only the oldest status update to the master, but with an
> > > > annotation of the final/terminal state of the task, if any. That
> > > > way the master can call removeTask to update its internal state for
> > > > the task (and update the UI) and recover the resources for the
> > > > task. While the scheduler is still down, the oldest update will
> > > > continue to be resent and forwarded, but the master will ignore the
> > > > update (with a warning as above) as far as its own internal state
> > > > is concerned. When the scheduler reregisters, the update stream
> > > > will be forwarded and acknowledged one at a time as before,
> > > > guaranteeing status update ordering to the scheduler. Disadvantage:
> > > > it seems a bit hacky to tack a terminal state onto a running
> > > > update. Disadvantage 2: the state endpoint won't show all the
> > > > status updates until the entire stream actually gets forwarded and
> > > > acknowledged.
> > > >
> > > > Thoughts?
> > > >
> > > > On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone <[email protected]> wrote:
> > > >
> > > > > The main reason is to keep the status update manager simple.
> > > > > Also, it is very easy to enforce the order of updates to the
> > > > > master/framework in this model. If we allow multiple updates for
> > > > > a task to be in flight, it's really hard (impossible?) to ensure
> > > > > that we are not delivering out-of-order updates even in edge
> > > > > cases (failover, network partitions, etc.).
> > > > >
> > > > > On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen <[email protected]> wrote:
> > > > >
> > > > > > Hey Vinod - thanks for chiming in!
> > > > > >
> > > > > > Is there a particular reason for only having one status update
> > > > > > in flight? Or, to put it another way, isn't that behavior too
> > > > > > strict, given that the master state could present the most
> > > > > > recent known state if the status update manager sent more than
> > > > > > just the front of the stream? With very long failover timeouts,
> > > > > > just waiting for those to elapse seems a bit tedious and hogs
> > > > > > the cluster.
> > > > > >
> > > > > > Niklas
> > > > > >
> > > > > > On 10 September 2014 17:18, Vinod Kone <[email protected]> wrote:
> > > > > >
> > > > > > > What you observed is expected because of the way the slave
> > > > > > > (specifically, the status update manager) operates.
> > > > > > >
> > > > > > > The status update manager only sends the next update for a
> > > > > > > task once the previous update (if one exists) has been acked.
> > > > > > >
> > > > > > > In your case, since TASK_RUNNING was not acked by the
> > > > > > > framework, the master doesn't know about the TASK_FINISHED
> > > > > > > update that is queued up by the status update manager.
> > > > > > >
> > > > > > > If the framework never comes back, i.e., the failover timeout
> > > > > > > elapses, the master shuts down the framework, which releases
> > > > > > > those resources.
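For my own understanding, the one-in-flight behavior Vinod describes boils
down to something like the following - a heavily simplified sketch, not
the actual status update manager code:

#include <deque>
#include <iostream>
#include <string>

struct StatusUpdate
{
  std::string taskId;
  std::string state;  // e.g. "TASK_RUNNING", "TASK_FINISHED".
};

class StatusUpdateStream
{
public:
  void enqueue(const StatusUpdate& update)
  {
    pending.push_back(update);

    // Only forward if nothing else is in flight. Otherwise the update
    // (e.g. TASK_FINISHED) just sits in the queue, and the master never
    // hears about it until everything before it has been acked.
    if (pending.size() == 1) {
      forward(pending.front());
    }
  }

  void acknowledge()
  {
    if (pending.empty()) {
      return;
    }
    pending.pop_front();
    if (!pending.empty()) {
      forward(pending.front());  // Now the next update may go out.
    }
  }

private:
  void forward(const StatusUpdate& update)
  {
    // Stand-in for sending the update to the master.
    std::cout << "forwarding " << update.taskId << " " << update.state
              << std::endl;
  }

  std::deque<StatusUpdate> pending;  // Front = oldest unacked update.
};

int main()
{
  StatusUpdateStream stream;
  stream.enqueue({"task-10", "TASK_RUNNING"});   // Forwarded.
  stream.enqueue({"task-10", "TASK_FINISHED"});  // Queued, NOT forwarded.

  // Framework disconnected => TASK_RUNNING is never acked, so
  // TASK_FINISHED stays queued and the master keeps showing the task
  // as running.
  return 0;
}

Which matches what we saw: with the framework gone, TASK_RUNNING is never
acked and TASK_FINISHED never leaves the slave.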
> > > > > > > On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen <[email protected]> wrote:
> > > > > > >
> > > > > > > > Here is the log of a mesos-local instance where I
> > > > > > > > reproduced it:
> > > > > > > > https://gist.github.com/nqn/f7ee20601199d70787c0
> > > > > > > > (Here tasks 10 to 19 are stuck in the running state.)
> > > > > > > >
> > > > > > > > There is a lot of output, so here is a filtered log for
> > > > > > > > task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> > > > > > > >
> > > > > > > > At first glance, it looks like the task can't be found when
> > > > > > > > trying to forward the finished update because the running
> > > > > > > > update never got acknowledged before the framework
> > > > > > > > disconnected. I may be missing something here.
> > > > > > > >
> > > > > > > > Niklas
> > > > > > > >
> > > > > > > > On 10 September 2014 16:09, Niklas Nielsen <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi guys,
> > > > > > > > >
> > > > > > > > > We have run into a problem that causes tasks which
> > > > > > > > > complete while a framework is disconnected (with a
> > > > > > > > > failover timeout set) to remain in a running state even
> > > > > > > > > though they actually finished.
> > > > > > > > >
> > > > > > > > > Here is a test framework we have been able to reproduce
> > > > > > > > > the issue with:
> > > > > > > > > https://gist.github.com/nqn/9b9b1de9123a6e836f54
> > > > > > > > > It launches many short-lived tasks (1 second sleep), and
> > > > > > > > > when the framework instance is killed, the master reports
> > > > > > > > > the tasks as running even after several minutes:
> > > > > > > > > http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> > > > > > > > >
> > > > > > > > > When clicking on one of the slaves where, for example,
> > > > > > > > > task 49 runs, the slave knows that it completed:
> > > > > > > > > http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> > > > > > > > >
> > > > > > > > > The tasks only finish when the framework connects again
> > > > > > > > > (which it may never do). This is on Mesos 0.20.0, but it
> > > > > > > > > also applies to HEAD (as of today). Do you guys have any
> > > > > > > > > insights into what may be going on here? Is this by
> > > > > > > > > design or a bug?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Niklas
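PS: For anyone who wants to reproduce this without digging through the
gists, the test framework boils down to roughly the following (a
from-memory C++ sketch against the 0.20-era scheduler API - the gist above
is the authoritative version; the master address and resource sizes are
placeholders). Start it against a master, then kill -9 it while tasks are
in flight and watch the master keep reporting finished tasks as running:

#include <mesos/scheduler.hpp>

#include <string>
#include <vector>

using namespace mesos;

class SleepScheduler : public Scheduler
{
public:
  void resourceOffers(SchedulerDriver* driver,
                      const std::vector<Offer>& offers) override
  {
    for (const Offer& offer : offers) {
      TaskInfo task;
      task.set_name("sleep-" + std::to_string(launched));
      task.mutable_task_id()->set_value(std::to_string(launched++));
      task.mutable_slave_id()->MergeFrom(offer.slave_id());
      task.mutable_command()->set_value("sleep 1");  // Short-lived task.

      // Placeholder resource sizes.
      Resource* cpus = task.add_resources();
      cpus->set_name("cpus");
      cpus->set_type(Value::SCALAR);
      cpus->mutable_scalar()->set_value(0.1);

      Resource* mem = task.add_resources();
      mem->set_name("mem");
      mem->set_type(Value::SCALAR);
      mem->mutable_scalar()->set_value(32);

      driver->launchTasks(offer.id(), {task});
    }
  }

  // The remaining callbacks are no-ops for this repro.
  void registered(SchedulerDriver*, const FrameworkID&, const MasterInfo&) override {}
  void reregistered(SchedulerDriver*, const MasterInfo&) override {}
  void disconnected(SchedulerDriver*) override {}
  void offerRescinded(SchedulerDriver*, const OfferID&) override {}
  void statusUpdate(SchedulerDriver*, const TaskStatus&) override {}
  void frameworkMessage(SchedulerDriver*, const ExecutorID&, const SlaveID&, const std::string&) override {}
  void slaveLost(SchedulerDriver*, const SlaveID&) override {}
  void executorLost(SchedulerDriver*, const ExecutorID&, const SlaveID&, int) override {}
  void error(SchedulerDriver*, const std::string&) override {}

private:
  int launched = 0;
};

int main()
{
  FrameworkInfo framework;
  framework.set_user("");  // Let Mesos fill in the current user.
  framework.set_name("short-task-repro");
  framework.set_failover_timeout(3600);  // Long failover timeout (seconds).

  SleepScheduler scheduler;
  MesosSchedulerDriver driver(&scheduler, framework, "localhost:5050");
  return driver.run() == DRIVER_STOPPED ? 0 : 1;
}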
