On Mon, Sep 15, 2014 at 3:11 PM, Niklas Nielsen <[email protected]> wrote:
> Thanks for your input Ben! (Comments inlined)
>
> On 15 September 2014 12:35, Benjamin Mahler <[email protected]> wrote:
>
> > To ensure that the architecture of Mesos remains a scalable one, we
> > want to persist state in the leaves of the system as much as possible.
> > This is why the master has never persisted tasks, task states, or
> > status updates. Note that status updates can contain arbitrarily large
> > amounts of data at the current time (example impact of this:
> > https://issues.apache.org/jira/browse/MESOS-1746).
> >
> > Adam, I think solution (2) you listed above, of including a terminal
> > indicator in the _private_ message between slave and master, would
> > easily allow us to release the resources at the correct time. We would
> > still hold the correct task state in the master, and would maintain
> > the status update stream invariants for frameworks (guaranteed ordered
> > delivery). This would be simpler to implement with my recent change
> > here, because you no longer have to remove the task to release the
> > resources: https://reviews.apache.org/r/25568/
>
> "We should still hold the correct task state"; meaning the actual state
> of the task on the slave?

Correct, as in, just maintaining the existing task state semantics in the
master (for correctness of reconciliation / status update ordering).

> Then the auxiliary field should represent the "last known" status, which
> may not necessarily be terminal.
> For example, a staging update followed by a running update while the
> framework is disconnected will still show as staging - or am I missing
> something?

It would still show as staging; the running update would never be sent to
the master, since the slave is not receiving acknowledgements. But if the
task reached a terminal state, then the slave would set the auxiliary field
in the StatusUpdateMessage and the master would know to release the
resources. Plus, this stays backwards compatible.

> > Longer term, adding pipelining of status updates would be a nice
> > improvement (similar to what you listed in (1) above). But as Vinod
> > said, it will require care to ensure we maintain the stream invariants
> > for frameworks (i.e. probably need to send multiple updates in 1
> > message).
> >
> > How does this sound?
>
> Sounds great - we would love to help out with (2). Would you be up for
> shepherding such a change?

Yep! The changes needed should be based off of
https://reviews.apache.org/r/25568/ since it changes the resource releasing
in the master.

> > On Thu, Sep 11, 2014 at 12:02 PM, Adam Bordelon <[email protected]> wrote:
> >
> > > Definitely relevant. If the master could be trusted to persist all
> > > the task status updates, then they could be queued up at the master
> > > instead of the slave once the master has acknowledged their receipt.
> > > Then the master could have the most up-to-date task state and can
> > > recover the resources as soon as it receives a terminal update, even
> > > if the framework is disconnected and unable to receive/ack the status
> > > updates. Then, once the framework reconnects, the master will be
> > > responsible for sending its queued status updates. We will still need
> > > a queue on the slave side, but only for updates that the master has
> > > not persisted and ack'ed, primarily during the scenario when the
> > > slave is disconnected from the master.
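To make the annotation in solution (2) a bit more concrete, here is a rough
sketch of the idea; the type and field names below are made up for
illustration and are not the actual Mesos protobufs or master code:

    // Sketch of solution (2): the slave forwards the oldest unacknowledged
    // update, annotated with the task's latest terminal state (if any), so
    // the master can release resources early while the update stream is
    // still delivered to the framework in order. Hypothetical names only.
    #include <iostream>
    #include <optional>
    #include <string>

    enum class TaskState { STAGING, RUNNING, FINISHED, FAILED, KILLED, LOST };

    bool isTerminal(TaskState s) {
      return s == TaskState::FINISHED || s == TaskState::FAILED ||
             s == TaskState::KILLED || s == TaskState::LOST;
    }

    // What the slave would send: the front of the status update stream plus
    // an optional hint that a terminal update is queued behind it.
    struct ForwardedUpdate {
      std::string taskId;
      TaskState state;                          // Oldest unacknowledged update.
      std::optional<TaskState> latestTerminal;  // Set iff the task terminated.
    };

    // Master side: keep the framework-facing task state tied to the update
    // stream (ordering is preserved), but recover resources on the hint.
    void handleUpdate(const ForwardedUpdate& update) {
      std::cout << "task " << update.taskId << " still shows state "
                << static_cast<int>(update.state) << "\n";

      if (update.latestTerminal && isTerminal(*update.latestTerminal)) {
        // The remaining updates will still reach the framework in order
        // once it reconnects and starts acknowledging.
        std::cout << "releasing resources for " << update.taskId << "\n";
      }
    }

    int main() {
      // TASK_RUNNING is still unacknowledged, but the task already finished.
      handleUpdate({"task-10", TaskState::RUNNING, TaskState::FINISHED});
    }

The framework-facing path would stay as it is today; the hint would only be
consulted for resource accounting, which is why the ordering guarantees are
unaffected.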
> > > On Thu, Sep 11, 2014 at 10:17 AM, Vinod Kone <[email protected]> wrote:
> > >
> > > > The semantics of these changes would have an impact on the upcoming
> > > > task reconciliation.
> > > >
> > > > @BenM: Can you chime in here on how this fits into the task
> > > > reconciliation work that you've been leading?
> > > >
> > > > On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon <[email protected]> wrote:
> > > >
> > > > > I agree with Niklas that if the executor has sent a terminal
> > > > > status update to the slave, then the task is done and the master
> > > > > should be able to recover those resources. Only sending the
> > > > > oldest status update to the master, especially in the case of
> > > > > framework failover, prevents these resources from being recovered
> > > > > in a timely manner. I see a couple of options for getting around
> > > > > this, each with their own disadvantages.
> > > > > 1) Send the entire status update stream to the master. Once the
> > > > > master sees the terminal status update, it will removeTask and
> > > > > recover the resources. Future resends of the update will be
> > > > > forwarded to the scheduler, but the master will ignore (with
> > > > > warning and invalid_update++ metrics) the subsequent updates as
> > > > > far as its own state for the removed task is concerned.
> > > > > Disadvantage: Potentially sends a lot of status update messages
> > > > > until the scheduler reregisters and acknowledges the updates.
> > > > > Disadvantage2: Updates could be sent to the scheduler out of
> > > > > order if some updates are dropped between the slave and master.
> > > > > 2) Send only the oldest status update to the master, but with an
> > > > > annotation of the final/terminal state of the task, if any. That
> > > > > way the master can call removeTask to update its internal state
> > > > > for the task (and update the UI) and recover the resources for
> > > > > the task. While the scheduler is still down, the oldest update
> > > > > will continue to be resent and forwarded, but the master will
> > > > > ignore the update (with a warning as above) as far as its own
> > > > > internal state is concerned. When the scheduler reregisters, the
> > > > > update stream will be forwarded and acknowledged one-at-a-time as
> > > > > before, guaranteeing status update ordering to the scheduler.
> > > > > Disadvantage: Seems a bit hacky to tack a terminal state onto a
> > > > > running update. Disadvantage2: State endpoint won't show all the
> > > > > status updates until the entire stream actually gets
> > > > > forwarded+acknowledged.
> > > > > Thoughts?
> > > > >
> > > > > On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone <[email protected]> wrote:
> > > > >
> > > > > > The main reason is to keep status update manager simple. Also,
> > > > > > it is very easy to enforce the order of updates to the
> > > > > > master/framework in this model. If we allow multiple updates
> > > > > > for a task to be in flight, it's really hard (impossible?) to
> > > > > > ensure that we are not delivering out-of-order updates even in
> > > > > > edge cases (failover, network partitions etc).
> > > > > >
> > > > > > On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen <[email protected]> wrote:
> > > > > >
> > > > > > > Hey Vinod - thanks for chiming in!
> > > > > > >
> > > > > > > Is there a particular reason for only having one status in
> > > > > > > flight?
> > > > > > > Or, to put it another way, isn't that too strict a behavior,
> > > > > > > given that the master state could present the most recent
> > > > > > > known state if the status update manager tried to send more
> > > > > > > than the front of the stream?
> > > > > > > Given very long timeouts, just waiting for those to elapse
> > > > > > > seems a bit tedious and hogs the cluster.
> > > > > > >
> > > > > > > Niklas
> > > > > > >
> > > > > > > On 10 September 2014 17:18, Vinod Kone <[email protected]> wrote:
> > > > > > >
> > > > > > > > What you observed is expected because of the way the slave
> > > > > > > > (specifically, the status update manager) operates.
> > > > > > > >
> > > > > > > > The status update manager only sends the next update for a
> > > > > > > > task if a previous update (if it exists) has been acked.
> > > > > > > >
> > > > > > > > In your case, since TASK_RUNNING was not acked by the
> > > > > > > > framework, the master doesn't know about the TASK_FINISHED
> > > > > > > > update that is queued up by the status update manager.
> > > > > > > >
> > > > > > > > If the framework never comes back, i.e., the failover
> > > > > > > > timeout elapses, the master shuts down the framework, which
> > > > > > > > releases those resources.
> > > > > > > >
> > > > > > > > On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Here is the log of a mesos-local instance where I
> > > > > > > > > reproduced it:
> > > > > > > > > https://gist.github.com/nqn/f7ee20601199d70787c0
> > > > > > > > > (Here tasks 10 to 19 are stuck in the running state.)
> > > > > > > > > There is a lot of output, so here is a filtered log for
> > > > > > > > > task 10:
> > > > > > > > > https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> > > > > > > > >
> > > > > > > > > At first glance, it looks like the task can't be found
> > > > > > > > > when trying to forward the finished update, because the
> > > > > > > > > running update never got acknowledged before the
> > > > > > > > > framework disconnected. I may be missing something here.
> > > > > > > > >
> > > > > > > > > Niklas
> > > > > > > > >
> > > > > > > > > On 10 September 2014 16:09, Niklas Nielsen <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi guys,
> > > > > > > > > >
> > > > > > > > > > We have run into a problem that causes tasks which
> > > > > > > > > > complete while a framework is disconnected and has a
> > > > > > > > > > failover time to remain in a running state, even though
> > > > > > > > > > the tasks actually finish.
> > > > > > > > > >
> > > > > > > > > > Here is a test framework we have been able to reproduce
> > > > > > > > > > the issue with:
> > > > > > > > > > https://gist.github.com/nqn/9b9b1de9123a6e836f54
> > > > > > > > > > It launches many short-lived tasks (1 second sleep)
> > > > > > > > > > and, when the framework instance is killed, the master
> > > > > > > > > > reports the tasks as running even after several
> > > > > > > > > > minutes:
> > > > > > > > > > http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> > > > > > > > > >
> > > > > > > > > > When clicking on one of the slaves where, for example,
> > > > > > > > > > task 49 runs, the slave knows that it completed:
> > > > > > > > > > http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> > > > > > > > > >
> > > > > > > > > > The tasks only show as finished when the framework
> > > > > > > > > > connects again (which it may never do). This is on
> > > > > > > > > > Mesos 0.20.0, but it also applies to HEAD (as of
> > > > > > > > > > today). Do you guys have any insights into what may be
> > > > > > > > > > going on here? Is this by design or a bug?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Niklas
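To illustrate the behavior Vinod describes above, here is a rough model of
the per-task gating in the status update manager; the class and names below
are a simplified, hypothetical sketch, not the actual Mesos implementation:

    // Simplified model: per task, only the front of the update queue is
    // forwarded, and the next update goes out only after the previous one
    // has been acknowledged. This is what preserves ordered delivery to
    // the framework, and also what keeps later updates queued on the slave
    // while the framework is disconnected.
    #include <deque>
    #include <iostream>
    #include <map>
    #include <string>

    struct UpdateStream {
      std::deque<std::string> pending;  // Unacknowledged updates, in order.
      bool inFlight = false;            // True if the front has been forwarded.
    };

    class StatusUpdateManagerModel {
    public:
      void enqueue(const std::string& taskId, const std::string& update) {
        UpdateStream& stream = streams_[taskId];
        stream.pending.push_back(update);
        maybeForward(taskId, stream);
      }

      void acknowledge(const std::string& taskId) {
        UpdateStream& stream = streams_[taskId];
        if (!stream.inFlight) return;  // Nothing outstanding for this task.
        stream.pending.pop_front();    // The acknowledged update is done.
        stream.inFlight = false;
        maybeForward(taskId, stream);  // Now the next update may go out.
      }

    private:
      void maybeForward(const std::string& taskId, UpdateStream& stream) {
        // Only one update per task may be in flight at a time.
        if (stream.inFlight || stream.pending.empty()) return;
        stream.inFlight = true;
        std::cout << "forwarding " << stream.pending.front()
                  << " for " << taskId << "\n";
      }

      std::map<std::string, UpdateStream> streams_;
    };

    int main() {
      StatusUpdateManagerModel manager;
      manager.enqueue("task-10", "TASK_RUNNING");
      manager.enqueue("task-10", "TASK_FINISHED");  // Held: RUNNING not acked.
      // With the framework disconnected, no acknowledgement ever arrives,
      // so TASK_FINISHED is never forwarded.
      // manager.acknowledge("task-10");  // Would release TASK_FINISHED.
    }

Because the acknowledgement for TASK_RUNNING never arrives while the
framework is away, TASK_FINISHED stays queued on the slave and the master
keeps showing the task as running, which matches what the test framework
above reproduces.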
