> On Nov. 22, 2013, 8:04 p.m., Niklas Nielsen wrote:
> > Did we get to a conclusion regarding case 1)? and could we write a test
> > which exercises the new scenarios?
>
> Brenden Matthews wrote:
>     If I get some time, I'll write a test. I've been testing it in
>     production for a few days though.
>
>     Not sure about consensus. Would like to hear from the others.
>
> Benjamin Hindman wrote:
>     Regarding Case 1, is the framework not receiving the status updates from
>     the slave? That seems more severe. When we added reconcileTasks we
>     specifically decided that we would not send status updates for all possible
>     tasks precisely because we could get into some incorrect situations.
>
>     Regarding Case 2, why is a framework losing track of running tasks?
>     That's either a bug in the framework or it isn't keeping track of tasks in
>     the first place. Maybe we need a different API call that returns the list of
>     tasks and statuses that the master knows about?

The original problem I tried to solve with this actually turned out to be
caused by a bug in Marathon
( https://github.com/mesosphere/marathon/commit/1a39f8a37b4db34c088a1669d43a400122c48ba4 ).

That said, it seems confusing to me that the reconciliation wouldn't include
updates for tasks which either the master or the framework doesn't know about.
I'm fine with also having a separate API call.

What about using the status timestamps to avoid some of the incorrect
situations?


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Brenden Matthews wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
>
> (Updated Nov. 22, 2013, 12:30 a.m.)
>
>
> Review request for mesos and Niklas Nielsen.
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> Fixed some task reconciliation cases.
>
> Case 1:
>
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost. It's possible that the following events
> occurred:
>
> 1) Framework disconnected from master
> 2) Master terminated framework's tasks
> 3) Framework reconnects to master, and (incorrectly) assumes tasks are
>    still running
>
> Case 2:
>
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
>
> Review: https://reviews.apache.org/r/15745
>
>
> Diffs
> -----
>
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171
>
> Diff: https://reviews.apache.org/r/15745/diff/
>
>
> Testing
> -------
>
> `make check` & tested in staging cluster.
>
>
> Thanks,
>
> Brenden Matthews
>
>
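
For readers following the thread, the two cases in the description can be sketched roughly as below. This is an illustration in Python only, not the change made to src/master/master.cpp. The function name `reconcile`, the dictionary layout, and the state strings are all hypothetical. Staying silent for an unknown slave is also an assumption, made here to reflect the concern about sending updates the master cannot vouch for.

```python
from typing import Dict, List, Tuple

# Hypothetical state names for this sketch; not taken from the Mesos source.
TASK_RUNNING = "TASK_RUNNING"
TASK_LOST = "TASK_LOST"


def reconcile(
    master_tasks: Dict[str, Dict[str, str]],
    framework_view: List[Tuple[str, str, str]],
) -> List[Tuple[str, str]]:
    """Return (task_id, state) updates the master would send to the framework.

    master_tasks maps slave_id -> {task_id: state} as known to the master.
    framework_view lists (task_id, slave_id, state) as believed by the framework.
    """
    updates = []
    reported = set()

    for task_id, slave_id, believed_state in framework_view:
        reported.add(task_id)
        if slave_id not in master_tasks:
            # Unknown slave (assumption): send nothing, because the master
            # cannot tell whether the task is gone or the slave is just late.
            continue
        actual_state = master_tasks[slave_id].get(task_id)
        if actual_state is None:
            # Case 1: the slave is known but the task is not, so assume lost.
            updates.append((task_id, TASK_LOST))
        elif actual_state != believed_state:
            # The state changed since the framework last heard about the task.
            updates.append((task_id, actual_state))

    # Case 2: report running tasks that the framework did not mention at all.
    for slave_id in sorted(master_tasks):
        for task_id, state in sorted(master_tasks[slave_id].items()):
            if task_id not in reported and state == TASK_RUNNING:
                updates.append((task_id, state))

    return updates
```

With a master that knows tasks `t1` and `t3` on slave `s1`, a framework that reports `t1` and `t2` on `s1` would get a lost update for `t2` (Case 1) and a running update for `t3` (Case 2).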

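
The timestamp question raised in the reply could work along the following lines on the framework side. This is a guess at one possible approach, not something the thread agreed on. The helper `apply_update` and the `latest` bookkeeping dictionary are hypothetical names, and the sketch assumes each status update carries a comparable numeric timestamp.

```python
from typing import Dict, Tuple


def apply_update(
    latest: Dict[str, Tuple[float, str]],
    task_id: str,
    state: str,
    timestamp: float,
) -> bool:
    """Record an update only if it is newer than the one already stored.

    latest maps task_id -> (timestamp, state) for the newest update seen.
    Returns True if the update was applied and False if it was discarded.
    """
    current = latest.get(task_id)
    if current is not None and timestamp <= current[0]:
        # Stale or duplicate update: keep the newer state already recorded.
        return False
    latest[task_id] = (timestamp, state)
    return True
```

Under this scheme a reconciliation update that is older than a status the framework has already processed would be ignored rather than overwrite newer state.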