Here is the log of a mesos-local instance where I reproduced it: https://gist.github.com/nqn/f7ee20601199d70787c0 (tasks 10 to 19 are stuck in the running state). There is a lot of output, so here is a filtered log for task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
At first glance, it looks like the task can't be found when trying to forward the finish update, because the running update never got acknowledged before the framework disconnected. I may be missing something here.

Niklas

On 10 September 2014 16:09, Niklas Nielsen <[email protected]> wrote:

> Hi guys,
>
> We have run into a problem that causes tasks which complete, when a
> framework is disconnected and has a fail-over time, to remain in a running
> state even though the tasks actually finish.
>
> Here is a test framework we have been able to reproduce the issue with:
> https://gist.github.com/nqn/9b9b1de9123a6e836f54
> It launches many short-lived tasks (1 second sleep) and, when killing the
> framework instance, the master reports the tasks as running even after
> several minutes:
> http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
>
> When clicking on one of the slaves where, for example, task 49 runs, the
> slave knows that it completed:
> http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
>
> The tasks only finish when the framework connects again (which it may
> never do). This is on Mesos 0.20.0, but also applies to HEAD (as of today).
> Do you guys have any insights into what may be going on here? Is this
> by design or a bug?
>
> Thanks,
> Niklas
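
To make the hypothesis above concrete, here is a toy model of how the symptoms could arise. It is not Mesos code: `UpdateStream` and `simulate` are made-up names, and the rule that an update must be acknowledged before the next one for the same task is forwarded is an assumption of the sketch, not something confirmed from the logs.

```python
from collections import deque


class UpdateStream:
    """Toy per-task update queue (hypothetical, not the Mesos implementation).

    Assumption: the next update is forwarded only after the previous
    one has been acknowledged.
    """

    def __init__(self):
        self.pending = deque()          # updates not yet forwarded
        self.in_flight = None           # forwarded but not yet acknowledged
        self.latest_local_state = None  # what the slave itself knows

    def enqueue(self, state):
        self.latest_local_state = state
        self.pending.append(state)

    def forward_next(self):
        """Return the update to send to the master, or None if blocked or empty."""
        if self.in_flight is not None or not self.pending:
            return None
        self.in_flight = self.pending.popleft()
        return self.in_flight

    def acknowledge(self):
        self.in_flight = None


def simulate(framework_connected):
    """Return (state seen by the master, state known to the slave)."""
    stream = UpdateStream()
    master_view = None
    for state in ("TASK_RUNNING", "TASK_FINISHED"):
        stream.enqueue(state)
        update = stream.forward_next()
        if update is not None:
            master_view = update
            # Assumption: acknowledgements only arrive while the
            # framework is connected.
            if framework_connected:
                stream.acknowledge()
    return master_view, stream.latest_local_state


if __name__ == "__main__":
    print("connected:   ", simulate(True))
    print("disconnected:", simulate(False))
```

In the disconnected case of this model the running update is never acknowledged, so the finished update stays queued behind it: the master keeps showing the task as running while the slave already knows it finished, which is the same picture as in the two screenshots.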
