Here is the log of a mesos-local instance where I reproduced it:
https://gist.github.com/nqn/f7ee20601199d70787c0 (here, tasks 10 to 19 are
stuck in the running state).
There is a lot of output, so here is a filtered log for task 10:
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d

At first glance, it looks like the task can't be found when trying to
forward the finish update because the running update never got acknowledged
before the framework disconnected. I may be missing something here.
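
To make that reading concrete, here is a toy model in Python. It is not
Mesos code: the class and method names are made up for illustration, and it
assumes that updates for a task are forwarded one at a time, with the next
one held back until the previous one has been acknowledged.

```python
from collections import deque


class UpdateStream:
    """Toy per-task stream: forwards one update at a time, in order."""

    def __init__(self):
        self.pending = deque()   # updates not yet acknowledged
        self.forwarded = []      # updates the master has seen

    def update(self, state):
        self.pending.append(state)
        # Forward right away only if nothing older awaits an ack.
        if len(self.pending) == 1:
            self.forwarded.append(state)

    def acknowledge(self):
        # An ack retires the oldest update and releases the next one.
        self.pending.popleft()
        if self.pending:
            self.forwarded.append(self.pending[0])


def master_view(stream):
    """Latest state the master has been told about (hypothetical helper)."""
    return stream.forwarded[-1] if stream.forwarded else None


stream = UpdateStream()
stream.update("TASK_RUNNING")    # forwarded, now awaiting an ack
# Framework disconnects here, so no acknowledgement arrives.
stream.update("TASK_FINISHED")   # queued behind TASK_RUNNING
stuck = master_view(stream)

stream.acknowledge()             # framework reconnects and acks
recovered = master_view(stream)
```

Under that assumption the master keeps seeing TASK_RUNNING until the
framework comes back and acknowledges it, which matches what we observe.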

Niklas


On 10 September 2014 16:09, Niklas Nielsen <[email protected]> wrote:

> Hi guys,
>
> We have run into a problem that causes tasks which complete while a
> framework is disconnected (and has a failover timeout) to remain in a
> running state even though the tasks have actually finished.
>
> Here is a test framework we have been able to reproduce the issue with:
> https://gist.github.com/nqn/9b9b1de9123a6e836f54
> It launches many short-lived tasks (1 second sleep), and when the
> framework instance is killed, the master reports the tasks as running
> even after several minutes:
> http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
>
> When clicking on one of the slaves where, for example, task 49 runs, the
> slave knows that it completed:
> http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
>
> The tasks only finish when the framework connects again (which it may
> never do). This is on Mesos 0.20.0, but also applies to HEAD (as of today).
> Do you guys have any insights into what may be going on here? Is this
> by design or a bug?
>
> Thanks,
> Niklas
>
