Niklas Quarfot Nielsen created MESOS-1817:
---------------------------------------------
Summary: Completed tasks remain in TASK_RUNNING when framework is
disconnected
Key: MESOS-1817
URL: https://issues.apache.org/jira/browse/MESOS-1817
Project: Mesos
Issue Type: Bug
Reporter: Niklas Quarfot Nielsen
We have run into a problem that causes tasks which complete, while a framework
is disconnected and has a failover time, to remain in a running state even
though the tasks actually finish. This hogs the cluster and gives users an
inconsistent view of the cluster state. Going to the slave, the task is
finished. Going to the master, the task is still in a non-terminal state. When
the scheduler reattaches or the failover timeout expires, the tasks finish
correctly. The current workflow of this scheduler has a long failover timeout,
but may on the other hand never reattach.
Here is a test framework we have been able to reproduce the issue with:
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (1 second sleep) and when killing the
framework instance, the master reports the tasks as running even after several
minutes:
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
When clicking on one of the slaves where, for example, task 49 ran, the slave
knows that it completed:
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
Here is the log of a mesos-local instance where I reproduced it:
https://gist.github.com/nqn/f7ee20601199d70787c0 (here tasks 10 to 19 are
stuck in a running state).
There is a lot of output, so here is a filtered log for task 10:
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
The problem turns out to be an issue with the acknowledgement cycle of status
updates: if the framework disconnects (with a failover timeout set), the
status update manager on the slaves will keep trying to send the front of the
status update stream to the master (which in turn forwards it to the
framework). If the first status update after the disconnect is terminal,
things work out fine; the master picks the terminal state up, removes the task
and releases the resources.
If, on the other hand, a non-terminal status is at the front of the stream,
the master will never know that the task finished (or failed) before the
framework reconnects.
During a discussion on the dev mailing list
(http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
we enumerated a couple of options to solve this problem.
First off, having two acknowledgement cycles, one between masters and slaves
and one between masters and frameworks, would be ideal. We would be able to
replay the statuses in order while keeping the master state current. However,
this requires us to persist the master state in replicated storage.
As a first pass, we can make sure that tasks caught in a running state don't
hog the cluster when they have completed while the framework is disconnected.
Here is a proof-of-concept to work from:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/
A new (optional) field has been added to the internal status update message:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68
which makes it possible for the status update manager to set the field if the
latest status was terminal:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501
I added a test which should highlight the issue as well:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478
I would love some input on the approach before moving on.
There are rough edges in the PoC which (of course) should be addressed before
bringing it up for review.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)