[
https://issues.apache.org/jira/browse/MESOS-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Niklas Quarfot Nielsen updated MESOS-1817:
------------------------------------------
Description:
We have run into a problem that causes tasks which complete while a framework
is disconnected (and has a failover timeout set) to remain in a running state,
even though the tasks have actually finished. This hogs the cluster and gives
users an inconsistent view of the cluster state: going to the slave, the task
is finished; going to the master, the task is still in a non-terminal state.
When the scheduler reattaches or the failover timeout expires, the tasks
finish correctly. The current workflow of this scheduler has a long failover
timeout, but may, on the other hand, never reattach.
Here is a test framework we have been able to reproduce the issue with:
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (a 1-second sleep each), and when the
framework instance is killed, the master reports the tasks as running even
after several minutes:
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
When clicking on one of the slaves where, for example, task 49 ran, the slave
knows that it completed:
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
Here is the log of a mesos-local instance where I reproduced it:
https://gist.github.com/nqn/f7ee20601199d70787c0 (here, tasks 10 to 19 are
stuck in the running state).
There is a lot of output, so here is a filtered log for task 10:
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
The problem turns out to be an issue with the acknowledgement cycle of status
updates: if the framework disconnects (with a failover timeout set), the
status update manager on the slave will keep trying to send the front of the
status update stream to the master (which, in turn, forwards it to the
framework). If the first status update after the disconnect is terminal,
things work out fine: the master picks the terminal state up, removes the
task, and releases the resources.
If, on the other hand, a non-terminal status update sits at the front of the
stream, the master will never know that the task finished (or failed) before
the framework reconnects.
During a discussion on the dev mailing list
(http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
we enumerated a couple of options to solve this problem.
First off, having two acknowledgement cycles, one between masters and slaves
and one between masters and frameworks, would be ideal. We would be able to
replay the statuses in order while keeping the master state current. However,
this requires us to persist the master state in replicated storage.
As a first pass, we can make sure that tasks caught in a running state don't
hog the cluster when they have completed while the framework is disconnected.
Here is a proof-of-concept to work out of:
https://github.com/nqn/mesos/tree/niklas/status-update-disconnect/
A new (optional) field has been added to the internal status update message:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68
This makes it possible for the status update manager to set the field if the
latest status was terminal:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501
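The shape of such a change would be roughly as follows. This is a hypothetical
sketch only; the actual field name, type, and tag number are whatever the
linked messages.proto defines:

```protobuf
// Hypothetical sketch of the internal status update message extension.
// Field names and numbers here are illustrative, not the actual patch.
message StatusUpdateMessage {
  required StatusUpdate update = 1;

  // Set by the slave's status update manager when the latest status in the
  // task's stream is terminal, so the master can release the task's
  // resources even while the (non-terminal) front of the stream is still
  // being retried.
  optional bool latest_state_terminal = 2;
}
```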
I added a test which should highlight the issue as well:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478
I would love some input on the approach before moving on.
There are rough edges in the PoC which (of course) should be addressed before
bringing it up for review.
> Completed tasks remains in TASK_RUNNING when framework is disconnected
> ----------------------------------------------------------------------
>
> Key: MESOS-1817
> URL: https://issues.apache.org/jira/browse/MESOS-1817
> Project: Mesos
> Issue Type: Bug
> Reporter: Niklas Quarfot Nielsen
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)