[ 
https://issues.apache.org/jira/browse/MESOS-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1817:
------------------------------------------
    Description: 
We have run into a problem that causes tasks which complete while a framework 
is disconnected (with a failover timeout set) to remain in a running state even 
though the tasks actually finished. This hogs the cluster and gives users an 
inconsistent view of the cluster state: going to the slave, the task is 
finished; going to the master, the task is still in a non-terminal state. When 
the scheduler reattaches or the failover timeout expires, the tasks finish 
correctly. The current workflow of this scheduler has a long failover timeout, 
but may, on the other hand, never reattach.

Here is a test framework with which we have been able to reproduce the issue: 
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (1 second sleep), and when the framework 
instance is killed, the master reports the tasks as running even after several 
minutes: 
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

When clicking on one of the slaves where, for example, task 49 runs, the slave 
knows that it completed: 
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

Here is the log of a mesos-local instance where I reproduced it: 
https://gist.github.com/nqn/f7ee20601199d70787c0 (here tasks 10 to 19 are stuck 
in a running state).
There is a lot of output, so here is a filtered log for task 10: 
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d

The problem turns out to be an issue with the ack cycle of status updates: 
if the framework disconnects (with a failover timeout set), the status update 
manager on the slave will keep trying to send the front of the status update 
stream to the master (which in turn forwards it to the framework). If the first 
status update after the disconnect is terminal, things work out fine: the 
master picks the terminal state up, removes the task, and releases the 
resources. If, on the other hand, a non-terminal status is at the front of the 
stream, the master will never know that the task finished (or failed) before 
the framework reconnects.
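To make the failure mode concrete, here is a small toy model (not Mesos code; all names are illustrative) of the retry behavior described above: the slave only resends the front of the stream, and the stream only advances on a framework acknowledgement.

```python
from collections import deque

def master_view(stream, framework_connected):
    """Return the task state the master ends up with.

    The slave retries only the *front* of the status update stream; an
    update is popped only once the framework acknowledges it.  A
    disconnected framework never acknowledges, so the stream never
    advances past its front update.
    """
    stream = deque(stream)
    state = None
    while stream:
        state = stream[0]        # slave (re)sends the front; master records it
        if not framework_connected:
            break                # no ack will ever arrive; stream is stuck
        stream.popleft()         # ack received; advance to the next update
    return state

# Terminal update at the front: the master still learns the terminal state.
print(master_view(["TASK_FINISHED"], framework_connected=False))

# Non-terminal update at the front: the terminal update behind it never
# reaches the master, so the task looks running indefinitely.
print(master_view(["TASK_RUNNING", "TASK_FINISHED"], framework_connected=False))
```

This matches the observed behavior: only streams whose first post-disconnect update happens to be terminal resolve correctly.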

During a discussion on the dev mailing list 
(http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
 we enumerated a couple of options to solve this problem.

First off, having two ack cycles, one between masters and slaves and one 
between masters and frameworks, would be ideal: we would be able to replay the 
statuses in order while keeping the master state current. However, this 
requires us to persist the master state in replicated storage.

As a first pass, we can make sure that tasks caught in a running state don't 
hog the cluster when they have completed and the framework is disconnected.

Here is a proof-of-concept to work from: 
https://github.com/nqn/mesos/tree/niklas/status-update-disconnect/

A new (optional) field has been added to the internal status update message:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68

This makes it possible for the status update manager to set the field if the 
latest status was terminal: 
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501
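A rough sketch of the idea behind the proof-of-concept (again a toy model; the field name and message shape here are illustrative, not the actual Mesos protobuf): when forwarding the front of the stream, the slave also attaches a hint about the latest known state, so the master can release resources even while earlier updates await acknowledgement.

```python
# Terminal task states, mirroring the Mesos TaskState terminal values.
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def forward_front_update(stream):
    """Build the message the status update manager would forward: the
    front update, plus an optional latest-state hint when the newest
    update in the stream is terminal."""
    message = {"update": stream[0]}
    if stream[-1] in TERMINAL:
        # The new optional field: tells the master the task has in fact
        # reached a terminal state, even though non-terminal updates are
        # still at the front of the unacknowledged stream.
        message["latest_state"] = stream[-1]
    return message

# A stuck stream: TASK_RUNNING at the front, TASK_FINISHED behind it.
print(forward_front_update(["TASK_RUNNING", "TASK_FINISHED"]))
```

With the hint present, the master could release the task's resources immediately instead of waiting for the framework to reconnect and acknowledge the whole stream.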

I added a test which should highlight the issue as well:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478

I would love some input on the approach before moving on.
There are rough edges in the PoC which (of course) should be addressed before 
bringing it up for review.



> Completed tasks remains in TASK_RUNNING when framework is disconnected
> ----------------------------------------------------------------------
>
>                 Key: MESOS-1817
>                 URL: https://issues.apache.org/jira/browse/MESOS-1817
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Niklas Quarfot Nielsen
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
