[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655590#comment-15655590
 ] 

Megha edited comment on MESOS-6223 at 11/15/16 3:39 PM:
--------------------------------------------------------

[~neilc]
Here I am analyzing the impact of allowing an agent to recover after a reboot, 
in the context of partition awareness. As I understand it, this introduces no 
task state transition that does not already occur with partition awareness. Do 
you think there could be a risk in allowing recovery after reboots?

1. If there are no partition-aware frameworks on the agent: while rebooting, 
the agent could either be disconnected or fail the master's health check 
timeout. The executors don't re-register because they exited during the 
reboot. The agent re-registers and starts sending the unacknowledged status 
updates. From the framework's point of view the transition is simply 
TASK_STARTING/TASK_RUNNING -> TASK_LOST.

2. If there are tasks from partition-aware frameworks on the agent: 
    a. The transition is the same as above if the agent is disconnected.
    b. If the agent was marked unreachable while it was rebooting, then from 
the framework's point of view the tasks transition from TASK_UNREACHABLE -> 
TASK_GONE when the agent re-registers and sends status updates. Unreachable 
agents are stored in the registry, so the master remembers them across its 
failovers; if the agent doesn't come back, frameworks will receive a 
TASK_UNREACHABLE update upon reconciliation unless the registry is purged.
    c. If the agent is marked gone, the master sends TASK_GONE_BY_OPERATOR. If 
such an agent doesn't come back, future framework reconciliations will result 
in a TASK_UNKNOWN status update: there is no registry of gone agents, so they 
are not remembered across master failovers. If the agent eventually comes 
back, the task could transition from TASK_UNKNOWN back to TASK_GONE.
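
The cases above can be summarized as a small lookup. The Python sketch below 
only restates the analysis in this comment; it is not Mesos code. The names 
AgentState, transition_on_reregistration and reconciliation_status are 
hypothetical, and only the TASK_* state names come from Mesos.

```python
from enum import Enum
from typing import Optional, Tuple


class AgentState(Enum):
    """How the master regarded the agent during the reboot (hypothetical names)."""
    DISCONNECTED = "disconnected"
    UNREACHABLE = "unreachable"
    GONE = "gone"


def transition_on_reregistration(partition_aware: bool,
                                 agent_state: AgentState) -> Tuple[str, str]:
    """Return (status seen before, status seen after) the agent re-registers.

    This mirrors cases 1, 2a, 2b and 2c of the comment, nothing more.
    """
    # Case 1: no partition-aware frameworks, regardless of agent state.
    if not partition_aware:
        return ("TASK_STARTING/TASK_RUNNING", "TASK_LOST")
    # Case 2a: partition-aware framework, agent was only disconnected.
    if agent_state is AgentState.DISCONNECTED:
        return ("TASK_STARTING/TASK_RUNNING", "TASK_LOST")
    # Case 2b: agent was marked unreachable while rebooting.
    if agent_state is AgentState.UNREACHABLE:
        return ("TASK_UNREACHABLE", "TASK_GONE")
    # Case 2c: agent was marked gone and a later reconciliation
    # (after a master failover) had already answered TASK_UNKNOWN.
    return ("TASK_UNKNOWN", "TASK_GONE")


def reconciliation_status(agent_state: AgentState,
                          registry_purged: bool = False) -> Optional[str]:
    """Status a partition-aware framework gets on reconciliation if the
    agent never comes back. None means the comment does not say."""
    if agent_state is AgentState.UNREACHABLE:
        # Unreachable agents are kept in the registry until it is purged.
        return None if registry_purged else "TASK_UNREACHABLE"
    if agent_state is AgentState.GONE:
        # Gone agents are not remembered across master failovers.
        return "TASK_UNKNOWN"
    return None


if __name__ == "__main__":
    for state in AgentState:
        print(state.value, transition_on_reregistration(True, state))
```

In every case the framework ends in a state it can already observe today 
(TASK_LOST or TASK_GONE), which is the point of the analysis.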



was (Author: megha.sharma):
[~neilc]
Here I am analyzing the impact of allowing an agent to recover after a reboot, 
in the context of partition awareness. As I understand it, this introduces no 
task state transition that does not already occur with partition awareness. Do 
you think there could be a risk in allowing recovery after reboots?

1. If there are no partition-aware frameworks on the agent: while rebooting, 
the agent could either be disconnected or fail the master's health check 
timeout. The executors don't re-register because they exited during the 
reboot. The agent re-registers and starts sending the unacknowledged status 
updates. From the framework's point of view the transition is simply 
TASK_STARTING/TASK_RUNNING -> TASK_LOST.

2. If there are tasks from partition-aware frameworks on the agent: 
    a. The transition is the same as above if the agent is disconnected.
    b. If the agent was marked unreachable while it was rebooting, then from 
the framework's point of view the tasks transition from TASK_UNREACHABLE -> 
TASK_GONE when the agent re-registers and sends status updates. Unreachable 
agents are stored in the registry, so the master remembers them across its 
failovers; if the agent doesn't come back, frameworks will receive a 
TASK_UNREACHABLE update upon reconciliation unless the registry is purged.
    c. If the agent is marked gone, the master sends TASK_GONE. If such an 
agent doesn't come back, future framework reconciliations will result in a 
TASK_UNKNOWN status update: there is no registry of gone agents, so they are 
not remembered across master failovers. If the agent eventually comes back, 
the task could transition from TASK_UNKNOWN back to TASK_GONE.


> Allow agents to re-register post a host reboot
> ----------------------------------------------
>
>                 Key: MESOS-6223
>                 URL: https://issues.apache.org/jira/browse/MESOS-6223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: slave
>            Reporter: Megha
>            Assignee: Megha
>
> An agent doesn't recover its state after a host reboot; it registers with 
> the master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent anyway when it reboots, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
