-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review182310
-----------------------------------------------------------



We talked about two approaches, and approach 2 seemed like the cleaner way to
address the issue.

Approach 1: ShutdownFrameworkMessage

1. Upon agent re-registration, the master will add tasks even for non-PA
frameworks on this agent. The master needs this to do correct resource
accounting and to avoid offering resources that are already in use on this
agent. We need to mutate the TaskState on each Task before adding it to the
master's data structures, since the TaskState might be non-terminal when the
agent sends these tasks in the ReregisterSlaveMessage. The master has already
sent TASK_LOST for these tasks to the frameworks, so we need to set the
TaskState to TASK_LOST so that any future reconciliation with the framework
doesn't show the task transitioning from TASK_LOST to
TASK_RUNNING/TASK_STARTING. This avoids unnecessary confusion about task state
as observed by the framework, although the same thing could already happen with
a non-strict registry, where the framework can receive a non-terminal status
update after having received a TASK_LOST for the same task.

2. When the agent re-registers, the master will continue to send a 
ShutdownFrameworkMessage to the agent to kill the tasks pertaining to non-PA 
frameworks on the agent as it does today. An additional optional field will be 
added to the ShutdownFrameworkMessage to indicate whether or not the shutdown 
was initiated internally.

3. During framework shutdown, the state of the framework is set to
Framework::TERMINATING, which prevents it from launching new tasks. Here the
framework is not really terminating, so to allow it to launch new tasks the
agent will not set the state to TERMINATING if the ShutdownFrameworkMessage
was generated internally.

4. Framework shutdown today doesn't generate any status updates, which needs
to change. Status updates will be sent if the framework shutdown is triggered
internally; this is needed to remove the tasks of non-PA frameworks that were
added when the agent re-registered (see 1).
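
To make the four steps of Approach 1 concrete, here is a minimal,
self-contained C++ sketch. It does not use the real Mesos classes or protobuf
messages: Task, ShutdownFrameworkMessage, the `internal` field name and all
helper functions below are hypothetical stand-ins. It also assumes that only
non-terminal states are overwritten with TASK_LOST.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins for illustration only.
// These are NOT the real Mesos types or protobuf messages.
enum class TaskState { STAGING, STARTING, RUNNING, FINISHED, LOST };

struct Task {
  std::string id;
  TaskState state;
  double cpus;  // Resources the task still holds on the agent.
};

bool isTerminal(TaskState state) {
  return state == TaskState::FINISHED || state == TaskState::LOST;
}

// Step 1: before the master stores a task reported by a re-registering
// agent, force a non-terminal state to LOST for non-PA frameworks, because
// the framework has already been told TASK_LOST. The resources are left
// untouched so that accounting on the agent stays correct.
Task prepareForMaster(Task task, bool partitionAware) {
  if (!partitionAware && !isTerminal(task.state)) {
    task.state = TaskState::LOST;
  }
  return task;
}

// Step 2: the shutdown message with the proposed optional field.
// The field name `internal` is an assumption made for this sketch.
struct ShutdownFrameworkMessage {
  std::string frameworkId;
  bool internal = false;
};

enum class FrameworkState { RUNNING, TERMINATING };

// Step 3: the agent only marks the framework as TERMINATING when the
// shutdown was not generated internally, so the framework can still
// launch new tasks after an internal shutdown.
FrameworkState stateAfterShutdown(const ShutdownFrameworkMessage& message) {
  return message.internal ? FrameworkState::RUNNING
                          : FrameworkState::TERMINATING;
}

// Step 4: status updates for the killed tasks are only generated for
// internally triggered shutdowns, so the master can remove the tasks it
// added in step 1.
bool shouldSendStatusUpdates(const ShutdownFrameworkMessage& message) {
  return message.internal;
}
```

The sketch keeps the resource bookkeeping (cpus) separate from the reported
state, which is the point of step 1: the master still sees the resources as
in use while the framework never observes the task leaving TASK_LOST.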

Approach 2: Do not shut down non-PA frameworks when the agent re-registers, and
let the frameworks decide what to do when they receive non-terminal status
updates for tasks for which they have already received a TASK_LOST. This
hopefully won't break any frameworks, since it could already have happened
with a non-strict registry, and frameworks should be resilient enough to
handle this scenario.
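
As an illustration of what "let the frameworks make the decision" could look
like on the scheduler side, here is a small C++ sketch. It is not the Mesos
scheduler API: TaskTracker, Action and the re-adopt/kill policy flag are all
hypothetical names chosen for this example. A real framework would plug
equivalent logic into its own status update handler.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified types for illustration only; this is not the
// Mesos scheduler API.
enum class TaskState {
  STAGING, STARTING, RUNNING, FINISHED, FAILED, KILLED, LOST
};

// What the scheduler decides to do with an incoming status update.
enum class Action { RECORD, READOPT, KILL };

bool isTerminal(TaskState state) {
  switch (state) {
    case TaskState::FINISHED:
    case TaskState::FAILED:
    case TaskState::KILLED:
    case TaskState::LOST:
      return true;
    default:
      return false;
  }
}

class TaskTracker {
 public:
  // readoptLostTasks: policy chosen by the framework for tasks that
  // reappear after TASK_LOST (true = keep them, false = ask to kill them).
  explicit TaskTracker(bool readoptLostTasks)
    : readoptLostTasks_(readoptLostTasks) {}

  Action onStatusUpdate(const std::string& taskId, TaskState incoming) {
    auto it = states_.find(taskId);

    // A task "comes back" when we recorded LOST for it earlier and now
    // receive a non-terminal state for the same task.
    const bool cameBack = it != states_.end() &&
                          it->second == TaskState::LOST &&
                          !isTerminal(incoming);

    if (!cameBack) {
      states_[taskId] = incoming;
      return Action::RECORD;
    }

    if (readoptLostTasks_) {
      it->second = incoming;  // Treat the task as alive again.
      return Action::READOPT;
    }

    // Keep the recorded LOST state; the caller should request a kill.
    return Action::KILL;
  }

  TaskState stateOf(const std::string& taskId) const {
    return states_.at(taskId);
  }

 private:
  bool readoptLostTasks_;
  std::map<std::string, TaskState> states_;
};
```

Either policy leaves the framework in a consistent state: with re-adoption the
recorded state follows the update, and with the kill policy the task stays
recorded as LOST until a terminal update for it arrives.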

Let me know if I have missed anything.

- Megha Sharma


On Aug. 7, 2017, 6:23 p.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> -----------------------------------------------------------
> 
> (Updated Aug. 7, 2017, 6:23 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
>     https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Master will not kill the tasks of non-partition-aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send ShutdownFrameworkMessages to the agent
> to kill the tasks of non-partition-aware frameworks, including the
> ones that are still registered. This was problematic because offers
> from this agent could still go to the same framework, which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down, of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent re-registration, which could happen much later.
> This commit fixes the problem by not shutting down the
> non-partition-aware frameworks on such an agent.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
>   src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
>   src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
>   src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>
