[ 
https://issues.apache.org/jira/browse/MESOS-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805365#comment-16805365
 ] 

Chun-Hung Hsiao edited comment on MESOS-9667 at 3/29/19 8:06 PM:
-----------------------------------------------------------------

Let's consider the following scenario:

 # The agent receives {{RunTaskGroupMessage}} with two tasks {{foo}} and 
{{bar}} using RP resources, and launches an executor.
 # Upon executor subscription, the agent performs the following steps to launch 
the tasks:
    2.1 Publish all resources *allocated to the executor* (i.e., including 
queued tasks).
    2.2 Ask the containerizer to update the executor container with all 
resources allocated to the executor.
    2.3 Send a {{LAUNCH_GROUP}} event containing tasks {{foo}} and {{bar}} to 
the executor.
 # The executor launches task {{foo}} through {{LAUNCH_NESTED_CONTAINER}}.
 # The agent receives {{TASK_STARTING}} for {{foo}} and dequeues the task.
 # The executor launches task {{bar}} through {{LAUNCH_NESTED_CONTAINER}}, 
which races with an agent failover.
 # The agent restarts and receives an executor resubscription.
 # Upon executor resubscription, the agent "recovers" the executor through the 
following steps:
    7.1 Publish all resources *allocated to the executor*.
    7.2 Ask the containerizer to update the executor container with all 
resources allocated to the executor.
    7.3 Send a {{LAUNCH_GROUP}} event containing the pending task {{bar}} to 
the executor.
 # The executor launches task {{bar}} through {{LAUNCH_NESTED_CONTAINER}}.

The problem described in this ticket is that Step 7.1 would crash if the agent 
hasn't reregistered yet (and thus the RP manager is not initialized). However, 
the actual problem to me is broader than just RP manager initialization. 
Essentially, there will be a period of time before the RP resubscribes during 
which _the resources allocated to the executor are not contained in the agent's 
total resources!_
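
To make that window concrete, here is a rough Python sketch of Steps 7.1 to 7.3. It is a toy model, not the agent code: {{ToyAgent}}, {{ToyRPManager}}, {{CheckFailure}} and all method names are hypothetical, and resources are modeled as plain sets of strings.

```python
class CheckFailure(Exception):
    """Models the fatal CHECK in the agent (hypothetical name)."""


class ToyRPManager:
    """Hypothetical stand-in for the resource provider manager."""

    def __init__(self):
        self.subscribed = set()  # resources of RPs that have resubscribed

    def publish(self, resources):
        missing = set(resources) - self.subscribed
        if missing:
            raise RuntimeError(f"not in total resources: {sorted(missing)}")


class ToyAgent:
    """Hypothetical model of the executor resubscription path."""

    def __init__(self, agent_resources):
        self.agent_resources = set(agent_resources)
        self.rp_manager = None  # assumption: created only after reregistration

    def total_resources(self):
        rp = self.rp_manager.subscribed if self.rp_manager else set()
        return self.agent_resources | rp

    def on_executor_resubscribe(self, allocated, pending_tasks):
        # Step 7.1: publish all RP resources allocated to the executor.
        if self.rp_manager is None:
            raise CheckFailure("resourceProviderManager must be non-NULL")
        self.rp_manager.publish(set(allocated) - self.agent_resources)
        # Step 7.2 (container update) is omitted here.
        # Step 7.3: send the pending tasks back to the executor.
        return list(pending_tasks)
```

In this model, resubscribing before {{rp_manager}} exists raises {{CheckFailure}} (the crash in this ticket). Creating the manager early removes that failure, but {{publish}} still fails until the RP has resubscribed, because the allocated resources are not yet part of {{total_resources()}}.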

We have a few options here:

* *Initialize the RP manager as early as possible*
  If we initialize the RP manager when the agent recovers its ID from the 
checkpointed state, the CHECK failure would be gone. But if the executor 
resubscribes before the RP does, the agent would fail the executor and 
transition all tasks, including the running task {{foo}}, to {{TASK_GONE}}. The 
race in Step 5 should not be a problem, but we need to verify that.

* *Block executor reregistration until agent recovery*
  This is basically similar to the above option, but I'm not sure if there's 
any concern w.r.t. agent recovery. IMO this is inferior to the above option.

* *Publish resources when receiving {{RunTaskGroupMessage}} and remove Steps 
2.1 and 7.1*
  The idea is that we can have all resources ready before the executor knows 
about the tasks, so no resource publishing is required afterward, even after 
the agent restarts.
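
As a rough illustration of the last option, the following Python sketch publishes once when the task group arrives and never on (re)subscription. It is only a sketch under assumptions: {{ToyAgent}}, its methods, and the {{publish}} callable are hypothetical names, and recovery of the queued tasks from checkpointed state after a restart is not modeled.

```python
class ToyAgent:
    """Hypothetical model: publish up front, not on (re)subscription."""

    def __init__(self, publish):
        self.publish = publish  # callable standing in for resource publishing
        self.queued = {}        # executor id -> tasks not yet started

    def on_run_task_group(self, executor_id, tasks, resources):
        # Publish before the executor learns about any task.
        self.publish(resources)
        self.queued.setdefault(executor_id, []).extend(tasks)

    def on_executor_subscribe(self, executor_id):
        # Steps 2.1 and 7.1 are gone: nothing is published here.
        # Only the container update (omitted) and LAUNCH_GROUP remain.
        return list(self.queued.get(executor_id, []))

    def on_task_starting(self, executor_id, task):
        # Mirrors Step 4: a task that reported TASK_STARTING is dequeued.
        self.queued[executor_id].remove(task)
```

With this shape, a resubscription after {{foo}} has started only hands {{bar}} back to the executor and does not touch the RP manager at all.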

Also, judging from the master code, it seems okay if we recover allocated 
resources in the master before RP subscriptions: resources not in an agent's 
total resources won't be considered available and thus won't be offered.
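
A toy Python illustration of that observation, assuming (as a simplification for illustration, not the master's actual allocator logic) that offerable resources are the total minus the allocated quantities and never go below zero. The function name {{offerable}} and the resource names are hypothetical.

```python
from collections import Counter


def offerable(total, allocated):
    """Total minus allocated, keeping only positive quantities."""
    # Counter subtraction drops zero and negative counts, so an allocation
    # that is not backed by the total contributes nothing to the result.
    return Counter(total) - Counter(allocated)
```

Under this assumption, {{offerable({"cpus": 4}, {"cpus": 1, "rp-disk": 10})}} contains only {{cpus}}: the RP disk that is allocated but not yet in the total never shows up as available.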


was (Author: chhsia0):
Let's consider the following scenario:

 # The agent receives {{RunTaskGroupMessage}} with two tasks {{foo}} and 
{{bar}} using RP resources, and launches an executor.
 # Upon executor subscription, the agent performs the following steps to launch 
the tasks:
    2.1 Publish all resources *allocated to the executor* (i.e., including 
queued tasks).
    2.2 Ask the containerizer to update the executor container with all 
resources allocated to the executor.
    2.3 Send a {{LAUNCH_GROUP}} event containing tasks {{foo}} and {{bar}} to 
the executor.
 # The executor launches task {{foo}} through {{LAUNCH_NESTED_CONTAINER}}.
 # The agent receives {{TASK_STARTING}} for {{foo}} and dequeues the task.
 # The executor launches task {{bar}} through {{LAUNCH_NESTED_CONTAINER}}, 
which races with an agent failover.
 # The agent restarts and receives an executor resubscription.
 # Upon executor resubscription, the agent "recovers" the executor through the 
following steps:
    7.1 Publish all resources *allocated to the executor*.
    7.2 Ask the containerizer to update the executor container with all 
resources allocated to the executor.
    7.3 Send a {{LAUNCH_GROUP}} event containing the pending task {{bar}} to 
the executor.
 # The executor launches task {{bar}} through {{LAUNCH_NESTED_CONTAINER}}.

The problem described in this ticket is that Step 7.1 would crash if the agent 
hasn't reregistered yet (and thus the RP manager is not initialized). However, 
the actual problem to me is broader than just RP manager initialization. 
Essentially, there will be a period of time before the RP resubscribes during 
which _the resources allocated to the executor are not contained in the agent's 
total resources!_

We have a couple options here:

* *Initialize the RP manager as early as possible*
  Say if we initialize the RP manager when the agent recovers its ID from the 
checkpointed state, the CHECK failure would be gone. But if the executor 
resubscribes before the RP does, the agent would fail the executor and 
transition all tasks, including the running task {{foo}}, to {{TASK_GONE}}. The 
race in Step 5 should not be a problem, but we need to verify that.

* *Block executor reregistration until agent recovery*
  This is basically similar to the above option, but I'm not sure if there's 
any concern w.r.t. agent recovery. IMO this is inferior to the above option.

* *Publish allocated resources before Steps 2.3 and 7.3 and remove Steps 2.1 and 
7.1*
  The idea here is that since task {{foo}} is already running, the RP resources 
must have been ready, so it's really not necessary to publish the resources 
again. Only task {{bar}} would fail if the RP is not subscribed before Step 
7.3. Here we could either fail the resource publishing if the RP manager is not 
ready in Step 7.3, or initialize the RP manager early.

Also, judging from the master code, it seems okay if we recover allocated 
resources in the master before RP subscriptions: resources not in an agent's 
total resources won't be considered available and thus won't be offered.

> Check failure when executor for task using resource provider resources 
> subscribes before agent is registered
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9667
>                 URL: https://issues.apache.org/jira/browse/MESOS-9667
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.8.0
>            Reporter: Benjamin Bannier
>            Priority: Blocker
>              Labels: foundations, mesosphere, mesosphere-dss-ga
>
> When an executor for a task using resource provider resources subscribes 
> before the agent has registered with the master, we trigger a fatal assertion,
> {code:java}
> Mar 21 13:42:47 agent1 mesos-agent[17277]: F0321 13:42:46.845535 17295 
> slave.cpp:8834] Check failed: 'resourceProviderManager.get()' Must be non NULL
> Mar 21 13:42:47 agent1 mesos-agent[17277]: *** Check failure stack trace: 
> *{code}
> The reason for this failure is that we attempt to publish resources to the 
> resource provider via the resource provider manager, but the resource 
> provider manager is only created once the agent has registered with the 
> master.
> As a workaround one can terminate the executors and their tasks, and let the 
> framework relaunch the tasks (provided it supports that).
> A possible workaround could be to prevent such executors from subscribing 
> until the resource provider manager is available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
