[
https://issues.apache.org/jira/browse/MESOS-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805529#comment-16805529
]
Chun-Hung Hsiao commented on MESOS-9667:
----------------------------------------
Some more thoughts after discussing with [~vinodkone] and [~greggomann]:
1. Initialize the RP manager as early as possible.
2. Maybe we can consider changing the {{publishResources}} call here:
https://github.com/apache/mesos/blob/7c8a9a9218b5b3a9a2acbf8c10899355773377ef/src/slave/slave.cpp#L5027
so that it only runs for a fresh executor launch. We could either check
whether {{queuedTasks}} is nonempty, or check whether the slave is in the
recovery state.
3. Or, refactor {{onUnscheduleGCFailure}} here:
https://github.com/apache/mesos/blob/7c8a9a9218b5b3a9a2acbf8c10899355773377ef/src/slave/slave.cpp#L2171
to handle task status updates for general failures, then insert
{{publishResources}} right after {{unschedule}}, and remove the
{{publishResources}} calls elsewhere.
We can see which of 2 and 3 leads to cleaner code.
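To make option 2 concrete, here is a minimal standalone sketch of the guard. It is not Mesos code: {{ToyAgent}}, {{ToyExecutor}}, {{ToyResourceProviderManager}}, {{shouldPublish}} and {{onExecutorSubscribed}} are hypothetical names made up for illustration, and the sketch combines both proposed checks (nonempty {{queuedTasks}} and not in recovery) even though either one alone might be enough in the real agent.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the resource provider manager (not the Mesos class).
struct ToyResourceProviderManager {
  std::vector<std::string> published;

  void publish(const std::string& resources) { published.push_back(resources); }
};

// Hypothetical stand-in for the agent's per-executor bookkeeping.
struct ToyExecutor {
  std::vector<std::string> queuedTasks;  // Tasks waiting for the executor to subscribe.
  std::string resources;                 // Resources the executor's tasks use.
};

// Hypothetical stand-in for the agent.
struct ToyAgent {
  bool recovering = false;

  // Stays null until the agent has registered, as in the reported failure.
  std::unique_ptr<ToyResourceProviderManager> resourceProviderManager;

  // Assumption for this sketch: a launch is "fresh" when the agent is not
  // recovering and the executor still has queued tasks.
  bool shouldPublish(const ToyExecutor& executor) const {
    return !recovering && !executor.queuedTasks.empty();
  }

  // Returns true if resources were published for this subscription.
  bool onExecutorSubscribed(const ToyExecutor& executor) {
    if (!shouldPublish(executor)) {
      // Re-subscription of an already running executor: skip publishing, so
      // the (possibly still null) manager is never dereferenced.
      return false;
    }

    // Plays the role of the non-null check that fails in the report.
    assert(resourceProviderManager != nullptr);
    resourceProviderManager->publish(executor.resources);
    return true;
  }
};
```

With a guard like this, an executor that re-subscribes before the agent has registered would skip publishing entirely, while a fresh launch would still publish as before. Whether this or option 3 is cleaner depends on the real call sites.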
> Check failure when executor for task using resource provider resources
> subscribes before agent is registered
> ------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-9667
> URL: https://issues.apache.org/jira/browse/MESOS-9667
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 1.8.0
> Reporter: Benjamin Bannier
> Assignee: Benjamin Bannier
> Priority: Blocker
> Labels: foundations, mesosphere, mesosphere-dss-ga
>
> When an executor for a task using resource provider resources subscribes
> before the agent has registered with the master, we trigger a fatal assertion,
> {code:java}
> Mar 21 13:42:47 agent1 mesos-agent[17277]: F0321 13:42:46.845535 17295
> slave.cpp:8834] Check failed: 'resourceProviderManager.get()' Must be non NULL
> Mar 21 13:42:47 agent1 mesos-agent[17277]: *** Check failure stack trace:
> *{code}
> The reason for this failure is that we attempt to publish resources to the
> resource provider via the resource provider manager, but the resource
> provider manager is only created once the agent has registered with the
> master.
> As a workaround one can terminate the executors and their tasks, and let the
> framework relaunch the tasks (provided it supports that).
> A possible workaround could be to prevent such executors from subscribing
> until the resource provider manager is available.