[
https://issues.apache.org/jira/browse/MESOS-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805461#comment-16805461
]
Greg Mann commented on MESOS-9667:
----------------------------------
[~chhsia0], regarding the third item in your list, I do think that we should
eliminate the {{publishResources()}} call in {{Slave::subscribe()}}. I think
it's fine if we end up failing task launches because an RP has not yet
subscribed with the agent. This will be an issue in any case, since a
{{RunTaskMessage}} that includes resources from an RP that is not currently
subscribed could be received at any time (perhaps the RP just recently
disconnected).
The above suggests one argument in favor of blocking executor reregistration
until agent recovery is complete: this would allow the RPs more time to
resubscribe, reducing the chances of an executor submitting a launch call for a
task which uses resources from an unsubscribed RP.
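To illustrate the "fail the launch instead of crashing" behavior described above, here is a minimal, standalone sketch. It is not Mesos code: {{ProviderRegistry}}, {{PublishResult}} and {{tryPublish()}} are hypothetical names invented for this example, and the real agent tracks resource providers differently. The only point is that an unsubscribed RP produces an error the caller can turn into a failed task launch, rather than a fatal check.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the agent's view of which resource
// providers (RPs) are currently subscribed. Not a Mesos type.
struct ProviderRegistry {
  std::set<std::string> subscribed;
};

// Hypothetical result type: `ok` is false when publishing cannot proceed,
// and `error` then explains why.
struct PublishResult {
  bool ok;
  std::string error;
};

// Checks that every RP a task depends on is subscribed. Instead of
// aborting the process, it reports the first missing RP so the caller
// can fail the task launch.
PublishResult tryPublish(
    const ProviderRegistry& registry,
    const std::vector<std::string>& providerIds)
{
  for (const std::string& id : providerIds) {
    if (registry.subscribed.count(id) == 0) {
      return PublishResult{
          false, "resource provider '" + id + "' is not subscribed"};
    }
  }
  return PublishResult{true, ""};
}
```

A caller would inspect {{ok}} and, on failure, reject the launch with {{error}} as the reason.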
> Check failure when executor for task using resource provider resources
> subscribes before agent is registered
> ------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-9667
> URL: https://issues.apache.org/jira/browse/MESOS-9667
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 1.8.0
> Reporter: Benjamin Bannier
> Priority: Blocker
> Labels: foundations, mesosphere, mesosphere-dss-ga
>
> When an executor for a task using resource provider resources subscribes
> before the agent has registered with the master, we trigger a fatal assertion,
> {code:java}
> Mar 21 13:42:47 agent1 mesos-agent[17277]: F0321 13:42:46.845535 17295 slave.cpp:8834] Check failed: 'resourceProviderManager.get()' Must be non NULL
> Mar 21 13:42:47 agent1 mesos-agent[17277]: *** Check failure stack trace: ***
> {code}
> The reason for this failure is that we attempt to publish resources to the
> resource provider via the resource provider manager, but the resource
> provider manager is only created once the agent has registered with the
> master.
> As a workaround one can terminate the executors and their tasks, and let the
> framework relaunch the tasks (provided it supports that).
> Another possible workaround could be to prevent such executors from
> subscribing until the resource provider manager is available.
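The idea of holding back executor subscriptions until the resource provider manager is available can be sketched as a small gate that queues work until a readiness signal arrives. This is a standalone illustration under stated assumptions, not the agent's actual mechanism: {{SubscriptionGate}}, {{submit()}} and {{onManagerAvailable()}} are hypothetical names, and the real agent is built on libprocess actors rather than plain callbacks.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical gate that defers executor subscription handlers until
// the resource provider manager has been created.
class SubscriptionGate {
public:
  // Runs `subscribe` immediately if the manager is available;
  // otherwise queues it for later.
  void submit(std::function<void()> subscribe) {
    if (managerAvailable_) {
      subscribe();
    } else {
      pending_.push_back(std::move(subscribe));
    }
  }

  // Called once the manager exists. Marks the gate open and runs all
  // queued handlers in the order they were submitted.
  void onManagerAvailable() {
    managerAvailable_ = true;
    std::vector<std::function<void()>> ready;
    ready.swap(pending_);
    for (auto& subscribe : ready) {
      subscribe();
    }
  }

  // Number of handlers still waiting for the manager.
  std::size_t queued() const { return pending_.size(); }

private:
  bool managerAvailable_ = false;
  std::vector<std::function<void()>> pending_;
};
```

With a gate like this, a subscription that arrives early is processed only after the manager is created, so the publish step never sees a missing manager.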