[
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695138#comment-16695138
]
Chun-Hung Hsiao edited comment on MESOS-9223 at 11/21/18 7:18 PM:
------------------------------------------------------------------
There are two problems here:
1. How to make the agent more robust when handling SLRP failures.
2. How to surface the SLRP failures.
It seems to me 2 can be addressed by MESOS-8380.
Thought dumps for 1:
We can make the {{LocalResourceProviderDaemon}} act like systemd: retry
launching the SLRP when there is a launch failure, potentially with an
exponential backoff.
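To make the backoff idea concrete, here is a minimal standalone sketch of a capped exponential delay schedule. The helper name {{backoffDelay}}, the 500 ms base, and the 60 s cap are all assumptions made for illustration; none of them exist in Mesos, and real code would more likely use libprocess durations rather than {{std::chrono}}.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper (not a Mesos API): returns the delay to wait before
// retry number `attempt` (0-based). The delay starts at `base`, doubles on
// every further attempt, and never exceeds `cap`. The default values are
// illustrative assumptions only.
std::chrono::milliseconds backoffDelay(
    unsigned attempt,
    std::chrono::milliseconds base = std::chrono::milliseconds(500),
    std::chrono::milliseconds cap = std::chrono::seconds(60))
{
  std::chrono::milliseconds delay = base;

  // Stop doubling once the cap is reached so that the value cannot grow
  // without bound for large attempt numbers.
  for (unsigned i = 0; i < attempt && delay < cap; ++i) {
    delay *= 2;
  }

  return std::min(delay, cap);
}
```

With these assumed defaults the first retries would wait 500 ms, 1 s, 2 s, 4 s, and so on, and every retry from the eighth onward would wait the full 60 s.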
To achieve this, we'll need a way for the daemon to monitor SLRP failures.
Since the daemon is not aware of the resource provider manager (and should not
be aware of it for low coupling),
we could add the following virtual method to {{LocalResourceProvider}},
similar to {{ContainerDaemon::wait}}:
{noformat}
// Returns a future that only reaches a terminal state when a local resource
// provider is terminated. This is intended to capture any fatal error
// encountered by the resource provider.
virtual process::Future<Nothing> wait() = 0;
{noformat}
The daemon can then launch a new SLRP instance whenever {{wait}} returns a
failed future.
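A rough standalone sketch of that retry loop follows. It is deliberately simplified so that it compiles without libprocess: {{Termination}} stands in for a terminated {{process::Future<Nothing>}}, {{wait}} blocks instead of returning a future, and {{Provider}}, {{supervise}}, {{SuperviseResult}}, and {{ScriptedProvider}} are hypothetical names that do not exist in Mesos. An actual implementation would presumably chain continuations on the future rather than block.

```cpp
#include <functional>
#include <memory>
#include <string>

// Simplified stand-in for a terminated process::Future<Nothing>: the
// provider either ended cleanly or failed with a message.
struct Termination
{
  bool failed;
  std::string message;
};

// Stand-in for the provider interface with the proposed method. Here
// wait() blocks and returns the terminal state directly; the proposal
// above returns a future instead.
class Provider
{
public:
  virtual ~Provider() = default;
  virtual Termination wait() = 0;
};

struct SuperviseResult
{
  bool succeeded;  // True if some instance terminated without a failure.
  int launches;    // Total number of instances that were launched.
};

// Hypothetical supervision loop: launch a provider, wait for it to
// terminate, and launch a fresh instance if it failed, giving up after
// `maxRetries` relaunches.
SuperviseResult supervise(
    const std::function<std::unique_ptr<Provider>()>& launch,
    int maxRetries)
{
  int launches = 0;

  while (launches <= maxRetries) {
    std::unique_ptr<Provider> provider = launch();
    ++launches;

    if (!provider->wait().failed) {
      return {true, launches};
    }

    // A backoff delay could be applied here before the next attempt.
  }

  return {false, launches};
}

// Test double whose wait() fails or succeeds as configured.
class ScriptedProvider : public Provider
{
public:
  explicit ScriptedProvider(bool fail) : fail_(fail) {}

  Termination wait() override
  {
    return fail_ ? Termination{true, "container exited"}
                 : Termination{false, ""};
  }

private:
  bool fail_;
};
```

In this sketch the loop only sees the provider through {{wait}}, which mirrors the point above that the daemon does not need to know about the resource provider manager.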
was (Author: chhsia0):
There are two problems here:
1. How to make the agent more robust when handling SLRP failures.
2. How to surface the SLRP failures.
It seems to me 2 can be addressed by MESOS-8380.
Thought dumps for 1:
We can make the {{LocalResourceProviderDaemon}} act like systemd: retry
launching the SLRP when there is a launch failure, potentially with an
exponential backoff.
> Storage local provider does not sufficiently handle container launch failures
> or errors
> ---------------------------------------------------------------------------------------
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
> Issue Type: Improvement
> Components: agent, storage
> Reporter: Benjamin Bannier
> Priority: Critical
>
> The storage local resource provider as currently implemented does not handle
> launch failures or task errors of its standalone containers well enough. If,
> e.g., an RP container fails to come up during node start, a warning would be
> logged, but an operator still needs to detect degraded functionality,
> manually check the state of containers with {{GET_CONTAINERS}}, and decide
> whether the agent needs restarting; I suspect they do not always have
> enough context for this decision. It would be better if the provider would
> either enforce a restart by failing over the whole agent, or retry the
> operation (optionally up to some maximum number of retries).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)