[ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695160#comment-16695160 ]
Chun-Hung Hsiao commented on MESOS-8400:
----------------------------------------
Some thoughts:
This can be tackled in either of two ways:
1. Add retry logic with exponential backoff to the
{{StorageLocalResourceProviderProcess::call}} method.
2. Fail the resource provider and simply rely on MESOS-9223 to start a new
instance. Pros and cons:
* + SLRP no longer needs to manage its container daemon; it just does a launch
and fails itself if {{Probe}} fails. In the future we may want an external
orchestrator, e.g., Marathon, to manage the lifecycle of a local resource
provider, to enable features like rolling upgrades. To achieve this, we can add
a very simple relaunch policy to the default executor, and make it
responsible for relaunching the SLRP pod, which contains an SLRP task and a CSI
task, upon failure.
* - A failure would lead to RP reregistration, and therefore to multiple
{{UpdateSlaveMessage}}s.
In that future vision, option 1 might still be needed if the relaunch policy is
applied on a per-task basis instead of a per-pod basis.
So we can go with option 1 for now, and do the remaining refactoring in the future.
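For illustration, the retry-with-exponential-backoff idea from the first approach could look roughly like the sketch below. This is not the actual SLRP code: {{retryWithBackoff}}, its parameters, and the use of {{std::optional}} to signal failure are all hypothetical, and it blocks the calling thread for simplicity, whereas a real change inside {{StorageLocalResourceProviderProcess::call}} would presumably be asynchronous and built on futures. It assumes C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <thread>

// Hypothetical helper, not a Mesos API: invokes `attempt` until it yields a
// value or `maxAttempts` attempts have been made, doubling the delay (capped
// at `maxDelay`) after each failed attempt.
template <typename T>
std::optional<T> retryWithBackoff(
    const std::function<std::optional<T>()>& attempt,
    int maxAttempts,
    std::chrono::milliseconds initialDelay,
    std::chrono::milliseconds maxDelay)
{
  assert(maxAttempts > 0);

  std::chrono::milliseconds delay = initialDelay;

  for (int i = 0; i < maxAttempts; ++i) {
    std::optional<T> result = attempt();
    if (result.has_value()) {
      return result;
    }

    // Back off before the next attempt; no sleep after the final failure.
    if (i + 1 < maxAttempts) {
      std::this_thread::sleep_for(delay);
      delay = std::min(delay * 2, maxDelay);
    }
  }

  // Every attempt failed; the caller decides how to surface the error.
  return std::nullopt;
}
```

A caller would wrap a single CSI call in the {{attempt}} callback and return {{std::nullopt}} when the call fails because the plugin is unavailable, so that a later attempt can pick up the new service future once the container daemon has reset it. Capping the delay keeps a long plugin outage from pushing the wait between attempts arbitrarily high.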
> Retry logic for CSI calls when plugin crashes
> ---------------------------------------------
>
> Key: MESOS-8400
> URL: https://issues.apache.org/jira/browse/MESOS-8400
> Project: Mesos
> Issue Type: Improvement
> Reporter: Chun-Hung Hsiao
> Assignee: Chun-Hung Hsiao
> Priority: Critical
> Labels: mesosphere, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its
> corresponding {{csi::Client}} service future. However, if there is a racy CSI
> call, the call may be issued before the future is reset, resulting in a
> failure for that CSI call. This could be avoided by introducing retry
> logic. There are two possibilities:
> 1. If a gRPC channel can continue to work after its underlying domain socket
> is unbound, removed, and bound again with the same filename (but a different
> fd), then we can consider implementing the retry logic in {{csi::Client}}.
> The downside is that the racy call would go to the old future and all
> subsequent calls would go to the new future set up by the container daemon.
> 2. If the gRPC channel is bound to the domain socket fd, then we need to
> implement the retry logic in SLRP.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)