[ 
https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695160#comment-16695160
 ] 

Chun-Hung Hsiao commented on MESOS-8400:
----------------------------------------

Thought dumps:

This can be tackled in either of two ways:
1. Add retry logic with exponential backoff to the 
{{StorageLocalResourceProviderProcess::call}} method.
2. Fail the resource provider and simply rely on MESOS-9223 to start a new 
instance. Pros and cons of this approach:
* + SLRP no longer needs to manage its container daemon; it just does a launch 
and fails itself if {{Probe}} fails. In the future we may want an external 
orchestrator, e.g., Marathon, to manage the lifecycle of a local resource 
provider, to enable features like rolling upgrades. To achieve this, we can add 
a very simple relaunch policy to the default executor and make it 
responsible for relaunching the SLRP pod (containing an SLRP task and a CSI 
task) upon failure.
* - A failure would lead to RP reregistration, and therefore to multiple 
{{UpdateSlaveMessage}}s.

In that future vision, option 1 might still be needed if the relaunch policy is 
applied on a per-task basis instead of a per-pod basis.
So we can go with option 1 for now and do the remaining refactoring in the 
future.
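
To make option 1 concrete, here is a minimal, synchronous sketch of retry with 
exponential backoff. It is not the SLRP implementation: {{retryWithBackoff}} 
and all of its parameters are hypothetical names invented for illustration, and 
the blocking sleep stands in for whatever asynchronous delay the real 
{{StorageLocalResourceProviderProcess::call}} would need, since blocking inside 
an actor would likely be inappropriate.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of Mesos. Invokes `call` until it reports
// success (returns true) or `maxAttempts` attempts have been made. The delay
// between attempts starts at `initialBackoff` and doubles after each failure,
// capped at `maxBackoff`. Returns whether any attempt succeeded.
bool retryWithBackoff(
    const std::function<bool()>& call,
    int maxAttempts,
    std::chrono::milliseconds initialBackoff,
    std::chrono::milliseconds maxBackoff)
{
  assert(maxAttempts >= 1);

  std::chrono::milliseconds backoff = initialBackoff;

  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    if (call()) {
      return true;
    }

    // Do not sleep after the final failed attempt.
    if (attempt < maxAttempts) {
      // Assumption: a blocking sleep is acceptable in this sketch only.
      std::this_thread::sleep_for(backoff);
      backoff = std::min<std::chrono::milliseconds>(backoff * 2, maxBackoff);
    }
  }

  return false;
}
```

In this sketch the callable would wrap a single CSI call and return false when 
the call fails because the plugin is unavailable. Capping the backoff keeps the 
worst-case delay bounded while the container daemon restarts the plugin.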

> Retry logic for CSI calls when plugin crashes
> ---------------------------------------------
>
>                 Key: MESOS-8400
>                 URL: https://issues.apache.org/jira/browse/MESOS-8400
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Critical
>              Labels: mesosphere, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its 
> corresponding {{csi::Client}} service future. However, if there is a racy CSI 
> call, the call may be issued before the future is reset, resulting in a 
> failure for that CSI call. This could be avoided by introducing retry 
> logic. There are two possibilities:
> 1. If a gRPC channel can continue to work after its underlying domain socket 
> is unbound, removed, and bound again with the same filename (but a different 
> fd), then we can consider implementing the retry logic in {{csi::Client}}. 
> The downside is that the racy call would go to the old future and all 
> succeeding calls would go to the new future set up by the container daemon.
> 2. If the gRPC channel is bound to the domain socket fd, then we need to 
> implement the retry logic in SLRP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
