[ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917584#comment-16917584 ]
Benjamin Bannier edited comment on MESOS-8400 at 9/4/19 1:51 PM:
-----------------------------------------------------------------

Reviews:

-[https://reviews.apache.org/r/71382]-
-[https://reviews.apache.org/r/71383]-
[https://reviews.apache.org/r/71384]
[https://reviews.apache.org/r/71385]

was (Author: bbannier):
Reviews:

-https://reviews.apache.org/r/71382-
-[https://reviews.apache.org/r/71383]-
[https://reviews.apache.org/r/71384]
[https://reviews.apache.org/r/71385]

> Handle plugin crashes gracefully in SLRP recovery.
> --------------------------------------------------
>
> Key: MESOS-8400
> URL: https://issues.apache.org/jira/browse/MESOS-8400
> Project: Mesos
> Issue Type: Improvement
> Reporter: Chun-Hung Hsiao
> Assignee: Benjamin Bannier
> Priority: Blocker
> Labels: mesosphere, mesosphere-dss-post-ga, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its
> corresponding {{csi::Client}} service future. However, if a CSI call races
> with a plugin crash, the call may be issued before the service future is
> reset, resulting in a failure for that CSI call. MESOS-9517 partly addresses
> this for {{CreateVolume}} and {{DeleteVolume}} calls, but calls in the SLRP
> recovery path, e.g., {{ListVolume}}, {{GetCapacity}}, {{Probe}}, could make
> the SLRP unrecoverable.
> There are two main issues:
> 1. For {{Probe}}, we should investigate if it is needed to make a few retry
> attempts, then after that, we should recover from failed attempts (e.g., kill
> the plugin container), then make the container daemon relaunch the plugin
> instead of failing the daemon.
> 2. For other calls in the recovery path, we should either retry the call, or
> make the local resource provider daemon be able to restart the SLRP after it
> fails.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
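
The "retry the call" option from the issue description can be illustrated with a minimal, self-contained C++ sketch. It is not the Mesos implementation and does not use the real {{csi::Client}} API: {{CallResult}}, {{retryCall}}, {{getService}} and {{call}} are hypothetical names introduced only for illustration. The one idea it shows is that each attempt looks up the service endpoint again, so an endpoint left over from a crashed plugin is not reused on the next attempt.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for the outcome of a CSI call (not a Mesos type).
struct CallResult {
  bool ok;            // True if the call succeeded.
  std::string value;  // Payload on success, error message on failure.
};

// Runs `call` at most `maxAttempts` times and returns the first successful
// result, or the last failure if every attempt failed.
//
// `getService` is invoked again before every attempt. This models waiting
// for a fresh service endpoint after the plugin has been relaunched, rather
// than reusing the endpoint that was valid before the crash.
inline CallResult retryCall(
    const std::function<std::string()>& getService,
    const std::function<CallResult(const std::string&)>& call,
    int maxAttempts)
{
  assert(maxAttempts > 0);

  CallResult result{false, "not attempted"};

  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    // Look up the endpoint anew for this attempt.
    result = call(getService());

    if (result.ok) {
      return result;
    }
  }

  // All attempts failed; the caller decides how to recover, e.g. by
  // killing and relaunching the plugin container.
  return result;
}
```

A caller would pass a {{getService}} that returns the current endpoint and a {{call}} that wraps one recovery-path request such as {{Probe}}. The sketch retries immediately; a real implementation would most likely wait between attempts so the container daemon has time to relaunch the plugin, and would retry only on errors known to be transient.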