> On Aug. 7, 2020, 6:04 a.m., Greg Mann wrote: > > src/slave/containerizer/mesos/isolators/volume/csi/isolator.cpp > > Lines 383-387 (patched) > > <https://reviews.apache.org/r/72734/diff/1/?file=2237067#file2237067line383> > > > > After looking at the code, it seems that a failure here would surface > > as a TASK_FAILED status update to the scheduler due to a failed container > > termination, is that your understanding as well? > > > > I'm wondering if we need to add some extra handling for the failed > > unpublish case, either here or in the Mesos containerizer somehow. What > > action needs to be taken in the case where an unpublish call to the CSI > > plugin fails for some transient reason?
> After looking at the code, it seems that a failure here would surface as a > TASK_FAILED status update to the scheduler due to a failed container > termination, is that your understanding as well? Yes. > I'm wondering if we need to add some extra handling for the failed unpublish > case, either here or in the Mesos containerizer somehow. What action needs to > be taken in the case where an unpublish call to the CSI plugin fails for some > transient reason? Retry the unpublish call? But how can we know whether it fails for a **transient** reason? > On Aug. 7, 2020, 6:04 a.m., Greg Mann wrote: > > src/slave/containerizer/mesos/isolators/volume/csi/isolator.cpp > > Lines 389-391 (patched) > > <https://reviews.apache.org/r/72734/diff/1/?file=2237067#file2237067line389> > > > > I'm wondering if there's an issue that if we return a failure here, the > > state on disk (the `volumes` file for this container) and the state in > > memory (the state stored in `infos`) will be inconsistent. > > > > If the agent were to restart, the CSI volume isolator could clean up > > that file if it refers to an unknown container. In that case, we will have > > left a volume published on this node without a record of it, but maybe > > that's OK? In that case, when agent is restarted, containerizer will try to destroy the container again which in turn will call `cleanup` to try to unpublish the volume again. Before the agent is restarted, yes we will have left a volume published on the node without a record of it in memory (since we have already done `infos.erase(containerId)` before making the unpublish call), but I think it should OK. Do you see any problems with it? - Qian ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/72734/#review221485 ----------------------------------------------------------- On Aug. 5, 2020, 5:20 p.m., Qian Zhang wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/72734/ > ----------------------------------------------------------- > > (Updated Aug. 5, 2020, 5:20 p.m.) > > > Review request for mesos, Andrei Budnik and Greg Mann. > > > Bugs: MESOS-10154 > https://issues.apache.org/jira/browse/MESOS-10154 > > > Repository: mesos > > > Description > ------- > > Implemented the `cleanup` method of `volume/csi` isolator. > > > Diffs > ----- > > src/slave/containerizer/mesos/isolators/volume/csi/isolator.hpp > PRE-CREATION > src/slave/containerizer/mesos/isolators/volume/csi/isolator.cpp > PRE-CREATION > > > Diff: https://reviews.apache.org/r/72734/diff/1/ > > > Testing > ------- > > > Thanks, > > Qian Zhang > >
