> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote: > > src/resource_provider/storage/provider.cpp > > Lines 1158-1171 (patched) > > <https://reviews.apache.org/r/69010/diff/1/?file=2096857#file2096857line1158> > > > > Since the connection of this to the surrounding could is not > > immediately clear, could this be made part of > > `recoverResourceProviderState`? > > Chun-Hung Hsiao wrote: > This is for constructing the list of operations for SUM (which is an > external component wrt the SLRP) to report its checkpointing, so the SLRP > could *reconcile* the the status updates. `recoverResourceProviderState` is > for reading the SLRP's own state checkpoint and reconstruct its bookkeeping > in memory so it seems not a good idea to mix the logic here and that function.
Did a bit of refactoring. Now the garbage collection logic is moved to the `garbageCollectOperationPath` function. > On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote: > > src/resource_provider/storage/provider.cpp > > Lines 1175-1187 (original), 1191-1203 (patched) > > <https://reviews.apache.org/r/69010/diff/1/?file=2096857#file2096857line1191> > > > > Since at this point `uuid` is already not tracked anymore, I'd suggest > > to move this garbage collection into `checkpointResourceProviderState`. > > > > In that more general approach we should probably always check that > > `path` exists before turning an `Error` to remove it into a `Failure`. > > Chun-Hung Hsiao wrote: > The path is removed only if the operation is terminal. But we call > `checkpointResourceProviderState` after every change in totol resources, > operations, etc. Since this logic is exercised at a specific scenario, it > seems better to keep that function simple. > > Also note that the operation path is created by the SUM, not by SLRP. Refactored to `garbageCollectOperationPath`. > On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote: > > src/resource_provider/storage/provider.cpp > > Lines 1797-1811 (original), 1814-1828 (patched) > > <https://reviews.apache.org/r/69010/diff/1/?file=2096857#file2096857line1814> > > > > It seems this could be part of `checkpointResourceProviderState`, see > > above. Refactored to `garbaceCollectOperationPath`. - Chun-Hung ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69010/#review209542 ----------------------------------------------------------- On Oct. 15, 2018, 10:08 p.m., Chun-Hung Hsiao wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69010/ > ----------------------------------------------------------- > > (Updated Oct. 15, 2018, 10:08 p.m.) > > > Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht. > > > Bugs: MESOS-9281 > https://issues.apache.org/jira/browse/MESOS-9281 > > > Repository: mesos > > > Description > ------- > > Currently if a system crashes, SLRP checkpoints might not be synced to > the filesystem, so it is possible that an old or empty checkpoint will > be read upon recovery. Moreover, if a CSI call has been issued right > before the crash, the recovered state may be inconsistent with the > actual state reported by the plugin. For example, the plugin might have > created a volume but the checkpointed state does not know about it. > > To avoid this inconsistency, we always call fsync() when checkpointing > SLRP states. > > > Diffs > ----- > > src/resource_provider/storage/provider.cpp > db783b53558811081fb2671e005e8bbbd9edbede > src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 > > > Diff: https://reviews.apache.org/r/69010/diff/2/ > > > Testing > ------- > > make check > > > Thanks, > > Chun-Hung Hsiao > >
