Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Benjamin Bannier Mon, 15 Oct 2018 07:22:27 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209542
-----------------------------------------------------------





src/resource_provider/storage/provider.cpp
Line 663 (original), 663 (patched)
<https://reviews.apache.org/r/69010/#comment294055>

    This could be a separate, absolutely non-controversial patch.



src/resource_provider/storage/provider.cpp
Lines 1158-1171 (patched)
<https://reviews.apache.org/r/69010/#comment294054>

    Since the connection of this to the surrounding code is not immediately 
clear, could this be made part of `recoverResourceProviderState`?



src/resource_provider/storage/provider.cpp
Lines 1175-1187 (original), 1191-1203 (patched)
<https://reviews.apache.org/r/69010/#comment294056>

    Since `uuid` is no longer tracked at this point, I'd suggest moving this 
garbage collection into `checkpointResourceProviderState`.
    
    In that more general approach we should probably always check that `path` 
exists before turning an `Error` from removing it into a `Failure`.
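    
    A minimal sketch of that check, assuming plain POSIX calls rather than the 
helpers the provider actually uses: treating `ENOENT` from `unlink` as success 
has the same effect as checking for existence first, without a window between 
the check and the removal. `removeIfExists` is a hypothetical name, and the 
returned string only stands in for the `Error` that would become a `Failure`.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

#include <unistd.h>

// Hypothetical helper, not part of the patch under review.
// Removes the file at `path`. A path that is already gone counts as
// success, so only genuine removal errors are reported to the caller.
// Returns an empty string on success and an error message otherwise.
std::string removeIfExists(const std::string& path)
{
  assert(!path.empty());

  if (::unlink(path.c_str()) == 0 || errno == ENOENT) {
    return "";
  }

  // Anything else (permissions, `path` is a directory, ...) is a real error.
  return "Failed to remove '" + path + "': " + std::strerror(errno);
}
```

    With something like this in `checkpointResourceProviderState`, cleaning up 
an untracked `uuid` twice would not fail the second time.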



src/resource_provider/storage/provider.cpp
Lines 1797-1811 (original), 1814-1828 (patched)
<https://reviews.apache.org/r/69010/#comment294057>

    It seems this could be part of `checkpointResourceProviderState`, see above.



src/slave/state.hpp
Lines 126 (patched)
<https://reviews.apache.org/r/69010/#comment294051>

    See below.



src/slave/state.hpp
Line 136 (original), 137 (patched)
<https://reviews.apache.org/r/69010/#comment294052>

    See below.



src/slave/state.hpp
Lines 173 (patched)
<https://reviews.apache.org/r/69010/#comment294053>

    See below.



src/slave/state.hpp
Line 192 (original), 196 (patched)
<https://reviews.apache.org/r/69010/#comment294045>

    I agree with James here. It seems totally fine to me to _always `sync`_ 
here. Could we do that? Alternatively we could introduce a dedicated function 
with weaker guarantees (e.g., `try_checkpoint`), but I don't see many good 
reasons for that, yet.



src/slave/state.hpp
Line 214 (original), 218 (patched)
<https://reviews.apache.org/r/69010/#comment294046>

    Is this `sync` required here? It seems syncing after the `rename` below 
would be totally fine.
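    
    For reference while discussing the ordering, here is a sketch of one 
common POSIX pattern, not the code under review: write a temporary file, 
`fsync` it, `rename` it over the target, then `fsync` the parent directory so 
the `rename` itself is durable. The names `checkpoint`, `syncPath` and 
`parentDirectory`, the fixed `.tmp` suffix and the string return value are all 
assumptions made for illustration; whether the first `fsync` can be dropped is 
exactly the question above.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: calls `fsync` on an existing file or directory.
static bool syncPath(const std::string& path)
{
  int fd = ::open(path.c_str(), O_RDONLY);
  if (fd < 0) {
    return false;
  }
  const bool ok = ::fsync(fd) == 0;
  ::close(fd);
  return ok;
}

// Hypothetical helper: returns the directory containing `path`.
static std::string parentDirectory(const std::string& path)
{
  const std::string::size_type slash = path.find_last_of('/');
  if (slash == std::string::npos) return ".";
  if (slash == 0) return "/";
  return path.substr(0, slash);
}

// Always-syncing checkpoint sketch. Returns an empty string on success
// and an error message otherwise.
std::string checkpoint(const std::string& path, const std::string& data)
{
  assert(!path.empty());

  // Assumption: a fixed suffix; real code would use a unique temporary name.
  const std::string temp = path + ".tmp";

  int fd = ::open(temp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) {
    return "Failed to open '" + temp + "': " + std::strerror(errno);
  }

  std::string error;
  size_t written = 0;
  while (error.empty() && written < data.size()) {
    ssize_t n = ::write(fd, data.data() + written, data.size() - written);
    if (n < 0 && errno == EINTR) continue;
    if (n < 0) {
      error = "Failed to write '" + temp + "': " + std::strerror(errno);
    } else {
      written += static_cast<size_t>(n);
    }
  }

  // First sync: file contents reach the disk before the rename is issued.
  if (error.empty() && ::fsync(fd) != 0) {
    error = "Failed to sync '" + temp + "': " + std::strerror(errno);
  }
  ::close(fd);

  if (error.empty() && std::rename(temp.c_str(), path.c_str()) != 0) {
    error = "Failed to rename '" + temp + "': " + std::strerror(errno);
  }

  if (!error.empty()) {
    ::unlink(temp.c_str()); // Best-effort cleanup of the temporary file.
    return error;
  }

  // Second sync: makes the new directory entry durable after the rename.
  if (!syncPath(parentDirectory(path))) {
    return "Failed to sync directory of '" + path + "'";
  }
  return "";
}
```

    A dedicated `try_checkpoint` with weaker guarantees would simply skip the 
two sync steps.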


- Benjamin Bannier


On Oct. 13, 2018, 1:19 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 13, 2018, 1:19 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync() when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp 
> db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>
