> On Feb. 13, 2019, 1:28 p.m., Andrei Budnik wrote: > > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp > > Lines 300 (patched) > > <https://reviews.apache.org/r/69972/diff/1/?file=2125066#file2125066line300> > > > > Given that `state::checkpoint` is **atomic**, we can not end up in the > > state where the file is empty because the agent did not finish writing to > > it. > > > > However, an empty file might occur in case of hard reboot of the > > agent's host. This happens because page cache is dumped every 20 seconds by > > default in Linux. There is a chance that the file is created, but data has > > not yet synced on disk. > > > > As we have agreed with Gilbert, we need to ignore empty files **only** > > in case of orphan containers. > > Qian Zhang wrote: > > Given that state::checkpoint is atomic, we can not end up in the state > where the file is empty because the agent did not finish writing to it. > > However, an empty file might occur in case of hard reboot of the > agent's host. This happens because page cache is dumped every 20 seconds by > default in Linux. There is a chance that the file is created, but data has > not yet synced on disk. > > Agree. And I see the comment `// This could happen if the slave died > after opening the file for writing but before it checkpointed anything.` in a > couple of places in Mesos code (e.g., `slave/state.cpp`, > `metadata_manager.cpp`), I think those comments need to be updated as well. > > > As we have agreed with Gilbert, we need to ignore empty files only in > case of orphan containers. > > Can you please elaborate a bit? Why do we want to treat orphan containers > and recoverable containers differently? How will we handle the recoverable > containers in this case?
> I think those comments need to be updated as well. We can update these comments in a separate patch later. > Why do we want to treat orphan containers and recoverable containers > differently? How will we handle the recoverable containers in this case? I think that `recoverable` containers could not have empty state files by construction as checkpointing state is atomic. If this invariant does not satisfy, then we definitely have a bug in our code which we need to fix. In the case of hard reboots, this invariant might be broken, but all containers are `orphan`. If the reasoning above looks acceptable, then we might want to recover after broken invariant for `orphan` containers, while keeping this error for `non-orphan` containers as an assertion (for us, developers) that the invariant could not be broken in normal circumstances. - Andrei ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69972/#review212795 ----------------------------------------------------------- On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69972/ > ----------------------------------------------------------- > > (Updated Feb. 13, 2019, 8:26 a.m.) > > > Review request for mesos, Andrei Budnik and Gilbert Song. > > > Bugs: MESOS-9507 > https://issues.apache.org/jira/browse/MESOS-9507 > > > Repository: mesos > > > Description > ------- > > There are two cases we need to handle: > 1. The checkpointed docker volumes file does not exist. > 2. The checkpointed docker volumes file is empty. > For both of the two cases, in the recovery of `docker/volume` isolator, > we should remove the container's checkpoint directory and then skip the > container. > > > Diffs > ----- > > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp > a72fc84da6fb0f24d363dd4c635500510da675d8 > > > Diff: https://reviews.apache.org/r/69972/diff/1/ > > > Testing > ------- > > > Thanks, > > Qian Zhang > >