> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > <https://reviews.apache.org/r/69972/diff/1/?file=2125066#file2125066line300>
> >
> >     Given that `state::checkpoint` is **atomic**, we can not end up in the 
> > state where the file is empty because the agent did not finish writing to 
> > it.
> >     
> >     However, an empty file might occur in case of hard reboot of the 
> > agent's host. This happens because page cache is dumped every 20 seconds by 
> > default in Linux. There is a chance that the file is created, but data has 
> > not yet synced on disk.
> >     
> >     As we have agreed with Gilbert, we need to ignore empty files **only** 
> > in case of orphan containers.
> 
> Qian Zhang wrote:
>     > Given that state::checkpoint is atomic, we can not end up in the state 
> where the file is empty because the agent did not finish writing to it.
>     > However, an empty file might occur in case of hard reboot of the 
> agent's host. This happens because page cache is dumped every 20 seconds by 
> default in Linux. There is a chance that the file is created, but data has 
> not yet synced on disk.
>     
>     Agree. And I see the comment `// This could happen if the slave died 
> after opening the file for writing but before it checkpointed anything.` in a 
> couple of places in Mesos code (e.g., `slave/state.cpp`, 
> `metadata_manager.cpp`), I think those comments need to be updated as well.
>     
>     > As we have agreed with Gilbert, we need to ignore empty files only in 
> case of orphan containers.
>     
>     Can you please elaborate a bit? Why do we want to treat orphan containers 
> and recoverable containers differently? How will we handle the recoverable 
> containers in this case?
> 
> Andrei Budnik wrote:
>     > I think those comments need to be updated as well.
>     
>     We can update these comments in a separate patch later.
>     
>     > Why do we want to treat orphan containers and recoverable containers 
> differently? How will we handle the recoverable containers in this case?
>     
>     I think that `recoverable` containers could not have empty state files by 
> construction as checkpointing state is atomic. If this invariant does not 
> satisfy, then we definitely have a bug in our code which we need to fix.
>     In the case of hard reboots, this invariant might be broken, but all 
> containers are `orphan`.
>     
>     If the reasoning above looks acceptable, then we might want to recover 
> after broken invariant for `orphan` containers, while keeping this error for 
> `non-orphan` containers as an assertion (for us, developers) that the 
> invariant could not be broken in normal circumstances.

Agree, and I posted a patch for updating comments here: 
https://reviews.apache.org/r/70001/


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
-----------------------------------------------------------


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp 
> a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

Reply via email to