> On Sept. 21, 2012, 3:23 a.m., Jie Yu wrote:
> > Ben, I am just curious whether you have observed a case in which a retry is 
> > useful?
> > 
> > From my experience, if a cgroup gets stuck in the FREEZING state (e.g. 
> > some process is in T or Z state), writing FROZEN to retry never brings 
> > the state to FROZEN.
> > 
> > If you do see a case in which a retry is useful, let me know.
> 
> Benjamin Hindman wrote:
>     We've actually seen cases in which a process in the cgroup is still in R! 
> It's possible that at the time the kernel could not freeze that process for 
> whatever reason, and so retrying seems to be the only option (although, I 
> hope that it's not the case that the process can never be frozen, which would 
> seem like a pretty serious design issue).
> 
> Jie Yu wrote:
>     > We've actually seen cases in which a process in the cgroup is still in 
> R!
>     
>     Maybe this is a kernel bug (race condition?) ;) from my understanding of 
> the kernel code, this seems to be impossible...
>     
>     You can take a look at "kernel/cgroup_freezer.c"
>     
>     Probably you can start with the function "freezer_write(...)"
> 
> Benjamin Hindman wrote:
>     Hmm, so is the documentation out of date? The documentation makes me 
> think that partially frozen cgroups are indeed possible and expected, and the 
> user might need to try and freeze a cgroup multiple times (I attached the 
> relevant snippet from the documentation in the review summary above).
> 
> Jie Yu wrote:
>     No, I am not saying that the doc is out-of-date. What I am trying to 
> understand is why a process in "R" state cannot be frozen.
>     
>     I will take a look at the kernel code that you use, and let you know the 
> possible explanation.
> 
> Benjamin Hindman wrote:
>     Sounds great, thanks! In the meantime, I'll commit this change and see 
> if it fixes the issue.

I ran 50 tasks, each of which forked off 20 processes (where each process 
technically forked ~4 subprocesses).

The memory limit for the tasks was about 10% too low for start-up, but just 
about right for the steady state, which resulted in non-deterministic OOMing of 
tasks.  Eventually all of them were scheduled and running fine, but only after 
about 200 tasks had been OOMed.  So we had a big sample set of OOM kills.  Of 
the ~200, 3 got stuck in this state.  The freezer froze _right_ in the middle 
of those 80 forks, and the cgroup was left in the FREEZING state with only one 
process in R and the rest in D/Ds.
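
For reference, the "retry" option from the freezer documentation quoted below 
can be sketched roughly as follows. This is only an illustration under stated 
assumptions, not the code in src/linux/cgroups.cpp: the names freezeWithRetry, 
readState and writeState, the attempt limit, and the polling interval are all 
hypothetical.

```cpp
#include <chrono>
#include <fstream>
#include <functional>
#include <string>
#include <thread>

// Hypothetical helper: read the first line of a file such as
// <cgroup>/freezer.state ("THAWED", "FREEZING" or "FROZEN").
std::string readState(const std::string& path) {
  std::ifstream in(path);
  std::string state;
  std::getline(in, state);
  return state;
}

// Hypothetical helper: write a value to a file such as
// <cgroup>/freezer.state. Returns false if the write failed
// (the quoted documentation says an incomplete freeze returns EBUSY).
bool writeState(const std::string& path, const std::string& value) {
  std::ofstream out(path);
  out << value << std::flush;
  return out.good();
}

// Write "FROZEN" up to maxAttempts times, re-reading the state after each
// write. Returns true as soon as the state reads "FROZEN", false if the
// cgroup is still not frozen after the last attempt.
bool freezeWithRetry(
    const std::function<bool(const std::string&)>& write,
    const std::function<std::string()>& read,
    int maxAttempts,
    std::chrono::milliseconds interval) {
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    // A failed write is not fatal here: the cgroup may simply be
    // partially frozen, so we check the state and retry.
    write("FROZEN");
    if (read() == "FROZEN") {
      return true;
    }
    std::this_thread::sleep_for(interval);
  }
  return false;
}
```

A caller would bind write and read to writeState and readState on the 
cgroup's freezer.state path. If freezeWithRetry returns false, the caller can 
cancel the operation by writing "THAWED", as the documentation describes.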


- Brian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7203/#review11766
-----------------------------------------------------------


On Sept. 21, 2012, 2:02 a.m., Benjamin Hindman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/7203/
> -----------------------------------------------------------
> 
> (Updated Sept. 21, 2012, 2:02 a.m.)
> 
> 
> Review request for mesos, Vinod Kone, Brian Wickman, and Jie Yu.
> 
> 
> Description
> -------
> 
> See summary and 
> http://www.kernel.org/doc/Documentation/cgroups/freezer-subsystem.txt:
> 
> It's important to note that freezing can be incomplete. In that case we return
> EBUSY. This means that some tasks in the cgroup are busy doing something that
> prevents us from completely freezing the cgroup at this time. After EBUSY,
> the cgroup will remain partially frozen -- reflected by freezer.state reporting
> "FREEZING" when read. The state will remain "FREEZING" until one of these
> things happens:
> 
>       1) Userspace cancels the freezing operation by writing "THAWED" to
>               the freezer.state file
>       2) Userspace retries the freezing operation by writing "FROZEN" to
>               the freezer.state file (writing "FREEZING" is not legal
>               and returns EINVAL)
>       3) The tasks that blocked the cgroup from entering the "FROZEN"
>               state disappear from the cgroup's set of tasks.
> 
> 
> Diffs
> -----
> 
>   src/linux/cgroups.cpp 4efd06e 
> 
> Diff: https://reviews.apache.org/r/7203/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Benjamin Hindman
> 
>
