The purpose of first freezing the cgroup is to attempt to atomically kill off the group. If there's an issue freezing a cgroup, that's problematic: either the kernel is misbehaving or our code is, and we should try to figure out what's causing the freeze to hang.

Sending SIGKILL to all the pids without at least stopping them first would be incorrect. We have a killtree.sh that you can look at to get an idea of how we stop each process first (when using process isolation). For cgroups isolation, the freezer makes this stopping step simple for us :)
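For reference, the stop-then-kill sequence described above boils down to roughly the following against a cgroup v1 freezer. This is a minimal sketch, not the actual Mesos code: the mount point, cgroup layout, and helper names are illustrative, and real code would handle errors and timeouts.

    // Illustrative sketch: freeze the cgroup, kill everything in it, then
    // thaw so the pending SIGKILLs are delivered. Assumes a cgroup v1
    // freezer hierarchy mounted at /cgroup/mesos, as in the logs later in
    // this thread.
    #include <signal.h>
    #include <sys/types.h>

    #include <chrono>
    #include <fstream>
    #include <string>
    #include <thread>

    static const std::string ROOT = "/cgroup/mesos/";  // assumed mount point

    static void writeControl(const std::string& cgroup,
                             const std::string& control,
                             const std::string& value)
    {
      std::ofstream out(ROOT + cgroup + "/" + control);
      out << value << std::endl;
    }

    // freezer.state reads back THAWED, FREEZING, or FROZEN.
    static std::string readState(const std::string& cgroup)
    {
      std::ifstream in(ROOT + cgroup + "/freezer.state");
      std::string state;
      in >> state;
      return state;
    }

    void killCgroup(const std::string& cgroup)
    {
      writeControl(cgroup, "freezer.state", "FROZEN");

      // The kernel reports FREEZING until every task has stopped; a task
      // stuck in uninterruptible (D) sleep can keep us waiting here.
      while (readState(cgroup) != "FROZEN") {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }

      // Signal every task in the (now frozen) cgroup.
      std::ifstream tasks(ROOT + cgroup + "/tasks");
      pid_t pid;
      while (tasks >> pid) {
        ::kill(pid, SIGKILL);
      }

      // Thaw so the signals are actually delivered.
      writeControl(cgroup, "freezer.state", "THAWED");
    }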
On Fri, May 10, 2013 at 3:58 PM, Brenden Matthews <[email protected]> wrote:

I can only perform tasks which don't touch the same partition that the slave reads/writes from (i.e., /tmp). It seems that the entire device becomes locked and I'm not able to touch it.

What about doing the following sequence:

    killTasks():

      chain = kill()                         // Kill current tasks.
        .then(defer(self(), &Self::freeze))  // Freeze the cgroup.
        .then(defer(self(), &Self::kill))    // Kill any remaining (i.e., recently forked) tasks.
        .then(defer(self(), &Self::thaw))    // Thaw cgroup to deliver signals.
        .then(defer(self(), &Self::empty));  // Wait until cgroup is empty.

Perhaps this will help to un-stick the stuck processes before trying to freeze the whole thing? I might try it and see how it goes.
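Spelled out imperatively, the proposed ordering looks roughly like the sketch below. It reuses the same illustrative cgroup v1 layout as the earlier sketch and is not the actual killTasks() implementation, which would chain these steps as libprocess futures as in the snippet above.

    // Rough sketch of the proposed ordering: kill first, then freeze to
    // stop any new forks, kill again, thaw, and wait for the cgroup to
    // drain. Paths and helper names are illustrative.
    #include <signal.h>
    #include <sys/types.h>

    #include <chrono>
    #include <fstream>
    #include <string>
    #include <thread>

    static const std::string ROOT = "/cgroup/mesos/";  // assumed mount point

    static void signalAll(const std::string& cgroup, int sig)
    {
      std::ifstream tasks(ROOT + cgroup + "/tasks");
      pid_t pid;
      while (tasks >> pid) {
        ::kill(pid, sig);
      }
    }

    static void setFreezer(const std::string& cgroup, const std::string& value)
    {
      std::ofstream out(ROOT + cgroup + "/freezer.state");
      out << value << std::endl;
    }

    static bool isEmpty(const std::string& cgroup)
    {
      std::ifstream tasks(ROOT + cgroup + "/tasks");
      pid_t pid;
      return !(tasks >> pid);  // true if no pids are left in the cgroup
    }

    void killTasks(const std::string& cgroup)
    {
      signalAll(cgroup, SIGKILL);    // Kill current tasks.
      setFreezer(cgroup, "FROZEN");  // Freeze the cgroup.
      signalAll(cgroup, SIGKILL);    // Kill any recently forked tasks.
      setFreezer(cgroup, "THAWED");  // Thaw cgroup to deliver signals.

      while (!isEmpty(cgroup)) {     // Wait until cgroup is empty.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }
    }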
On Fri, May 10, 2013 at 2:58 PM, Vinod Kone <[email protected]> wrote:

Hi Brenden,

The reason we do a freeze is so that no new processes are forked while we are killing them. For what it is worth, we have been running cgroups in production at Twitter and I don't remember seeing this issue. Though, I must admit, we haven't run Hadoop TaskTrackers with cgroups here. So, I would like to understand whether this is a Mesos or a kernel bug before we plan a fix.

IIUC, you were not able to gdb into the mesos slave when it is stuck? Is this because the machine itself is hung?

On Fri, May 10, 2013 at 2:40 PM, Brenden Matthews <[email protected]> wrote:

I think that the kernel logs only reveal what we already know: some process is stuck because it's blocked on a disk read/write. It doesn't explain why it's blocked, or what's blocking it. I suspect that the freezer itself is the culprit. When we want to kill a process, maybe we should just kill it rather than freezing it and then killing it?

On Fri, May 10, 2013 at 2:26 PM, Benjamin Mahler <[email protected]> wrote:

The kernel logs sound concerning to me, and a cursory Google search reveals some kernel bug tickets. Is there a time correlation there?

On Fri, May 10, 2013 at 1:13 PM, Brenden Matthews <[email protected]> wrote:

So after cat'ing around in /proc, I can see that the JVM is apparently blocked on disk I/O:

    root@i-bde90adc:~# cat /proc/24689/status
    Name:   java
    State:  D (disk sleep)
    Tgid:   24689
    Pid:    24689
    PPid:   24629
    TracerPid:  0
    Uid:    997 997 997 997
    Gid:    997 997 997 997
    FDSize: 512
    Groups:
    VmPeak: 2198700 kB
    VmSize: 2196016 kB
    VmLck:  0 kB
    VmPin:  0 kB
    VmHWM:  305276 kB
    VmRSS:  296828 kB
    VmData: 2040128 kB
    VmStk:  156 kB
    VmExe:  4 kB
    VmLib:  32692 kB
    VmPTE:  1292 kB
    VmSwap: 0 kB
    Threads:    104
    SigQ:   0/180427
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000000
    SigBlk: 0000000000000000
    SigIgn: 0000000000000000
    SigCgt: 2000000181005ccf
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: ffffffffffffffff
    Cpus_allowed:   ffff
    Cpus_allowed_list:  0-15
    Mems_allowed:   00000000,00000001
    Mems_allowed_list:  0
    voluntary_ctxt_switches:    25
    nonvoluntary_ctxt_switches: 0

Mesos looks okay:

    root@i-bde90adc:~# cat /proc/3684/status
    Name:   lt-mesos-slave
    State:  S (sleeping)
    Tgid:   3684
    Pid:    3684
    PPid:   1131
    TracerPid:  0
    Uid:    0   0   0   0
    Gid:    0   0   0   0
    FDSize: 256
    Groups:
    VmPeak: 1531868 kB
    VmSize: 1531864 kB
    VmLck:  0 kB
    VmPin:  0 kB
    VmHWM:  15272 kB
    VmRSS:  15272 kB
    VmData: 1402012 kB
    VmStk:  136 kB
    VmExe:  152 kB
    VmLib:  31824 kB
    VmPTE:  416 kB
    VmSwap: 0 kB
    Threads:    20
    SigQ:   1/180427
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000000
    SigBlk: 0000000000000000
    SigIgn: 0000000000000000
    SigCgt: 0000000180000000
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    Cpus_allowed:   ffff
    Cpus_allowed_list:  0-15
    Mems_allowed:   00000000,00000001
    Mems_allowed_list:  0
    voluntary_ctxt_switches:    69
    nonvoluntary_ctxt_switches: 0

Is it possible that changing the state of the freezer from thawed -> frozen causes all disk I/O to stop? I'm not sure where to look to mitigate this.

On Fri, May 10, 2013 at 11:58 AM, Brenden Matthews <[email protected]> wrote:

The kernel gives me a bunch of "soft lockup: CPU stuck!" and "task java blocked for more than 120 seconds" messages. I think these are fairly generic errors and can be caused by any sort of race condition on the system. strace also isn't helpful, as it doesn't tell me which resource the processes are stuck on.

I'll try and check /proc/pid/status next time it happens. Other tools like lsof are unable to run.
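Checking for tasks stuck in uninterruptible sleep can also be scripted. Here is a small sketch along the lines of the manual check above: it reads the State: field of /proc/<pid>/status for every pid listed in a cgroup's tasks file. The cgroup path is passed on the command line, and as before this is illustrative rather than actual Mesos code.

    // Flag tasks in a cgroup that are stuck in uninterruptible (D) sleep
    // by reading the State: field from /proc/<pid>/status.
    #include <sys/types.h>

    #include <fstream>
    #include <iostream>
    #include <string>

    // Returns the value of the "State:" line, e.g. "D (disk sleep)".
    static std::string processState(pid_t pid)
    {
      std::ifstream status("/proc/" + std::to_string(pid) + "/status");
      std::string line;
      while (std::getline(status, line)) {
        if (line.compare(0, 6, "State:") == 0) {
          size_t pos = line.find_first_not_of(" \t", 6);
          return pos == std::string::npos ? "" : line.substr(pos);
        }
      }
      return "";
    }

    int main(int argc, char** argv)
    {
      if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " /cgroup/mesos/<cgroup>" << std::endl;
        return 1;
      }

      std::ifstream tasks(std::string(argv[1]) + "/tasks");
      pid_t pid;
      while (tasks >> pid) {
        std::string state = processState(pid);
        if (!state.empty() && state[0] == 'D') {
          std::cout << pid << ": " << state << std::endl;
        }
      }

      return 0;
    }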
On Fri, May 10, 2013 at 11:49 AM, Vinod Kone <[email protected]> wrote:

---------- Forwarded message ----------
From: Vinod Kone <[email protected]>
Date: Fri, May 10, 2013 at 11:25 AM
Subject: Re: OOM killer stuck when freezing cgroup
To: "[email protected]" <[email protected]>

Anything in the kernel logs during this time?

@vinodkone

On Fri, May 10, 2013 at 11:13 AM, Benjamin Mahler <[email protected]> wrote:

Sorry about that miscommunication, I meant to say "from the kernel OOM killer", and you're right that the OOM shouldn't be the source of the problem here (just trying to make a mental note that our accounting is wrong); OOMs are indeed expected as part of normal operation on the slave.

It's tough to say what the issue is, as I don't have any experience running on newer kernels like 3.2. I would recommend augmenting the logging in cgroups::Freezer to try to nail down at which point the slave is blocking, in particular in Freezer::freeze() and Freezer::watchFrozen().

Also, do you have /proc/<slave_pid>/status output for the slave when it's frozen?

It will also be really useful to know whether that is consistently the last log line you see before things lock up, as that helps narrow this down.
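As a rough illustration of the kind of logging being suggested here (this is a hypothetical shape, not the actual cgroups::Freezer code), the idea is to log before and after each step that can block, so the last line in the slave log tells you whether it is the write to freezer.state or the wait for the FROZEN state that hangs:

    // Hypothetical sketch of instrumented freeze logic; not the actual
    // Mesos cgroups::Freezer implementation.
    #include <glog/logging.h>

    #include <chrono>
    #include <fstream>
    #include <string>
    #include <thread>

    void freezeWithLogging(const std::string& cgroupPath)  // e.g. /cgroup/mesos/<cgroup>
    {
      LOG(INFO) << "Writing FROZEN to " << cgroupPath << "/freezer.state";
      {
        std::ofstream out(cgroupPath + "/freezer.state");
        out << "FROZEN" << std::endl;
      }
      LOG(INFO) << "Finished writing FROZEN";

      while (true) {
        std::ifstream in(cgroupPath + "/freezer.state");
        std::string state;
        in >> state;

        LOG(INFO) << "Current freezer state: " << state;

        if (state == "FROZEN") {
          break;
        }

        // Still FREEZING: some task has not stopped yet (e.g. stuck in
        // D state), so keep polling and logging.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }

      LOG(INFO) << "Cgroup is frozen";
    }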
On Fri, May 10, 2013 at 10:44 AM, Brenden Matthews <[email protected]> wrote:

This isn't a kernel OOM. I started the debugger and let it go overnight with no luck. It just doesn't want to budge. I'm aware of the memory limit issue, but I don't think that's the real problem here. Regardless of whether we run out of memory, the whole system shouldn't freeze when trying to kill the process. I'm not sure if this is a kernel problem or a Mesos problem, but I suspect that the issue isn't with the kernel. Walking through the Mesos code, I can't see anything that is obviously wrong either. I'm using my own branch from here: https://github.com/airbnb/mesos/tree/testing.

My next step will be to run a bunch of slaves in GDB and then launch some tasks. Hopefully I can stop them and get a backtrace. I thought it might be worth asking about it here to see if anyone else has dealt with this and knows what's going on.

These are EC2 instances running the latest Ubuntu Precise AMIs. Here's the `uname -a` from one:

    Linux i-b5dbc1d5 3.2.0-41-virtual #66-Ubuntu SMP Thu Apr 25 03:47:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

On Fri, May 10, 2013 at 10:21 AM, Benjamin Mahler <[email protected]> wrote:

Hey Brenden, what kernel version are you running? What version of the code are you running? Are you seeing this on different machines? Can you confirm you're unable to attach to the slave with gdb (that would be the easiest way to figure out what's blocking in the slave)? Is that always the last line in the log before the processes get stuck?

As an aside, it looks like we aren't properly setting the JVM heap size for TaskTrackers in the MesosScheduler for Hadoop. We try to estimate the JVM memory overhead, but it was perhaps not enough to avoid kernel OOMs.
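To make that aside concrete: the container memory for a TaskTracker has to cover more than the configured heap (thread stacks, JVM metadata, and any page cache the task generates all count against the cgroup). The numbers below are purely illustrative and are not what the Hadoop MesosScheduler actually uses:

    // Purely illustrative overhead estimate for sizing a TaskTracker
    // container; the real scheduler's formula may differ.
    #include <algorithm>
    #include <cstdint>

    uint64_t containerMemMB(uint64_t jvmHeapMB)
    {
      // JVM overhead estimated as 25% of the heap, or 256 MB, whichever
      // is larger.
      const uint64_t overheadMB = std::max<uint64_t>(jvmHeapMB / 4, 256);
      return jvmHeapMB + overheadMB;
    }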
On Fri, May 10, 2013 at 9:57 AM, Brenden Matthews <[email protected]> wrote:

Hey folks,

I'm bumping into a problem frequently. When an executor exceeds its memory limit and the executor is deemed in need of execution, the entire system becomes stuck (though I'm unsure as to whether it's stuck before or after attempting to freeze). Here's what the mesos slave log looks like:

    I0510 04:22:37.935608 13622 cgroups_isolator.cpp:1023] OOM notifier is triggered for executor executor_Task_Tracker_446 of framework 201305100055-1471680778-5050-18199-0000 with uuid 14a21d5a-a670-40fd-91b1-54c42b9a30e0
    I0510 04:22:37.935925 13622 cgroups_isolator.cpp:1068] OOM detected for executor executor_Task_Tracker_446 of framework 201305100055-1471680778-5050-18199-0000 with uuid 14a21d5a-a670-40fd-91b1-54c42b9a30e0
    I0510 04:22:37.936982 13622 cgroups_isolator.cpp:1109] Memory limit exceeded: Requested: 2570MB Used: 2570MB
    MEMORY STATISTICS:
    cache 1682083840
    rss 1012756480
    mapped_file 1085440
    pgpgin 22863635
    pgpgout 22205715
    swap 0
    pgfault 11881912
    pgmajfault 304
    inactive_anon 0
    active_anon 1012756480
    inactive_file 1065136128
    active_file 616947712
    unevictable 0
    hierarchical_memory_limit 2694840320
    hierarchical_memsw_limit 9223372036854775807
    total_cache 1682083840
    total_rss 1012756480
    total_mapped_file 1085440
    total_pgpgin 22863635
    total_pgpgout 22205715
    total_swap 0
    total_pgfault 11881912
    total_pgmajfault 304
    total_inactive_anon 0
    total_active_anon 1012756480
    total_inactive_file 1065136128
    total_active_file 616947712
    total_unevictable 0
    I0510 04:22:37.965666 13622 cgroups_isolator.cpp:620] Killing executor executor_Task_Tracker_446 of framework 201305100055-1471680778-5050-18199-0000
    I0510 04:22:37.967418 13621 cgroups.cpp:1175] Trying to freeze cgroup /cgroup/mesos/framework_201305100055-1471680778-5050-18199-0000_executor_executor_Task_Tracker_446_tag_14a21d5a-a670-40fd-91b1-54c42b9a30e0

After the last line in the log, both the executing task and the mesos process are stuck indefinitely. Usually the way to resolve this is `killall -9 java lt-mesos-slave` or a reboot.

If I cat /cgroup/mesos/framework.../freezer.state, it shows that it's freezing. Since the entire system is stuck, I cannot attach a debugger to either the JVM or mesos.
Any ideas? This problem is very frustrating and I'm stuck at this point.
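One detail worth pulling out of the log above: in the memory statistics, cache (1682083840) plus rss (1012756480) is exactly 2694840320 bytes, which is the 2570 MB hierarchical_memory_limit. Most of the usage is page cache rather than anonymous memory, which the cgroup memory controller charges against the limit, so a TaskTracker doing heavy disk I/O can trip the OOM notifier even with modest heap usage. A small sketch of that arithmetic, reading memory.stat directly (illustrative only, not the isolator's actual accounting):

    // Sum the cache and rss fields of a cgroup's memory.stat and compare
    // against the cgroup's memory limit.
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main(int argc, char** argv)
    {
      if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " /cgroup/mesos/<cgroup>" << std::endl;
        return 1;
      }

      const std::string path(argv[1]);

      uint64_t cache = 0, rss = 0;
      std::ifstream stat(path + "/memory.stat");
      std::string key;
      uint64_t value;
      while (stat >> key >> value) {
        if (key == "cache") cache = value;
        if (key == "rss") rss = value;
      }

      uint64_t limit = 0;
      std::ifstream limitFile(path + "/memory.limit_in_bytes");
      limitFile >> limit;

      // In the log above: cache (1682083840) + rss (1012756480) == 2694840320,
      // exactly the 2570 MB limit.
      std::cout << "cache + rss = " << (cache + rss)
                << " bytes, limit = " << limit << " bytes" << std::endl;

      return 0;
    }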
