I'll run some tests over the weekend and see how it goes. I've made two small changes: increasing the JVM overhead in the scheduler, and trying this kill -> freeze -> kill -> thaw pattern. When I kill the process manually the system recovers, so I'm willing to reckon this is a reasonably correct order of events. I find it very unlikely that there is actually a kernel bug here.
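For reference, the "kill" steps in that sequence boil down to signalling every pid currently listed in the cgroup's tasks file. Below is a minimal standalone sketch of that step under cgroup v1; it is not the Mesos code, and the cgroup path and the killCgroupTasks name are illustrative only.

    // Sketch: send SIGKILL to every pid currently listed in a cgroup's
    // tasks file (cgroup v1). Illustrative only; not the Mesos implementation.
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    #include <signal.h>
    #include <sys/types.h>

    // Read the pids in <cgroup>/tasks and send each one SIGKILL. Anything
    // forked after the file is read is missed, which is exactly the race
    // the freeze step (or a prior SIGSTOP pass) is meant to close.
    std::vector<pid_t> killCgroupTasks(const std::string& cgroup)
    {
      std::vector<pid_t> killed;
      std::ifstream tasks(cgroup + "/tasks");
      pid_t pid;
      while (tasks >> pid) {
        if (::kill(pid, SIGKILL) == 0) {
          killed.push_back(pid);
        }
      }
      return killed;
    }

    int main(int argc, char** argv)
    {
      // Hypothetical cgroup path; pass the real one as an argument.
      std::string cgroup =
          argc > 1 ? argv[1] : "/cgroup/mesos/some_executor_cgroup";
      std::cout << "Sent SIGKILL to " << killCgroupTasks(cgroup).size()
                << " processes" << std::endl;
      return 0;
    }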
On Fri, May 10, 2013 at 4:10 PM, Benjamin Mahler <[email protected]> wrote:

The purpose of first freezing the cgroup is to attempt to atomically kill off the group. If there's an issue in freezing a cgroup, that's problematic. It's possible the kernel is misbehaving, or our code is misbehaving, but we should try to figure out what's causing the freezing issue.

Sending SIGKILL to all the pids without at least stopping them is going to be incorrect; we have a killtree.sh that you can look at to get an idea of how we stop each process first (when using process isolation). For cgroups isolation, the freezer makes this stopping process simple for us :)

On Fri, May 10, 2013 at 3:58 PM, Brenden Matthews <[email protected]> wrote:

I can only perform tasks which don't touch the same partition that the slave reads/writes from (i.e., /tmp). It seems that the entire device becomes locked and I'm not able to touch it.

What about doing the following sequence:

killTasks():

    chain = kill()                          // Kill current tasks.
      .then(defer(self(), &Self::freeze))   // Freeze the cgroup.
      .then(defer(self(), &Self::kill))     // Kill any remaining (i.e., recently forked) tasks.
      .then(defer(self(), &Self::thaw))     // Thaw cgroup to deliver signals.
      .then(defer(self(), &Self::empty));   // Wait until cgroup is empty.

Perhaps this will help to un-stick the stuck processes before trying to freeze the whole thing? I might try it and see how it goes.

On Fri, May 10, 2013 at 2:58 PM, Vinod Kone <[email protected]> wrote:

Hi Brenden,

The reason we do a freeze is so that no new processes are forked while we are killing them. For what it is worth, we have been running cgroups in production at Twitter and I don't remember seeing this issue. Though, I must admit, we haven't run Hadoop TaskTrackers with cgroups here. So, I would like to understand if this is a mesos/kernel bug before we plan a fix.

IIUC, you were not able to gdb into the mesos slave when it is stuck? Is this because the machine itself is hung?

On Fri, May 10, 2013 at 2:40 PM, Brenden Matthews <[email protected]> wrote:

I think that the kernel logs only reveal what we already know: some process is stuck because it's blocked on a disk read/write. It doesn't explain why it's blocked, or what's blocking it. I suspect that the freezer itself is the culprit. When we want to kill a process, maybe we should just kill it rather than freezing it and then killing it?

On Fri, May 10, 2013 at 2:26 PM, Benjamin Mahler <[email protected]> wrote:

The kernel logs sound concerning to me and a cursory google search reveals some kernel bug tickets. Is there a time correlation there?
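To make the ordering Benjamin describes above concrete: with process isolation, killtree-style termination stops every process first so the tree cannot change underneath it, then delivers SIGKILL, then SIGCONT. The following is a rough sketch of that stop-then-kill idea only; it is not the killtree.sh logic itself, and the pid list is a placeholder for whatever tree-walking would produce.

    // Sketch of the stop -> kill -> continue ordering used to take down a
    // process tree without the freezer (the idea behind killtree.sh, not
    // its code).
    #include <iostream>
    #include <vector>

    #include <signal.h>
    #include <sys/types.h>

    // Signal every pid in the set, ignoring failures (a pid may already
    // have exited).
    void signalAll(const std::vector<pid_t>& pids, int sig)
    {
      for (pid_t pid : pids) {
        ::kill(pid, sig);
      }
    }

    int main()
    {
      // Placeholder pid list; in practice the tree (or the cgroup's tasks
      // file) would be walked *after* the SIGSTOP pass, so no new children
      // can appear between enumeration and the kill.
      std::vector<pid_t> pids = {12345, 12346, 12347};

      signalAll(pids, SIGSTOP);  // Stop first so nothing can fork mid-kill.
      signalAll(pids, SIGKILL);  // SIGKILL is delivered even to stopped pids.
      signalAll(pids, SIGCONT);  // Harmless after SIGKILL; required if a
                                 // softer signal (e.g. SIGTERM) were used.
      std::cout << "Signalled " << pids.size() << " pids" << std::endl;
      return 0;
    }

With cgroups isolation the freezer plays the role of the SIGSTOP pass, which is why the discussion in this thread centres on why the freeze itself hangs.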
On Fri, May 10, 2013 at 1:13 PM, Brenden Matthews <[email protected]> wrote:

So after cat'ing around in /proc, I can see that the JVM is blocking on disk I/O apparently:

    root@i-bde90adc:~# cat /proc/24689/status
    Name:   java
    State:  D (disk sleep)
    Tgid:   24689
    Pid:    24689
    PPid:   24629
    TracerPid:      0
    Uid:    997     997     997     997
    Gid:    997     997     997     997
    FDSize: 512
    Groups:
    VmPeak: 2198700 kB
    VmSize: 2196016 kB
    VmLck:        0 kB
    VmPin:        0 kB
    VmHWM:   305276 kB
    VmRSS:   296828 kB
    VmData: 2040128 kB
    VmStk:      156 kB
    VmExe:        4 kB
    VmLib:    32692 kB
    VmPTE:     1292 kB
    VmSwap:       0 kB
    Threads:        104
    SigQ:   0/180427
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000000
    SigBlk: 0000000000000000
    SigIgn: 0000000000000000
    SigCgt: 2000000181005ccf
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: ffffffffffffffff
    Cpus_allowed:   ffff
    Cpus_allowed_list:      0-15
    Mems_allowed:   00000000,00000001
    Mems_allowed_list:      0
    voluntary_ctxt_switches:        25
    nonvoluntary_ctxt_switches:     0

Mesos looks okay:

    root@i-bde90adc:~# cat /proc/3684/status
    Name:   lt-mesos-slave
    State:  S (sleeping)
    Tgid:   3684
    Pid:    3684
    PPid:   1131
    TracerPid:      0
    Uid:    0       0       0       0
    Gid:    0       0       0       0
    FDSize: 256
    Groups:
    VmPeak: 1531868 kB
    VmSize: 1531864 kB
    VmLck:        0 kB
    VmPin:        0 kB
    VmHWM:    15272 kB
    VmRSS:    15272 kB
    VmData: 1402012 kB
    VmStk:      136 kB
    VmExe:      152 kB
    VmLib:    31824 kB
    VmPTE:      416 kB
    VmSwap:       0 kB
    Threads:        20
    SigQ:   1/180427
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000000
    SigBlk: 0000000000000000
    SigIgn: 0000000000000000
    SigCgt: 0000000180000000
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    Cpus_allowed:   ffff
    Cpus_allowed_list:      0-15
    Mems_allowed:   00000000,00000001
    Mems_allowed_list:      0
    voluntary_ctxt_switches:        69
    nonvoluntary_ctxt_switches:     0

Is it possible that changing the state of the freezer from thawed -> frozen causes all disk I/O to stop? I'm not sure where to look to mitigate this.
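One way to narrow down what is reported above is to dump the cgroup's freezer.state next to the State: field of every task in it, so any process stuck in D (uninterruptible disk sleep) is visible at the moment the freeze hangs. A small diagnostic sketch follows, assuming the cgroup v1 layout shown in these logs; the default path is a placeholder.

    // Sketch: print a cgroup's freezer.state alongside the State: line of
    // every task in it, to spot processes stuck in D (uninterruptible) sleep.
    #include <fstream>
    #include <iostream>
    #include <string>

    #include <sys/types.h>

    int main(int argc, char** argv)
    {
      // Placeholder path; pass the real cgroup as the first argument.
      std::string cgroup =
          argc > 1 ? argv[1] : "/cgroup/mesos/some_executor_cgroup";

      std::ifstream freezer(cgroup + "/freezer.state");
      std::string state;
      std::getline(freezer, state);
      std::cout << "freezer.state: " << state << std::endl;  // FREEZING vs FROZEN

      std::ifstream tasks(cgroup + "/tasks");
      pid_t pid;
      while (tasks >> pid) {
        std::ifstream status("/proc/" + std::to_string(pid) + "/status");
        std::string line;
        while (std::getline(status, line)) {
          if (line.compare(0, 6, "State:") == 0) {
            // "D (disk sleep)" here means the task cannot react to the
            // freezer until its outstanding I/O completes.
            std::cout << pid << " " << line << std::endl;
            break;
          }
        }
      }
      return 0;
    }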
On Fri, May 10, 2013 at 11:58 AM, Brenden Matthews <[email protected]> wrote:

The kernel gives me a bunch of "soft lockup: CPU stuck!" and "task java blocked for more than 120 seconds". I think these are fairly generic errors and can be caused by any sort of race condition on the system. Strace also isn't helpful, as it doesn't tell me which resource the processes are stuck on.

I'll try and check /proc/pid/status next time it happens. Other tools like lsof are unable to run.

On Fri, May 10, 2013 at 11:49 AM, Vinod Kone <[email protected]> wrote:

---------- Forwarded message ----------
From: Vinod Kone <[email protected]>
Date: Fri, May 10, 2013 at 11:25 AM
Subject: Re: OOM killer stuck when freezing cgroup
To: "[email protected]" <[email protected]>

Anything in the kernel logs during this time?

@vinodkone

On Fri, May 10, 2013 at 11:13 AM, Benjamin Mahler <[email protected]> wrote:

Sorry about that miscommunication, I meant to say "from the kernel OOM killer", and you're right that the OOM shouldn't be the source of the problem here (just trying to make a mental note that our accounting is wrong); OOMs are indeed expected as part of normal operation on the slave.

It's tough to say what the issue is as I don't have any experience running on newer kernels like 3.2. I would recommend augmenting the logging in the cgroups::Freezer to try to nail down at which point the slave is blocking. In particular, in Freezer::freeze() and Freezer::watchFrozen().

Also, do you have /proc/<slave_pid>/status output for the slave when it's frozen?

It will be really useful to know if that is consistently the last log line you see before things lock up as well, as that helps narrow this down.
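For context on the logging suggestion: the cgroup v1 freezer interface that Freezer::freeze() and Freezer::watchFrozen() sit on top of is just a write of FROZEN to freezer.state followed by re-reads until the kernel reports FROZEN rather than FREEZING. Below is a bare-bones sketch of that handshake with logging at each read; this is not the Mesos Freezer code, only the underlying protocol.

    // Sketch of the cgroup v1 freezer handshake: write FROZEN to
    // freezer.state, then re-read it until the kernel reports FROZEN
    // (it reads FREEZING while tasks are still being stopped). Logging
    // every read is what pinpoints where a hang like the one above occurs.
    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    std::string readState(const std::string& cgroup)
    {
      std::ifstream in(cgroup + "/freezer.state");
      std::string state;
      in >> state;
      return state;
    }

    bool freeze(const std::string& cgroup, int maxAttempts)
    {
      {
        std::ofstream out(cgroup + "/freezer.state");
        out << "FROZEN" << std::endl;
      }
      std::cout << "Wrote FROZEN to " << cgroup << "/freezer.state" << std::endl;

      for (int i = 0; i < maxAttempts; i++) {
        std::string state = readState(cgroup);
        std::cout << "Attempt " << i << ": freezer.state=" << state << std::endl;
        if (state == "FROZEN") {
          return true;
        }
        // A FREEZING state that never converges usually means some task is
        // stuck in D state and cannot be stopped by the freezer.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }
      return false;
    }

    int main(int argc, char** argv)
    {
      // Placeholder path; pass the real cgroup as the first argument.
      std::string cgroup =
          argc > 1 ? argv[1] : "/cgroup/mesos/some_executor_cgroup";
      std::cout << (freeze(cgroup, 50) ? "frozen" : "gave up while FREEZING")
                << std::endl;
      return 0;
    }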
On Fri, May 10, 2013 at 10:44 AM, Brenden Matthews <[email protected]> wrote:

This isn't a kernel OOM. I started the debugger and let it go overnight with no luck. It just doesn't want to budge. I'm aware of the memory limit issue, but I don't think that's the real problem here. Regardless of whether we run out of memory, the whole system shouldn't freeze when trying to kill the process. I'm not sure if this is a kernel problem or a Mesos problem, but I suspect that the issue isn't with the kernel. Walking through the Mesos code, I can't see anything that is obviously wrong either. I'm using my own branch from here: https://github.com/airbnb/mesos/tree/testing.

My next step will be to run a bunch of slaves in GDB and then launch some tasks. Hopefully I can stop them and get a backtrace. I thought it might be worth asking about it here to see if anyone else has dealt with this and knows what's going on.

These are EC2 instances running the latest Ubuntu Precise AMIs. Here's the `uname -a` from one:

    Linux i-b5dbc1d5 3.2.0-41-virtual #66-Ubuntu SMP Thu Apr 25 03:47:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

On Fri, May 10, 2013 at 10:21 AM, Benjamin Mahler <[email protected]> wrote:

Hey Brenden, what kernel version are you running? What version of the code are you running? Are you seeing this on different machines? Can you confirm you're unable to attach to the slave with gdb (that would be the easiest way to figure out what's blocking in the slave)? Is that always the last line in the log before the processes get stuck?

As an aside, it looks like we aren't properly setting the JVM heap size for TaskTrackers in the MesosScheduler for Hadoop. We try to estimate the JVM memory overhead, but it was perhaps not enough to avoid kernel OOMs.
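The aside about heap sizing is essentially arithmetic: the cgroup memory limit has to cover the configured heap plus the JVM's non-heap overhead (thread stacks, code cache, permgen, native buffers), or the OOM path discussed in this thread gets exercised. A purely hypothetical illustration of that kind of estimate follows; the constants and names are invented and are not taken from the Hadoop-on-Mesos scheduler.

    // Hypothetical illustration of sizing a cgroup memory limit for a JVM
    // task so the heap plus non-heap overhead fits inside it. The constants
    // are invented for illustration; they are not the scheduler's values.
    #include <algorithm>
    #include <iostream>

    int main()
    {
      const int heapMB = 2048;              // The TaskTracker's -Xmx setting.
      const double overheadFraction = 0.25; // Assumed non-heap overhead: thread
                                            // stacks, code cache, permgen,
                                            // native buffers.
      const int minOverheadMB = 384;        // Floor so small heaps still fit.

      const int overheadMB =
          std::max(minOverheadMB, static_cast<int>(heapMB * overheadFraction));
      const int cgroupLimitMB = heapMB + overheadMB;

      // If the limit were set to the heap alone, the JVM's extra allocations
      // would push the cgroup over its limit and trigger the OOM handling
      // described in this thread.
      std::cout << "heap=" << heapMB << "MB overhead=" << overheadMB
                << "MB -> cgroup limit=" << cgroupLimitMB << "MB" << std::endl;
      return 0;
    }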
On Fri, May 10, 2013 at 9:57 AM, Brenden Matthews <[email protected]> wrote:

Hey folks,

I'm bumping into a problem frequently. When an executor exceeds its memory limit and the executor is deemed in need of execution, the entire system becomes stuck (though I'm unsure as to whether it's stuck before or after attempting to freeze). Here's what the mesos slave log looks like:

    I0510 04:22:37.935608 13622 cgroups_isolator.cpp:1023] OOM notifier is triggered for executor executor_Task_Tracker_446 of framework 201305100055-1471680778-5050-18199-0000 with uuid 14a21d5a-a670-40fd-91b1-54c42b9a30e0
    I0510 04:22:37.935925 13622 cgroups_isolator.cpp:1068] OOM detected for executor executor_Task_Tracker_446 of framework 201305100055-1471680778-5050-18199-0000 with uuid 14a21d5a-a670-40fd-91b1-54c42b9a30e0
    I0510 04:22:37.936982 13622 cgroups_isolator.cpp:1109] Memory limit exceeded: Requested: 2570MB Used: 2570MB
    MEMORY STATISTICS:
    cache 1682083840
    rss 1012756480
    mapped_file 1085440
    pgpgin 22863635
    pgpgout 22205715
    swap 0
    pgfault 11881912
    pgmajfault 304
    inactive_anon 0
    active_anon 1012756480
    inactive_file 1065136128
    active_file 616947712
    unevictable 0
    hierarchical_memory_limit 2694840320
    hierarchical_memsw_limit 9223372036854775807
    total_cache 1682083840
    total_rss 1012756480
    total_mapped_file 1085440
    total_pgpgin 22863635
    total_pgpgout 22205715
    total_swap 0
    total_pgfault 11881912
    total_pgmajfault 304
    total_inactive_anon 0
    total_active_anon 1012756480
    total_inactive_file 1065136128
    total_active_file 616947712
    total_unevictable 0
    I0510 04:22:37.965666 13622 cgroups_isolator.cpp:620] Killing executor executor_Task_Tracker_446 of framework 201305100055-1471680778-5050-18199-0000
    I0510 04:22:37.967418 13621 cgroups.cpp:1175] Trying to freeze cgroup /cgroup/mesos/framework_201305100055-1471680778-5050-18199-0000_executor_executor_Task_Tracker_446_tag_14a21d5a-a670-40fd-91b1-54c42b9a30e0

After the last line in the log, both the executing task and the mesos process are stuck indefinitely. Usually the way to resolve this is `killall -9 java lt-mesos-slave` or a reboot.
If I cat /cgroup/mesos/framework.../freezer.state, it shows that it's freezing. Since the entire system is stuck, I cannot attach a debugger to either the JVM or mesos.

Any ideas? This problem is very frustrating and I'm stuck at this point.
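For completeness, the "OOM notifier is triggered" line in the log above corresponds to the standard cgroup v1 memory OOM notification mechanism: an eventfd registered through cgroup.event_control against memory.oom_control becomes readable when the cgroup hits its limit, and the freeze-and-kill sequence discussed in this thread starts from there. A minimal sketch of that registration follows (the path is a placeholder, and this is not the cgroups_isolator code).

    // Sketch of the cgroup v1 OOM notification mechanism behind the
    // "OOM notifier is triggered" log line: register an eventfd against
    // memory.oom_control via cgroup.event_control, then block on read().
    #include <cstdint>
    #include <cstdio>
    #include <string>

    #include <fcntl.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(int argc, char** argv)
    {
      // Placeholder path; pass the real cgroup as the first argument.
      std::string cgroup =
          argc > 1 ? argv[1] : "/cgroup/mesos/some_executor_cgroup";

      int efd = ::eventfd(0, 0);
      int ofd = ::open((cgroup + "/memory.oom_control").c_str(), O_RDONLY);
      if (efd < 0 || ofd < 0) {
        std::perror("eventfd/open");
        return 1;
      }

      // The registration line is "<eventfd fd> <control file fd>".
      std::string registration = std::to_string(efd) + " " + std::to_string(ofd);
      int cfd = ::open((cgroup + "/cgroup.event_control").c_str(), O_WRONLY);
      if (cfd < 0 ||
          ::write(cfd, registration.c_str(), registration.size()) < 0) {
        std::perror("cgroup.event_control");
        return 1;
      }

      // Blocks until the kernel reports an OOM in this cgroup; the
      // freeze-and-kill sequence discussed above would start from here.
      uint64_t count = 0;
      if (::read(efd, &count, sizeof(count)) == sizeof(count)) {
        std::printf("OOM event received (count=%llu)\n",
                    static_cast<unsigned long long>(count));
      }
      return 0;
    }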
