I'll run some tests over the weekend and see how it goes. I've made two small changes: increasing the JVM overhead in the scheduler, and trying this kill -> freeze -> kill -> thaw pattern. When I kill the process manually the system recovers, so I'm willing to reckon this is a reasonably correct order of events. I find it very unlikely that there is actually a kernel bug here.
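For reference, the "kill" steps in that sequence boil down to signalling every pid currently listed in the cgroup's tasks file. Below is a minimal standalone sketch of that step under cgroup v1; it is not the Mesos code, and the cgroup path and the killCgroupTasks name are illustrative only.

    // Sketch: send SIGKILL to every pid currently listed in a cgroup's
    // tasks file (cgroup v1). Illustrative only; not the Mesos implementation.
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    #include <signal.h>
    #include <sys/types.h>

    // Read the pids in <cgroup>/tasks and send each one SIGKILL. Anything
    // forked after the file is read is missed, which is exactly the race
    // the freeze step (or a prior SIGSTOP pass) is meant to close.
    std::vector<pid_t> killCgroupTasks(const std::string& cgroup)
    {
      std::vector<pid_t> killed;
      std::ifstream tasks(cgroup + "/tasks");
      pid_t pid;
      while (tasks >> pid) {
        if (::kill(pid, SIGKILL) == 0) {
          killed.push_back(pid);
        }
      }
      return killed;
    }

    int main(int argc, char** argv)
    {
      // Hypothetical cgroup path; pass the real one as an argument.
      std::string cgroup =
          argc > 1 ? argv[1] : "/cgroup/mesos/some_executor_cgroup";
      std::cout << "Sent SIGKILL to " << killCgroupTasks(cgroup).size()
                << " processes" << std::endl;
      return 0;
    }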
On Fri, May 10, 2013 at 4:10 PM, Benjamin Mahler <[email protected]> wrote:

The purpose of first freezing the cgroup is to attempt to atomically kill off the group. If there's an issue in freezing a cgroup, that's problematic. It's possible the kernel is misbehaving, or our code is misbehaving, but we should try to figure out what's causing the freezing issue.

Sending SIGKILL to all the pids without at least stopping them is going to be incorrect; we have a killtree.sh that you can look at to get an idea of how we stop each process first (when using process isolation). For cgroups isolation, the freezer makes this stopping process simple for us :)

On Fri, May 10, 2013 at 3:58 PM, Brenden Matthews <[email protected]> wrote:

I can only perform tasks which don't touch the same partition that the slave reads/writes from (i.e., /tmp). It seems that the entire device becomes locked and I'm not able to touch it.

What about doing the following sequence:

killTasks():

    chain = kill()                          // Kill current tasks.
      .then(defer(self(), &Self::freeze))   // Freeze the cgroup.
      .then(defer(self(), &Self::kill))     // Kill any remaining (i.e., recently forked) tasks.
      .then(defer(self(), &Self::thaw))     // Thaw cgroup to deliver signals.
      .then(defer(self(), &Self::empty));   // Wait until cgroup is empty.

Perhaps this will help to un-stick the stuck processes before trying to freeze the whole thing? I might try it and see how it goes.

On Fri, May 10, 2013 at 2:58 PM, Vinod Kone <[email protected]> wrote:

Hi Brenden,

The reason we do a freeze is so that no new processes are forked while we are killing them. For what it is worth, we have been running cgroups in production at Twitter and I don't remember seeing this issue. Though, I must admit, we haven't run Hadoop TaskTrackers with cgroups here. So, I would like to understand if this is a mesos/kernel bug before we plan a fix.

IIUC, you were not able to gdb into the mesos slave when it is stuck? Is this because the machine itself is hung?

On Fri, May 10, 2013 at 2:40 PM, Brenden Matthews <[email protected]> wrote:

I think that the kernel logs only reveal what we already know: some process is stuck because it's blocked on a disk read/write. It doesn't explain why it's blocked, or what's blocking it. I suspect that the freezer itself is the culprit. When we want to kill a process, maybe we should just kill it rather than freezing it and then killing it?

On Fri, May 10, 2013 at 2:26 PM, Benjamin Mahler <[email protected]> wrote:

The kernel logs sound concerning to me and a cursory google search reveals some kernel bug tickets. Is there a time correlation there?
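To make the ordering Benjamin describes above concrete: with process isolation, killtree-style termination stops every process first so the tree cannot change underneath it, then delivers SIGKILL, then SIGCONT. The following is a rough sketch of that stop-then-kill idea only; it is not the killtree.sh logic itself, and the pid list is a placeholder for whatever tree-walking would produce.

    // Sketch of the stop -> kill -> continue ordering used to take down a
    // process tree without the freezer (the idea behind killtree.sh, not
    // its code).
    #include <iostream>
    #include <vector>

    #include <signal.h>
    #include <sys/types.h>

    // Signal every pid in the set, ignoring failures (a pid may already
    // have exited).
    void signalAll(const std::vector<pid_t>& pids, int sig)
    {
      for (pid_t pid : pids) {
        ::kill(pid, sig);
      }
    }

    int main()
    {
      // Placeholder pid list; in practice the tree (or the cgroup's tasks
      // file) would be walked *after* the SIGSTOP pass, so no new children
      // can appear between enumeration and the kill.
      std::vector<pid_t> pids = {12345, 12346, 12347};

      signalAll(pids, SIGSTOP);  // Stop first so nothing can fork mid-kill.
      signalAll(pids, SIGKILL);  // SIGKILL is delivered even to stopped pids.
      signalAll(pids, SIGCONT);  // Harmless after SIGKILL; required if a
                                 // softer signal (e.g. SIGTERM) were used.
      std::cout << "Signalled " << pids.size() << " pids" << std::endl;
      return 0;
    }

With cgroups isolation the freezer plays the role of the SIGSTOP pass, which is why the discussion in this thread centres on why the freeze itself hangs.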
On Fri, May 10, 2013 at 1:13 PM, Brenden Matthews <[email protected]> wrote:

So after cat'ing around in /proc, I can see that the JVM is blocking on disk I/O apparently:

    root@i-bde90adc:~# cat /proc/24689/status
    Name:   java
    State:  D (disk sleep)
    Tgid:   24689
    Pid:    24689
    PPid:   24629
    TracerPid:      0
    Uid:    997     997     997     997
    Gid:    997     997     997     997
    FDSize: 512
    Groups:
    VmPeak: 2198700 kB
    VmSize: 2196016 kB
    VmLck:        0 kB
    VmPin:        0 kB
    VmHWM:   305276 kB
    VmRSS:   296828 kB
    VmData: 2040128 kB
    VmStk:      156 kB
    VmExe:        4 kB
    VmLib:    32692 kB
    VmPTE:     1292 kB
    VmSwap:       0 kB
    Threads:        104
    SigQ:   0/180427
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000000
    SigBlk: 0000000000000000
    SigIgn: 0000000000000000
    SigCgt: 2000000181005ccf
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: ffffffffffffffff
    Cpus_allowed:   ffff
    Cpus_allowed_list:      0-15
    Mems_allowed:   00000000,00000001
    Mems_allowed_list:      0
    voluntary_ctxt_switches:        25
    nonvoluntary_ctxt_switches:     0

Mesos looks okay:

    root@i-bde90adc:~# cat /proc/3684/status
    Name:   lt-mesos-slave
    State:  S (sleeping)
    Tgid:   3684
    Pid:    3684
    PPid:   1131
    TracerPid:      0
    Uid:    0       0       0       0
    Gid:    0       0       0       0
    FDSize: 256
    Groups:
    VmPeak: 1531868 kB
    VmSize: 1531864 kB
    VmLck:        0 kB
    VmPin:        0 kB
    VmHWM:    15272 kB
    VmRSS:    15272 kB
    VmData: 1402012 kB
    VmStk:      136 kB
    VmExe:      152 kB
    VmLib:    31824 kB
    VmPTE:      416 kB
    VmSwap:       0 kB
    Threads:        20
    SigQ:   1/180427
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000000
    SigBlk: 0000000000000000
    SigIgn: 0000000000000000
    SigCgt: 0000000180000000
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    Cpus_allowed:   ffff
    Cpus_allowed_list:      0-15
    Mems_allowed:   00000000,00000001
    Mems_allowed_list:      0
    voluntary_ctxt_switches:        69
    nonvoluntary_ctxt_switches:     0

Is it possible that changing the state of the freezer from thawed -> frozen causes all disk I/O to stop? I'm not sure where to look to mitigate this.
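One way to narrow down what is reported above is to dump the cgroup's freezer.state next to the State: field of every task in it, so any process stuck in D (uninterruptible disk sleep) is visible at the moment the freeze hangs. A small diagnostic sketch follows, assuming the cgroup v1 layout shown in these logs; the default path is a placeholder.

    // Sketch: print a cgroup's freezer.state alongside the State: line of
    // every task in it, to spot processes stuck in D (uninterruptible) sleep.
    #include <fstream>
    #include <iostream>
    #include <string>

    #include <sys/types.h>

    int main(int argc, char** argv)
    {
      // Placeholder path; pass the real cgroup as the first argument.
      std::string cgroup =
          argc > 1 ? argv[1] : "/cgroup/mesos/some_executor_cgroup";

      std::ifstream freezer(cgroup + "/freezer.state");
      std::string state;
      std::getline(freezer, state);
      std::cout << "freezer.state: " << state << std::endl;  // FREEZING vs FROZEN

      std::ifstream tasks(cgroup + "/tasks");
      pid_t pid;
      while (tasks >> pid) {
        std::ifstream status("/proc/" + std::to_string(pid) + "/status");
        std::string line;
        while (std::getline(status, line)) {
          if (line.compare(0, 6, "State:") == 0) {
            // "D (disk sleep)" here means the task cannot react to the
            // freezer until its outstanding I/O completes.
            std::cout << pid << " " << line << std::endl;
            break;
          }
        }
      }
      return 0;
    }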
On Fri, May 10, 2013 at 11:58 AM, Brenden Matthews <[email protected]> wrote:

The kernel gives me a bunch of "soft lockup: CPU stuck!" and "task java blocked for more than 120 seconds". I think these are fairly generic errors and can be caused by any sort of race condition on the system. Strace also isn't helpful, as it doesn't tell me which resource the processes are stuck on.

I'll try and check /proc/pid/status next time it happens. Other tools like lsof are unable to run.

On Fri, May 10, 2013 at 11:49 AM, Vinod Kone <[email protected]> wrote:

---------- Forwarded message ----------
From: Vinod Kone <[email protected]>
Date: Fri, May 10, 2013 at 11:25 AM
Subject: Re: OOM killer stuck when freezing cgroup
To: "[email protected]" <[email protected]>

Anything in the kernel logs during this time?

@vinodkone

On Fri, May 10, 2013 at 11:13 AM, Benjamin Mahler <[email protected]> wrote:

Sorry about that miscommunication, I meant to say "from the kernel OOM killer", and you're right that the OOM shouldn't be the source of the problem here (just trying to make a mental note that our accounting is wrong); OOMs are indeed expected as part of normal operation on the slave.

It's tough to say what the issue is as I don't have any experience running on newer kernels like 3.2. I would recommend augmenting the logging in the cgroups::Freezer to try to nail down at which point the slave is blocking. In particular, in Freezer::freeze() and Freezer::watchFrozen().

Also, do you have /proc/<slave_pid>/status output for the slave when it's frozen?

It will be really useful to know if that is consistently the last log line you see before things lock up as well, as that helps narrow this down.
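For context on the logging suggestion: the cgroup v1 freezer interface that Freezer::freeze() and Freezer::watchFrozen() sit on top of is just a write of FROZEN to freezer.state followed by re-reads until the kernel reports FROZEN rather than FREEZING. Below is a bare-bones sketch of that handshake with logging at each read; this is not the Mesos Freezer code, only the underlying protocol.

    // Sketch of the cgroup v1 freezer handshake: write FROZEN to
    // freezer.state, then re-read it until the kernel reports FROZEN
    // (it reads FREEZING while tasks are still being stopped). Logging
    // every read is what pinpoints where a hang like the one above occurs.
    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    std::string readState(const std::string& cgroup)
    {
      std::ifstream in(cgroup + "/freezer.state");
      std::string state;
      in >> state;
      return state;
    }

    bool freeze(const std::string& cgroup, int maxAttempts)
    {
      {
        std::ofstream out(cgroup + "/freezer.state");
        out << "FROZEN" << std::endl;
      }
      std::cout << "Wrote FROZEN to " << cgroup << "/freezer.state" << std::endl;

      for (int i = 0; i < maxAttempts; i++) {
        std::string state = readState(cgroup);
        std::cout << "Attempt " << i << ": freezer.state=" << state << std::endl;
        if (state == "FROZEN") {
          return true;
        }
        // A FREEZING state that never converges usually means some task is
        // stuck in D state and cannot be stopped by the freezer.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }
      return false;
    }

    int main(int argc, char** argv)
    {
      // Placeholder path; pass the real cgroup as the first argument.
      std::string cgroup =
          argc > 1 ? argv[1] : "/cgroup/mesos/some_executor_cgroup";
      std::cout << (freeze(cgroup, 50) ? "frozen" : "gave up while FREEZING")
                << std::endl;
      return 0;
    }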
On Fri, May 10, 2013 at 10:44 AM, Brenden Matthews <[email protected]> wrote:

This isn't a kernel OOM. I started the debugger and let it go overnight with no luck. It just doesn't want to budge. I'm aware of the memory limit issue, but I don't think that's the real problem here. Regardless of whether we run out of memory, the whole system shouldn't freeze when trying to kill the process. I'm not sure if this is a kernel problem or a Mesos problem, but I suspect that the issue isn't with the kernel. Walking through the Mesos code, I can't see anything that is obviously wrong either. I'm using my own branch from here: https://github.com/airbnb/mesos/tree/testing.

My next step will be to run a bunch of slaves in GDB and then launch some tasks. Hopefully I can stop them and get a backtrace. I thought it might be worth asking about it here to see if anyone else has dealt with this and knows what's going on.

These are EC2 instances running the latest Ubuntu Precise AMIs. Here's the `uname -a` from one:

    Linux i-b5dbc1d5 3.2.0-41-virtual #66-Ubuntu SMP Thu Apr 25 03:47:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

On Fri, May 10, 2013 at 10:21 AM, Benjamin Mahler <[email protected]> wrote:

Hey Brenden, what kernel version are you running? What version of the code are you running? Are you seeing this on different machines? Can you confirm you're unable to attach to the slave with gdb (that would be the easiest way to figure out what's blocking in the slave)? Is that always the last line in the log before the processes get stuck?

As an aside, it looks like we aren't properly setting the JVM heap size for TaskTrackers in the MesosScheduler for Hadoop. We try to estimate the JVM memory overhead, but it was perhaps not enough to avoid kernel OOMs.
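The aside about heap sizing is essentially arithmetic: the cgroup memory limit has to cover the configured heap plus the JVM's non-heap overhead (thread stacks, code cache, permgen, native buffers), or the OOM path discussed in this thread gets exercised. A purely hypothetical illustration of that kind of estimate follows; the constants and names are invented and are not taken from the Hadoop-on-Mesos scheduler.

    // Hypothetical illustration of sizing a cgroup memory limit for a JVM
    // task so the heap plus non-heap overhead fits inside it. The constants
    // are invented for illustration; they are not the scheduler's values.
    #include <algorithm>
    #include <iostream>

    int main()
    {
      const int heapMB = 2048;              // The TaskTracker's -Xmx setting.
      const double overheadFraction = 0.25; // Assumed non-heap overhead: thread
                                            // stacks, code cache, permgen,
                                            // native buffers.
      const int minOverheadMB = 384;        // Floor so small heaps still fit.

      const int overheadMB =
          std::max(minOverheadMB, static_cast<int>(heapMB * overheadFraction));
      const int cgroupLimitMB = heapMB + overheadMB;

      // If the limit were set to the heap alone, the JVM's extra allocations
      // would push the cgroup over its limit and trigger the OOM handling
      // described in this thread.
      std::cout << "heap=" << heapMB << "MB overhead=" << overheadMB
                << "MB -> cgroup limit=" << cgroupLimitMB << "MB" << std::endl;
      return 0;
    }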
On Fri, May 10, 2013 at 9:57 AM, Brenden Matthews <[email protected]> wrote:

Hey folks,

I'm bumping into a problem frequently. When an executor exceeds its memory limit and the executor is deemed in need of execution, the entire system becomes stuck (though I'm unsure as to whether it's stuck before or after attempting to freeze). Here's what the mesos slave log looks like:

    I0510 04:22:37.935608 13622 cgroups_isolator.cpp:1023] OOM notifier is triggered for executor executor_Task_Tracker_446 of framework 201305100055-1471680778-5050-18199-0000 with uuid 14a21d5a-a670-40fd-91b1-54c42b9a30e0
    I0510 04:22:37.935925 13622 cgroups_isolator.cpp:1068] OOM detected for executor executor_Task_Tracker_446 of framework 201305100055-1471680778-5050-18199-0000 with uuid 14a21d5a-a670-40fd-91b1-54c42b9a30e0
    I0510 04:22:37.936982 13622 cgroups_isolator.cpp:1109] Memory limit exceeded: Requested: 2570MB Used: 2570MB
    MEMORY STATISTICS:
    cache 1682083840
    rss 1012756480
    mapped_file 1085440
    pgpgin 22863635
    pgpgout 22205715
    swap 0
    pgfault 11881912
    pgmajfault 304
    inactive_anon 0
    active_anon 1012756480
    inactive_file 1065136128
    active_file 616947712
    unevictable 0
    hierarchical_memory_limit 2694840320
    hierarchical_memsw_limit 9223372036854775807
    total_cache 1682083840
    total_rss 1012756480
    total_mapped_file 1085440
    total_pgpgin 22863635
    total_pgpgout 22205715
    total_swap 0
    total_pgfault 11881912
    total_pgmajfault 304
    total_inactive_anon 0
    total_active_anon 1012756480
    total_inactive_file 1065136128
    total_active_file 616947712
    total_unevictable 0
    I0510 04:22:37.965666 13622 cgroups_isolator.cpp:620] Killing executor executor_Task_Tracker_446 of framework 201305100055-1471680778-5050-18199-0000
    I0510 04:22:37.967418 13621 cgroups.cpp:1175] Trying to freeze cgroup /cgroup/mesos/framework_201305100055-1471680778-5050-18199-0000_executor_executor_Task_Tracker_446_tag_14a21d5a-a670-40fd-91b1-54c42b9a30e0

After the last line in the log, both the executing task and the mesos process are stuck indefinitely. Usually the way to resolve this is `killall -9 java lt-mesos-slave` or a reboot.
If I cat /cgroup/mesos/framework.../freezer.state, it shows that it's freezing. Since the entire system is stuck, I cannot attach a debugger to either the JVM or mesos.

Any ideas? This problem is very frustrating and I'm stuck at this point.
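For completeness, the "OOM notifier is triggered" line in the log above corresponds to the standard cgroup v1 memory OOM notification mechanism: an eventfd registered through cgroup.event_control against memory.oom_control becomes readable when the cgroup hits its limit, and the freeze-and-kill sequence discussed in this thread starts from there. A minimal sketch of that registration follows (the path is a placeholder, and this is not the cgroups_isolator code).

    // Sketch of the cgroup v1 OOM notification mechanism behind the
    // "OOM notifier is triggered" log line: register an eventfd against
    // memory.oom_control via cgroup.event_control, then block on read().
    #include <cstdint>
    #include <cstdio>
    #include <string>

    #include <fcntl.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(int argc, char** argv)
    {
      // Placeholder path; pass the real cgroup as the first argument.
      std::string cgroup =
          argc > 1 ? argv[1] : "/cgroup/mesos/some_executor_cgroup";

      int efd = ::eventfd(0, 0);
      int ofd = ::open((cgroup + "/memory.oom_control").c_str(), O_RDONLY);
      if (efd < 0 || ofd < 0) {
        std::perror("eventfd/open");
        return 1;
      }

      // The registration line is "<eventfd fd> <control file fd>".
      std::string registration = std::to_string(efd) + " " + std::to_string(ofd);
      int cfd = ::open((cgroup + "/cgroup.event_control").c_str(), O_WRONLY);
      if (cfd < 0 ||
          ::write(cfd, registration.c_str(), registration.size()) < 0) {
        std::perror("cgroup.event_control");
        return 1;
      }

      // Blocks until the kernel reports an OOM in this cgroup; the
      // freeze-and-kill sequence discussed above would start from here.
      uint64_t count = 0;
      if (::read(efd, &count, sizeof(count)) == sizeof(count)) {
        std::printf("OOM event received (count=%llu)\n",
                    static_cast<unsigned long long>(count));
      }
      return 0;
    }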
