I'd also recommend augmenting the logging in the Freezer. Freezing should not
cause the system to hang, so we'll have to figure out whether this is an issue
with our code, the kernel version you're running, or some other interaction
here.
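As a purely illustrative sketch of the kind of extra logging meant here, assuming
a hypothetical freeze path that writes FROZEN into the cgroup's freezer.state and
then polls it (this is not the actual cgroups::Freezer implementation):

  // Hypothetical sketch only: log every step of the freeze so the last line
  // in the slave log tells us exactly where things block.
  #include <chrono>
  #include <fstream>
  #include <iostream>
  #include <string>
  #include <thread>

  bool freezeWithLogging(const std::string& cgroup)
  {
    const std::string path = cgroup + "/freezer.state";

    std::cerr << "Writing FROZEN to " << path << std::endl;
    {
      std::ofstream out(path);
      out << "FROZEN" << std::endl;
      if (!out) {
        std::cerr << "Failed to write FROZEN to " << path << std::endl;
        return false;
      }
    }

    // Poll for up to ~60 seconds, logging every observed state so that a
    // cgroup stuck in FREEZING shows up in the log rather than as a silent hang.
    for (int i = 0; i < 600; i++) {
      std::ifstream in(path);
      std::string state;
      in >> state;
      std::cerr << "freezer.state is '" << state << "'" << std::endl;
      if (state == "FROZEN") {
        return true;
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    std::cerr << "Timed out waiting for " << path << " to report FROZEN" << std::endl;
    return false;
  }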
FWIW we have encountered several kernel bugs so far, depending on which kernel
version is in use, typically relating to cgroups. I suppose this is because it's
one of the less used components of the kernel. All this to say, don't rule it
out just yet :)

On Fri, May 10, 2013 at 4:23 PM, Brenden Matthews <[email protected]> wrote:

> I'll run some tests over the weekend and see how it goes. I've made two
> small changes: increasing the JVM overhead in the scheduler, and trying
> this kill -> freeze -> kill -> thaw pattern. When I kill the process
> manually the system recovers, so I'm willing to reckon this is a reasonably
> correct order of events. I find it very unlikely that there is actually a
> kernel bug here.

On Fri, May 10, 2013 at 4:10 PM, Benjamin Mahler <[email protected]> wrote:

> The purpose of first freezing the cgroup is to attempt to atomically kill
> off the group. If there's an issue in freezing a cgroup, that's problematic.
> It's possible the kernel is misbehaving, or our code is misbehaving, but we
> should try to figure out what's causing the freezing issue.
>
> Sending SIGKILL to all the pids without at least stopping them is going to
> be incorrect. We have a killtree.sh that you can look at to get an idea of
> how we stop each process first (when using process isolation). For cgroups
> isolation, the freezer makes this stopping step simple for us :)

On Fri, May 10, 2013 at 3:58 PM, Brenden Matthews <[email protected]> wrote:

> I can only perform tasks which don't touch the same partition that the
> slave reads/writes from (i.e., /tmp). It seems that the entire device
> becomes locked and I'm not able to touch it.
>
> What about doing the following sequence:
>
> killTasks():
>
>   chain = kill()                          // Kill current tasks.
>     .then(defer(self(), &Self::freeze))   // Freeze the cgroup.
>     .then(defer(self(), &Self::kill))     // Kill any remaining (i.e., recently forked) tasks.
>     .then(defer(self(), &Self::thaw))     // Thaw the cgroup to deliver signals.
>     .then(defer(self(), &Self::empty));   // Wait until the cgroup is empty.
>
> Perhaps this will help to un-stick the stuck processes before trying to
> freeze the whole thing? I might try it and see how it goes.
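A rough standalone sketch of that kill -> freeze -> kill -> thaw sequence,
written directly against a v1 freezer cgroup's control files; the helper names
below are made up for illustration and are not the Mesos cgroups API, which
chains these steps as futures like the snippet above:

  // Illustrative only: hypothetical helpers, not the Mesos cgroups:: API.
  #include <signal.h>
  #include <sys/types.h>
  #include <fstream>
  #include <string>
  #include <vector>

  // Read the pids currently in the cgroup from its 'tasks' file.
  static std::vector<pid_t> pids(const std::string& cgroup)
  {
    std::vector<pid_t> result;
    std::ifstream tasks(cgroup + "/tasks");
    pid_t pid;
    while (tasks >> pid) {
      result.push_back(pid);
    }
    return result;
  }

  // Send SIGKILL to every pid currently in the cgroup.
  static void killAll(const std::string& cgroup)
  {
    for (pid_t pid : pids(cgroup)) {
      kill(pid, SIGKILL);
    }
  }

  // Write FROZEN or THAWED into the cgroup's freezer.state file.
  static void setFreezer(const std::string& cgroup, const std::string& state)
  {
    std::ofstream out(cgroup + "/freezer.state");
    out << state << std::endl;
  }

  void killTasks(const std::string& cgroup)
  {
    killAll(cgroup);               // Kill the current tasks.
    setFreezer(cgroup, "FROZEN");  // Freeze so nothing new can fork.
    killAll(cgroup);               // Kill anything forked in the meantime.
    setFreezer(cgroup, "THAWED");  // Thaw so the pending SIGKILLs are delivered.
    // A real implementation would then wait for the 'tasks' file to be empty.
  }

The ordering matters: freezing before the second kill closes the window in which
a task could fork a child that would otherwise escape the SIGKILL pass.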
On Fri, May 10, 2013 at 2:58 PM, Vinod Kone <[email protected]> wrote:

> Hi Brenden,
>
> The reason we do a freeze is so that no new processes are forked while we
> are killing them. For what it is worth, we have been running cgroups in
> production at Twitter and I don't remember seeing this issue. Though, I
> must admit, we haven't run Hadoop TaskTrackers with cgroups here. So, I
> would like to understand whether this is a mesos/kernel bug before we plan
> a fix.
>
> IIUC, you were not able to gdb into the mesos slave when it is stuck? Is
> this because the machine itself is hung?

On Fri, May 10, 2013 at 2:40 PM, Brenden Matthews <[email protected]> wrote:

> I think that the kernel logs only reveal what we already know: some process
> is stuck because it's blocked on a disk read/write. It doesn't explain why
> it's blocked, or what's blocking it. I suspect that the freezer itself is
> the culprit. When we want to kill a process, maybe we should just kill it
> rather than freezing it and then killing it?

On Fri, May 10, 2013 at 2:26 PM, Benjamin Mahler <[email protected]> wrote:

> The kernel logs sound concerning to me, and a cursory Google search reveals
> some kernel bug tickets. Is there a time correlation there?

On Fri, May 10, 2013 at 1:13 PM, Brenden Matthews <[email protected]> wrote:

> So after cat'ing around in /proc, I can see that the JVM is apparently
> blocking on disk I/O:
>
> > root@i-bde90adc:~# cat /proc/24689/status
> > Name: java
> > State: D (disk sleep)
> > Tgid: 24689
> > Pid: 24689
> > PPid: 24629
> > TracerPid: 0
> > Uid: 997 997 997 997
> > Gid: 997 997 997 997
> > FDSize: 512
> > Groups:
> > VmPeak: 2198700 kB
> > VmSize: 2196016 kB
> > VmLck: 0 kB
> > VmPin: 0 kB
> > VmHWM: 305276 kB
> > VmRSS: 296828 kB
> > VmData: 2040128 kB
> > VmStk: 156 kB
> > VmExe: 4 kB
> > VmLib: 32692 kB
> > VmPTE: 1292 kB
> > VmSwap: 0 kB
> > Threads: 104
> > SigQ: 0/180427
> > SigPnd: 0000000000000000
> > ShdPnd: 0000000000000000
> > SigBlk: 0000000000000000
> > SigIgn: 0000000000000000
> > SigCgt: 2000000181005ccf
> > CapInh: 0000000000000000
> > CapPrm: 0000000000000000
> > CapEff: 0000000000000000
> > CapBnd: ffffffffffffffff
> > Cpus_allowed: ffff
> > Cpus_allowed_list: 0-15
> > Mems_allowed: 00000000,00000001
> > Mems_allowed_list: 0
> > voluntary_ctxt_switches: 25
> > nonvoluntary_ctxt_switches: 0
>
> Mesos looks okay:
>
> > root@i-bde90adc:~# cat /proc/3684/status
> > Name: lt-mesos-slave
> > State: S (sleeping)
> > Tgid: 3684
> > Pid: 3684
> > PPid: 1131
> > TracerPid: 0
> > Uid: 0 0 0 0
> > Gid: 0 0 0 0
> > FDSize: 256
> > Groups:
> > VmPeak: 1531868 kB
> > VmSize: 1531864 kB
> > VmLck: 0 kB
> > VmPin: 0 kB
> > VmHWM: 15272 kB
> > VmRSS: 15272 kB
> > VmData: 1402012 kB
> > VmStk: 136 kB
> > VmExe: 152 kB
> > VmLib: 31824 kB
> > VmPTE: 416 kB
> > VmSwap: 0 kB
> > Threads: 20
> > SigQ: 1/180427
> > SigPnd: 0000000000000000
> > ShdPnd: 0000000000000000
> > SigBlk: 0000000000000000
> > SigIgn: 0000000000000000
> > SigCgt: 0000000180000000
> > CapInh: 0000000000000000
> > CapPrm: ffffffffffffffff
> > CapEff: ffffffffffffffff
> > CapBnd: ffffffffffffffff
> > Cpus_allowed: ffff
> > Cpus_allowed_list: 0-15
> > Mems_allowed: 00000000,00000001
> > Mems_allowed_list: 0
> > voluntary_ctxt_switches: 69
> > nonvoluntary_ctxt_switches: 0
>
> Is it possible that changing the state of the freezer from thawed -> frozen
> causes all disk I/O to stop? I'm not sure where to look to mitigate this.
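For what it's worth, here is a small hypothetical helper (not part of Mesos)
that walks a cgroup's tasks file and prints each process's Name and State from
/proc, which makes it easy to spot anything stuck in D (uninterruptible disk
sleep) like the JVM above:

  // Hypothetical diagnostic: list every pid in a cgroup with its /proc state.
  #include <sys/types.h>
  #include <fstream>
  #include <iostream>
  #include <string>

  int main(int argc, char** argv)
  {
    if (argc != 2) {
      std::cerr << "Usage: " << argv[0] << " <cgroup directory>" << std::endl;
      return 1;
    }

    std::ifstream tasks(std::string(argv[1]) + "/tasks");
    pid_t pid;
    while (tasks >> pid) {
      std::ifstream status("/proc/" + std::to_string(pid) + "/status");
      std::string line;
      while (std::getline(status, line)) {
        // Only the Name: and State: lines are interesting here.
        if (line.compare(0, 5, "Name:") == 0 || line.compare(0, 6, "State:") == 0) {
          std::cout << pid << "\t" << line << std::endl;
        }
      }
    }
    return 0;
  }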
On Fri, May 10, 2013 at 11:58 AM, Brenden Matthews <[email protected]> wrote:

> The kernel gives me a bunch of "soft lockup: CPU stuck!" and "task java
> blocked for more than 120 seconds" messages. I think these are fairly
> generic errors and can be caused by any sort of race condition on the
> system. Strace also isn't helpful, as it doesn't tell me which resource
> the processes are stuck on.
>
> I'll try and check /proc/pid/status next time it happens. Other tools like
> lsof are unable to run.

On Fri, May 10, 2013 at 11:49 AM, Vinod Kone <[email protected]> wrote:

> ---------- Forwarded message ----------
> From: Vinod Kone <[email protected]>
> Date: Fri, May 10, 2013 at 11:25 AM
> Subject: Re: OOM killer stuck when freezing cgroup
> To: "[email protected]" <[email protected]>
>
> Anything in the kernel logs during this time?
>
> @vinodkone

On Fri, May 10, 2013 at 11:13 AM, Benjamin Mahler <[email protected]> wrote:

> Sorry about that miscommunication, I meant to say "from the kernel OOM
> killer". You're right that the OOM shouldn't be the source of the problem
> here (just trying to make a mental note that our accounting is wrong);
> OOMs are indeed expected as part of normal operation on the slave.
>
> It's tough to say what the issue is, as I don't have any experience running
> on newer kernels like 3.2. I would recommend augmenting the logging in the
> cgroups::Freezer to try to nail down at which point the slave is blocking,
> in particular in Freezer::freeze() and Freezer::watchFrozen().
>
> Also, do you have /proc/<slave_pid>/status output for the slave when it's
> frozen?
>
> It will be really useful to know if that is consistently the last log line
> you see before things lock up as well, as that helps narrow this down.
On Fri, May 10, 2013 at 10:44 AM, Brenden Matthews <[email protected]> wrote:

> This isn't a kernel OOM. I started the debugger and let it go overnight
> with no luck. It just doesn't want to budge. I'm aware of the memory limit
> issue, but I don't think that's the real problem here. Regardless of
> whether we run out of memory, the whole system shouldn't freeze when trying
> to kill the process. I'm not sure if this is a kernel problem or a Mesos
> problem, but I suspect that the issue isn't with the kernel. Walking
> through the Mesos code, I can't see anything that is obviously wrong
> either. I'm using my own branch from here:
> https://github.com/airbnb/mesos/tree/testing.
>
> My next step will be to run a bunch of slaves in GDB and then launch some
> tasks. Hopefully I can stop them and get a backtrace. I thought it might be
> worth asking about it here to see if anyone else has dealt with this and
> knows what's going on.
>
> These are EC2 instances running the latest Ubuntu Precise AMIs. Here's the
> `uname -a' from one:
>
> Linux i-b5dbc1d5 3.2.0-41-virtual #66-Ubuntu SMP Thu Apr 25 03:47:17 UTC
> 2013 x86_64 x86_64 x86_64 GNU/Linux

On Fri, May 10, 2013 at 10:21 AM, Benjamin Mahler <[email protected]> wrote:

> Hey Brenden, what kernel version are you running? What version of the code
> are you running? Are you seeing this on different machines? Can you confirm
> you're unable to attach to the slave with gdb (that would be the easiest
> way to figure out what's blocking in the slave)? Is that always the last
> line in the log before the processes get stuck?
>
> As an aside, it looks like we aren't properly setting the JVM heap size for
> TaskTrackers in the MesosScheduler for Hadoop. We try to estimate the JVM
> memory overhead, but it was perhaps not enough to avoid kernel OOMs.
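To make the overhead point concrete, here is a toy calculation; the 10% figure
and everything else in this snippet are invented for illustration and are not
what the Hadoop MesosScheduler actually uses:

  // Toy illustration of why an underestimated JVM overhead leads to cgroup OOMs.
  #include <iostream>

  int main()
  {
    const double limitMB = 2570.0;         // Memory limit on the TaskTracker's cgroup.
    const double overheadFraction = 0.10;  // Assumed (made-up) estimate of non-heap JVM usage.

    // Heap handed to the JVM via -Xmx.
    const double heapMB = limitMB * (1.0 - overheadFraction);

    // The cgroup is charged for more than the heap: JVM native memory
    // (thread stacks, GC structures, code cache) and page cache from file I/O.
    const double nativeMB = 500.0;      // made-up
    const double pageCacheMB = 1600.0;  // roughly what the statistics below report as 'cache'

    std::cout << "-Xmx" << heapMB << "m, but the total charge can reach "
              << heapMB + nativeMB + pageCacheMB << " MB against a "
              << limitMB << " MB limit" << std::endl;
    return 0;
  }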
On Fri, May 10, 2013 at 9:57 AM, Brenden Matthews <[email protected]> wrote:

> Hey folks,
>
> I'm bumping into a problem frequently. When an executor exceeds its memory
> limit and the executor is deemed in need of execution, the entire system
> becomes stuck (though I'm unsure as to whether it's stuck before or after
> attempting to freeze). Here's what the mesos slave log looks like:
>
> > I0510 04:22:37.935608 13622 cgroups_isolator.cpp:1023] OOM notifier is
> > triggered for executor executor_Task_Tracker_446 of framework
> > 201305100055-1471680778-5050-18199-0000 with uuid
> > 14a21d5a-a670-40fd-91b1-54c42b9a30e0
> > I0510 04:22:37.935925 13622 cgroups_isolator.cpp:1068] OOM detected for
> > executor executor_Task_Tracker_446 of framework
> > 201305100055-1471680778-5050-18199-0000 with uuid
> > 14a21d5a-a670-40fd-91b1-54c42b9a30e0
> > I0510 04:22:37.936982 13622 cgroups_isolator.cpp:1109] Memory limit
> > exceeded: Requested: 2570MB Used: 2570MB
> > MEMORY STATISTICS:
> > cache 1682083840
> > rss 1012756480
> > mapped_file 1085440
> > pgpgin 22863635
> > pgpgout 22205715
> > swap 0
> > pgfault 11881912
> > pgmajfault 304
> > inactive_anon 0
> > active_anon 1012756480
> > inactive_file 1065136128
> > active_file 616947712
> > unevictable 0
> > hierarchical_memory_limit 2694840320
> > hierarchical_memsw_limit 9223372036854775807
> > total_cache 1682083840
> > total_rss 1012756480
> > total_mapped_file 1085440
> > total_pgpgin 22863635
> > total_pgpgout 22205715
> > total_swap 0
> > total_pgfault 11881912
> > total_pgmajfault 304
> > total_inactive_anon 0
> > total_active_anon 1012756480
> > total_inactive_file 1065136128
> > total_active_file 616947712
> > total_unevictable 0
> > I0510 04:22:37.965666 13622 cgroups_isolator.cpp:620] Killing executor
> > executor_Task_Tracker_446 of framework
> > 201305100055-1471680778-5050-18199-0000
> > I0510 04:22:37.967418 13621 cgroups.cpp:1175] Trying to freeze cgroup
> > /cgroup/mesos/framework_201305100055-1471680778-5050-18199-0000_executor_executor_Task_Tracker_446_tag_14a21d5a-a670-40fd-91b1-54c42b9a30e0
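One thing worth noting about the statistics above, as a quick worked check:

  cache 1,682,083,840 + rss 1,012,756,480 = 2,694,840,320 bytes = hierarchical_memory_limit (2570 MB)

In other words the cgroup hit its limit exactly, with roughly two thirds of the
charge coming from page cache rather than anonymous memory, which would fit the
blocked-on-disk-I/O symptoms discussed above.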
> After the last line in the log, both the executing task and the mesos
> process are stuck indefinitely. Usually the way to resolve this is
> `killall -9 java lt-mesos-slave' or a reboot.
>
> If I cat /cgroup/mesos/framework.../freezer.state, it shows that it's
> freezing. Since the entire system is stuck, I cannot attach a debugger to
> either the JVM or mesos.
>
> Any ideas? This problem is very frustrating and I'm stuck at this point.
