Anything in the kernel logs during this time?
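In particular, the freezer and the memory cgroup usually leave traces in dmesg, and SysRq can dump the stacks of any tasks stuck in uninterruptible (D) sleep, which is what a wedged freeze tends to look like. A sketch of what I'd check (commands as on stock Ubuntu Precise; adjust paths if yours differ):

    # Recent kernel messages about the freezer, memcg, OOM, or hung tasks.
    dmesg | egrep -i 'freez|oom|memory cgroup|hung|blocked' | tail -n 50
    grep -iE 'freez|oom|hung task' /var/log/kern.log | tail -n 50

    # Ask the kernel to log the stacks of all blocked (D-state) tasks, then read them back.
    echo w | sudo tee /proc/sysrq-trigger
    dmesg | tail -n 100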

@vinodkone


On Fri, May 10, 2013 at 11:13 AM, Benjamin Mahler <[email protected]> wrote:

> Sorry about that miscommunication, I meant to say "from the kernel OOM
> killer". You're right that the OOM shouldn't be the source of the problem
> here (I was just making a mental note that our accounting is wrong); OOMs
> are indeed expected as part of normal operation on the slave.
>
> It's tough to say what the issue is, as I don't have any experience running
> on newer kernels like 3.2. I would recommend augmenting the logging in
> cgroups::Freezer to try to nail down at which point the slave is blocking,
> in particular in Freezer::freeze() and Freezer::watchFrozen().
>
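If rebuilding with extra log lines is inconvenient, a rough external alternative is to attach strace to the slave before triggering the OOM: the write of FROZEN to freezer.state and the subsequent reads show up in the trace, so you can see which call blocks. A sketch, assuming the slave binary is named lt-mesos-slave (as in the killall command further down):

    SLAVE_PID=$(pgrep -o lt-mesos-slave)
    # Follow threads, timestamp each call, and capture file-related syscalls to a file.
    sudo strace -f -tt -e trace=open,read,write,close -p "$SLAVE_PID" -o /tmp/slave-freeze.trace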
> Also, do you have /proc/<slave_pid>/status output for the slave when it's
> frozen?
>
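For the /proc/<slave_pid>/status question, something along these lines captures the useful bits in one go (same lt-mesos-slave assumption; /proc/<pid>/stack needs root and a kernel with stack tracing enabled):

    SLAVE_PID=$(pgrep -o lt-mesos-slave)
    # "State: D" would mean the slave is stuck in an uninterruptible kernel call.
    cat /proc/$SLAVE_PID/status
    # Kernel-side stack of each thread, to see where it is blocked; wchan is a cheaper fallback.
    for t in /proc/$SLAVE_PID/task/*; do
      echo "== thread ${t##*/} =="
      sudo cat $t/stack 2>/dev/null || cat $t/wchan
      echo
    done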
> It would also be really useful to know whether that is consistently the
> last log line you see before things lock up, as that helps narrow this
> down.
>
>
> On Fri, May 10, 2013 at 10:44 AM, Brenden Matthews <[email protected]> wrote:
>
> > This isn't a kernel OOM.  I started the debugger and let it go overnight
> > with no luck.  It just doesn't want to budge.  I'm aware of the memory
> > limit issue, but I don't think that's the real problem here.  Regardless
> > of whether we run out of memory, the whole system shouldn't freeze when
> > trying to kill the process.  I'm not sure if this is a kernel problem or
> > a Mesos problem, but I suspect that the issue isn't with the kernel.
> > Walking through the Mesos code, I can't see anything that is obviously
> > wrong either.  I'm using my own branch from here:
> > https://github.com/airbnb/mesos/tree/testing.
> >
> > My next step will be to run a bunch of slaves in GDB and then launch some
> > tasks.  Hopefully I can stop them and get a backtrace.  I thought it
> > might be worth asking about it here to see if anyone else has dealt with
> > this and knows what's going on.
> >
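On the gdb plan: the attach-and-backtrace step can be scripted so it is easy to run against many slaves at once. One caveat given the symptoms: ptrace cannot stop a task that is stuck in uninterruptible (D) sleep, so if the slave is wedged inside the kernel, gdb will hang on attach and the /proc/<pid>/stack output mentioned above is the more useful artifact. A sketch:

    SLAVE_PID=$(pgrep -o lt-mesos-slave)
    # Attach, dump every thread's backtrace, then exit; -batch detaches when the commands finish.
    sudo gdb -p "$SLAVE_PID" -batch \
        -ex 'set pagination off' \
        -ex 'thread apply all bt' > /tmp/slave-backtrace.txt 2>&1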
> > These are EC2 instances running the latest Ubuntu Precise AMIs.  Here's
> > the `uname -a' from one:
> >
> > Linux i-b5dbc1d5 3.2.0-41-virtual #66-Ubuntu SMP Thu Apr 25 03:47:17 UTC
> > 2013 x86_64 x86_64 x86_64 GNU/Linux
> >
> >
> >
> >
> > On Fri, May 10, 2013 at 10:21 AM, Benjamin Mahler <[email protected]> wrote:
> >
> > > Hey Brenden, what kernel version are you running? What version of the
> > > code are you running? Are you seeing this on different machines? Can
> > > you confirm you're unable to attach to the slave with gdb (that would
> > > be the easiest way to figure out what's blocking in the slave)? Is that
> > > always the last line in the log before the processes get stuck?
> > >
> > > As an aside, it looks like we aren't properly setting the JVM heap size
> > > for TaskTrackers in the MesosScheduler for Hadoop. We try to estimate
> > > the JVM memory overhead, but it was perhaps not enough to avoid kernel
> > > OOMs.
> > >
> > >
> > > On Fri, May 10, 2013 at 9:57 AM, Brenden Matthews <[email protected]> wrote:
> > >
> > > > Hey folks,
> > > >
> > > > I'm running into a problem frequently.  When an executor exceeds its
> > > > memory limit and needs to be killed, the entire system becomes stuck
> > > > (though I'm unsure whether it gets stuck before or after attempting
> > > > to freeze the cgroup).  Here's what the mesos slave log looks like:
> > > >
> > > > > I0510 04:22:37.935608 13622 cgroups_isolator.cpp:1023] OOM notifier
> > > > > is triggered for executor executor_Task_Tracker_446 of framework
> > > > > 201305100055-1471680778-5050-18199-0000 with uuid
> > > > > 14a21d5a-a670-40fd-91b1-54c42b9a30e0
> > > > > I0510 04:22:37.935925 13622 cgroups_isolator.cpp:1068] OOM detected
> > > > > for executor executor_Task_Tracker_446 of framework
> > > > > 201305100055-1471680778-5050-18199-0000 with uuid
> > > > > 14a21d5a-a670-40fd-91b1-54c42b9a30e0
> > > > > I0510 04:22:37.936982 13622 cgroups_isolator.cpp:1109] Memory limit
> > > > > exceeded: Requested: 2570MB Used: 2570MB
> > > > > MEMORY STATISTICS:
> > > > > MEMORY STATISTICS:
> > > > > cache 1682083840
> > > > > rss 1012756480
> > > > > mapped_file 1085440
> > > > > pgpgin 22863635
> > > > > pgpgout 22205715
> > > > > swap 0
> > > > > pgfault 11881912
> > > > > pgmajfault 304
> > > > > inactive_anon 0
> > > > > active_anon 1012756480
> > > > > inactive_file 1065136128
> > > > > active_file 616947712
> > > > > unevictable 0
> > > > > hierarchical_memory_limit 2694840320
> > > > > hierarchical_memsw_limit 9223372036854775807
> > > > > total_cache 1682083840
> > > > > total_rss 1012756480
> > > > > total_mapped_file 1085440
> > > > > total_pgpgin 22863635
> > > > > total_pgpgout 22205715
> > > > > total_swap 0
> > > > > total_pgfault 11881912
> > > > > total_pgmajfault 304
> > > > > total_inactive_anon 0
> > > > > total_active_anon 1012756480
> > > > > total_inactive_file 1065136128
> > > > > total_active_file 616947712
> > > > > total_unevictable 0
> > > > > I0510 04:22:37.965666 13622 cgroups_isolator.cpp:620] Killing
> > > > > executor executor_Task_Tracker_446 of framework
> > > > > 201305100055-1471680778-5050-18199-0000
> > > > > I0510 04:22:37.967418 13621 cgroups.cpp:1175] Trying to freeze
> > > > > cgroup /cgroup/mesos/framework_201305100055-1471680778-5050-18199-0000_executor_executor_Task_Tracker_446_tag_14a21d5a-a670-40fd-91b1-54c42b9a30e0
> > > >
> > > >
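Two notes on the excerpt above. First, in the memory statistics, cache (1682083840 bytes, ~1604 MB) plus rss (1012756480 bytes, ~966 MB) accounts for the full 2570 MB the isolator reports, so most of the charged usage is page cache rather than anonymous memory. Second, when a slave is wedged at the "Trying to freeze cgroup" line, the cgroup itself can be inspected from another shell; a sketch, substituting the cgroup path from the log line above:

    CG="/cgroup/mesos/<cgroup name from the log line above>"
    cat "$CG/freezer.state"   # stuck at FREEZING means the kernel never completed the freeze
    # For every task still in the cgroup, show its scheduler state and kernel stack.
    for pid in $(cat "$CG/tasks"); do
      echo "== $pid =="
      grep -E '^(Name|State):' /proc/$pid/status
      sudo cat /proc/$pid/stack 2>/dev/null
    done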
> > > > After the last line in the log, both the executing task and the mesos
> > > > process are stuck indefinitely.  Usually the way to resolve this is
> > > > `killall -9 java lt-mesos-slave' or a reboot.
> > > >
> > > > If I cat /cgroup/mesos/framework.../freezer.state, it shows that it's
> > > > freezing.  Since the entire system is stuck, I cannot attach a
> > > > debugger to either the JVM or mesos.
> > > >
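Before resorting to `killall -9' or a reboot, one cheap recovery step to try from another shell is thawing the cgroup by hand; whether it unsticks things depends on why the freeze is wedged, but it is worth attempting (same placeholder path as in the sketch above):

    echo THAWED | sudo tee "$CG/freezer.state"
    cat "$CG/freezer.state"
    # If the tasks come back to life, they can then be killed normally:
    # for pid in $(cat "$CG/tasks"); do sudo kill -9 $pid; done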
> > > > Any ideas?  This problem is very frustrating and I'm stuck at this
> > > > point.
> > > >
> > >
> >
>
