If we get a SIGTERM, the slave will terminate, and we added code to dump a stack trace on the way out, so that part is expected.
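For illustration, the dump-a-stack-trace-on-signal behavior can be sketched in Python. This is a hedged sketch only: the slave's actual handler is C++ inside Mesos, and the name `install_trace_on_signal` is made up for this example.

```python
import os
import signal
import traceback

def install_trace_on_signal(signum):
    """Install a handler that prints the current stack trace when
    `signum` arrives -- a rough sketch of the behavior described
    above (a dump on receipt of a signal, without terminating here).

    Hypothetical helper; not a Mesos API."""
    def handler(received, frame):
        # Format the stack of the interrupted frame, much like the
        # "*** SIGTERM ... stack trace: ***" lines in the slave log.
        trace = "".join(traceback.format_stack(frame))
        print("*** signal %d received; stack trace: ***" % received)
        print(trace)
    signal.signal(signum, handler)

if __name__ == "__main__":
    # Using SIGUSR1 rather than SIGTERM so the demo doesn't look
    # like a real termination request.
    install_trace_on_signal(signal.SIGUSR1)
    os.kill(os.getpid(), signal.SIGUSR1)  # triggers the dump
```

As the thread notes, a handler like this fires on _any_ signal it is registered for, which is why a routine SIGTERM can look like a crash in the logs.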
I do see some failures as well:

F0614 01:00:59.424217 29323 cgroups_isolator.cpp:1152] Failed to destroy cgroup mesos/framework_201306131743-1707151626-5050-13253-0001_executor_executor_Task_Tracker_70_tag_ab6a1b15-95f9-4a32-afd2-1e4b8ffb3b9b: Failed to kill tasks in nested cgroups: Collect failed: Failed to send Killed to process 27546: No such process

I suspect this is due to the kill code as well. The bug is in cgroups::kill, which currently assumes the cgroup is frozen! We'll have to relax that in your change (I'll comment there).

The fact that the slave never correctly registered is also interesting; glad to help debug that further!

On Thu, Jun 13, 2013 at 8:37 PM, Brenden Matthews <[email protected]> wrote:

> PS: I think it was crashing due to the cgroup killTasks() change I made.
> Maybe it was because of SIGCONT? I'll have to investigate this further.
>
> On Thu, Jun 13, 2013 at 8:10 PM, Brenden Matthews <[email protected]> wrote:
>
> > But shortly after the first 'incident', it says:
> >
> > I0614 01:00:59.320652 29256 main.cpp:119] Creating "cgroups" isolator
> > I0614 01:00:59.321116 29256 main.cpp:127] Build: 2013-05-01 18:49:40 by
> > I0614 01:00:59.321148 29256 main.cpp:128] Starting Mesos slave
> >
> > This looks like some kind of crash to me, unless I'm missing something.
> > I should also mention that after this happened the slave never correctly
> > registered with the master. I needed to manually restart it first.
> >
> > On Thu, Jun 13, 2013 at 7:37 PM, Benjamin Mahler <[email protected]> wrote:
> >
> > > I can see how this is misleading =/. The slave is not crashing; we've
> > > recently installed a signal handler that dumps a stack trace upon
> > > receiving _any_ signal. So these SIGTERMs look like crashes, but
> > > that's actually us dumping a stack trace manually. We may want to
> > > clean this up to only dump for interesting signals.
> > > The first step here would be to figure out what is issuing the
> > > SIGTERMs. Can you check out the auth logs of the machine?
> > >
> > > On Thu, Jun 13, 2013 at 6:36 PM, Brenden Matthews <[email protected]> wrote:
> > >
> > > > Okay, this problem is very strange, so I'm attaching the entire
> > > > slave log. This is happening often enough to be a nuisance. You can
> > > > see it's crashing (apparently due to SIGTERM? from what? I don't
> > > > know).
> > > >
> > > > On Thu, Jun 13, 2013 at 4:37 PM, Benjamin Mahler <[email protected]> wrote:
> > > >
> > > > > Good idea, we do the same here at twitter. Adding core dumps would
> > > > > be very useful for some signals, but there won't be a core dump
> > > > > for SIGTERM: http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7
> > > > >
> > > > > On Thu, Jun 13, 2013 at 3:29 PM, Brenden Matthews <[email protected]> wrote:
> > > > >
> > > > > > Hmm, I didn't even notice that it was SIGTERM. It seemed to stop
> > > > > > rather spontaneously, and then proceeded to crash several times
> > > > > > after (though I omitted that part of the log). The slaves are
> > > > > > started with runit, which (afaik) doesn't send SIGTERMs
> > > > > > randomly. It may have been something else.
> > > > > >
> > > > > > In the meantime, I've done 'sysctl kernel.core_uses_pid=1 &&
> > > > > > ulimit -c unlimited' to catch the core dumps.
> > > > > >
> > > > > > On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler <[email protected]> wrote:
> > > > > >
> > > > > > > The logs you provided show that a SIGTERM was received by the
> > > > > > > slave. Do you have something watching and sometimes killing
> > > > > > > your slaves? Did someone issue a kill on the process?
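As an aside, the `ulimit -c unlimited` step Brenden mentions can also be done from inside a process via `setrlimit`, the call `ulimit` wraps. A hedged Python sketch (illustrative only; the slave itself is C++, and `enable_core_dumps` is a made-up name):

```python
import resource

def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit -- the
    in-process equivalent of `ulimit -c unlimited`, bounded by the
    hard limit an unprivileged process is allowed to reach.

    The kernel.core_uses_pid sysctl, which appends the pid to core
    file names so repeated crashes don't overwrite each other's
    cores, still has to be set outside the process.

    Hypothetical helper; not a Mesos API."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return hard
```

Raising only the soft limit matters here: an unprivileged process may move its soft limit anywhere up to the hard limit, but cannot raise the hard limit itself.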
> > > > > > >
> > > > > > > On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hey guys,
> > > > > > > >
> > > > > > > > I was wondering how you typically debug slave crashes. I
> > > > > > > > frequently have slaves crash, and I've had limited luck with
> > > > > > > > collecting core dumps because of the frequency with which it
> > > > > > > > occurs (it usually crashes repeatedly, and the cores get
> > > > > > > > overwritten). I figured it would be quicker to just ask how
> > > > > > > > you've been doing it in the past, rather than trying to
> > > > > > > > reinvent the wheel.
> > > > > > > >
> > > > > > > > Here's a sample from the slave log that shows a crash:
> > > > > > > >
> > > > > > > > > I0613 00:11:18.625746 1200 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 1024 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:18.629904 1200 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 2621440000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:18.630421 1200 cgroups_isolator.cpp:1028] Started listening for OOM events for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:18.632015 1200 cgroups_isolator.cpp:560] Forked executor at 6789
> > > > > > > > > Fetching resources into '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f'
> > > > > > > > > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > Downloading resource from 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > HDFS command: hadoop fs -copyToLocal 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > I0613 00:11:26.225594 1200 slave.cpp:1412] Got registration for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:26.226132 1197 cgroups_isolator.cpp:667] Changing cgroup controls for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25; mem=21200; disk=45056; ports=[31000-31000, 31001-31001]
> > > > > > > > > I0613 00:11:26.226450 1200 slave.cpp:1527] Flushing queued task Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:26.227196 1197 cgroups_isolator.cpp:939] Updated 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:26.227856 1197 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 9472 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:26.228564 1197 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 22229811200 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:27.729706 1210 status_update_manager.cpp:290] Received status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with checkpoint=false
> > > > > > > > > I0613 00:11:27.730159 1210 status_update_manager.cpp:450] Creating StatusUpdate stream for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:27.730520 1210 status_update_manager.cpp:336] Forwarding status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to [email protected]:5050
> > > > > > > > > I0613 00:11:27.731010 1211 slave.cpp:1821] Sending acknowledgement for status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to executor(1)@10.17.178.97:58109
> > > > > > > > > I0613 00:11:27.735405 1211 status_update_manager.cpp:360] Received status update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:47.578543 1202 slave.cpp:2514] Current usage 16.54%. Max allowed age: 5.142519016183785days
> > > > > > > > > I0613 00:12:47.579776 1202 slave.cpp:2514] Current usage 16.90%. Max allowed age: 5.117054416194364days
> > > > > > > > > I0613 00:13:47.581866 1202 slave.cpp:2514] Current usage 17.29%. Max allowed age: 5.089524142974595days
> > > > > > > > > I0613 00:14:47.585180 1209 slave.cpp:2514] Current usage 17.61%. Max allowed age: 5.067278719463831days
> > > > > > > > > I0613 00:15:47.585908 1206 slave.cpp:2514] Current usage 17.87%. Max allowed age: 5.049103921517002days
> > > > > > > > > I0613 00:16:47.588693 1207 slave.cpp:2514] Current usage 17.97%. Max allowed age: 5.041750330234468days
> > > > > > > > > I0613 00:17:28.162240 1212 slave.cpp:1896] [email protected]:5050 exited
> > > > > > > > > W0613 00:17:28.162505 1212 slave.cpp:1899] Master disconnected! Waiting for a new master to be elected
> > > > > > > > > I0613 00:17:38.006042 1201 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 1 registered masters
> > > > > > > > > I0613 00:17:38.008129 1201 detector.cpp:467] Master detector (slave(1)@10.17.178.97:5051) got new master pid: [email protected]:5050
> > > > > > > > > I0613 00:17:38.008599 1201 slave.cpp:537] New master detected at [email protected]:5050
> > > > > > > > > I0613 00:17:38.009098 1198 status_update_manager.cpp:155] New master detected at [email protected]:5050
> > > > > > > > > I0613 00:17:39.059538 1203 slave.cpp:633] Re-registered with master [email protected]:5050
> > > > > > > > > I0613 00:17:39.059859 1203 slave.cpp:1294] Updating framework 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@10.17.184.87:57804
> > > > > > > > > I0613 00:17:46.057116 1202 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 2 registered masters
> > > > > > > > > I0613 00:17:47.590699 1200 slave.cpp:2514] Current usage 18.06%. Max allowed age: 5.036055129997118days
> > > > > > > > > I0613 00:18:47.592268 1201 slave.cpp:2514] Current usage 17.98%. Max allowed age: 5.041545235716157days
> > > > > > > > > I0613 00:19:47.596873 1204 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049504905524051days
> > > > > > > > > I0613 00:20:47.597520 1208 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049908279608947days
> > > > > > > > > I0613 00:21:47.598794 1206 slave.cpp:2514] Current usage 17.55%. Max allowed age: 5.071801618813565days
> > > > > > > > > I0613 00:22:47.599805 1202 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070852822503368days
> > > > > > > > > I0613 00:23:47.604342 1199 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920390650185days
> > > > > > > > > I0613 00:24:47.605106 1203 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920003070000days
> > > > > > > > > *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you are using GNU date ***
> > > > > > > > > PC: @ 0x7f53c240dd84 __pthread_cond_wait
> > > > > > > > > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID 1033; stack trace: ***
> > > > > > > > > @ 0x7f53c2411cb0 (unknown)
> > > > > > > > > @ 0x7f53c240dd84 __pthread_cond_wait
> > > > > > > > > @ 0x7f53c3088f03 (unknown)
> > > > > > > > > @ 0x7f53c308961f (unknown)
> > > > > > > > > @ 0x40c75a (unknown)
> > > > > > > > > @ 0x7f53c206476d (unknown)
> > > > > > > > > @ 0x40d511 (unknown)
> > > > > > > > > I0613 00:25:26.600497 9946 main.cpp:119] Creating "cgroups" isolator
> > > > > > > > > I0613 00:25:26.622987 9946 main.cpp:127] Build: 2013-05-09 22:53:54 by
> > > > > > > > > I0613 00:25:26.623016 9946 main.cpp:128] Starting Mesos slave
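For reference, the cgroups::kill problem described at the top of the thread comes down to a race: if the cgroup is not frozen first, tasks can exit (or fork) while the kill loop iterates, so a strict kill loop reports "No such process" as a failure. A minimal, hypothetical sketch of the relaxed behavior in Python (not Mesos code, which is C++; `kill_pids` is a made-up name):

```python
import os
import signal

def kill_pids(pids, signum=signal.SIGKILL):
    """Send `signum` to each pid, tolerating processes that are
    already gone. ESRCH ("No such process") is treated as success
    rather than an error, since an already-exited task is exactly
    the end state we want -- the strict version of this check is
    what produced the 'Failed to send Killed to process 27546:
    No such process' failure quoted above.

    Returns a list of (pid, error) pairs for genuine failures."""
    failed = []
    for pid in pids:
        try:
            os.kill(pid, signum)
        except ProcessLookupError:
            # Process already exited between listing and killing.
            continue
        except OSError as e:
            failed.append((pid, e))
    return failed
```

Freezing the cgroup before killing avoids the race entirely (no task can exit or fork while frozen); tolerating ESRCH is the fallback when the cgroup may be thawed, as discussed above.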
