Unfortunately I didn't see your comment in Review Board until after I came to the same conclusion myself :)
Anyway, ignoring ESRCH seems to be the trick, but I'll need further testing.
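
For reference, here's roughly what I mean by "ignoring ESRCH" (just a sketch, not the actual cgroups::kill code; the helper name and the tasks-file argument are made up):

    // Hypothetical helper, not the real cgroups::kill: signal every PID
    // listed in a cgroup's 'tasks' file and treat ESRCH as success, since
    // a process can exit on its own between the read and the kill when
    // the cgroup isn't frozen.
    #include <errno.h>
    #include <signal.h>
    #include <sys/types.h>

    #include <fstream>
    #include <string>

    bool killCgroupTasks(const std::string& tasksFile, int sig)
    {
      std::ifstream in(tasksFile.c_str());
      if (!in.is_open()) {
        return false;  // The cgroup may already be gone.
      }

      bool ok = true;
      pid_t pid;
      while (in >> pid) {
        if (::kill(pid, sig) == -1 && errno != ESRCH) {
          ok = false;  // Real failure (e.g. EPERM); ESRCH just means the
                       // process is already dead, which is what we wanted.
        }
      }
      return ok;
    }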

On Fri, Jun 14, 2013 at 10:07 AM, Benjamin Mahler <[email protected]> wrote:

> If we get a SIGTERM, the slave will terminate, and we added code to dump a
> stack trace on the way out, so that part is expected.
>
> I do see some failures as well:
>
> F0614 01:00:59.424217 29323 cgroups_isolator.cpp:1152] Failed to destroy cgroup mesos/framework_201306131743-1707151626-5050-13253-0001_executor_executor_Task_Tracker_70_tag_ab6a1b15-95f9-4a32-afd2-1e4b8ffb3b9b: Failed to kill tasks in nested cgroups: Collect failed: Failed to send Killed to process 27546: No such process
>
> I would suspect this is due to the kill code as well, the bug is in
> cgroups::kill, it currently assumes the cgroup is frozen! We'll have to
> relax that in your change (I'll comment there).
>
> The fact that the slave never correctly registered is also interesting,
> glad to help debug that further!
>
> On Thu, Jun 13, 2013 at 8:37 PM, Brenden Matthews <[email protected]> wrote:
>
> > PS: I think it was crashing due to the cgroup killTasks() change I made.
> > Maybe it was because of SIGCONT? I'll have to investigate this further.
> >
> > On Thu, Jun 13, 2013 at 8:10 PM, Brenden Matthews <[email protected]> wrote:
> >
> > > But shortly after the first 'incident', it says:
> > >
> > > I0614 01:00:59.320652 29256 main.cpp:119] Creating "cgroups" isolator
> > > I0614 01:00:59.321116 29256 main.cpp:127] Build: 2013-05-01 18:49:40 by
> > > I0614 01:00:59.321148 29256 main.cpp:128] Starting Mesos slave
> > >
> > > This looks like some kind of crash to me, unless I'm missing something. I
> > > should also mention that after this happened the slave never correctly
> > > registered with the master. I needed to manually restart it first.
> > >
> > > On Thu, Jun 13, 2013 at 7:37 PM, Benjamin Mahler <[email protected]> wrote:
> > >
> > > > I can see how this is misleading =/. The slave is not crashing, we've
> > > > recently installed a signal handler that dumps a stacktrace upon receiving
> > > > _any_ signal. So these SIGTERMs look like crashes, but that's actually us
> > > > dumping a stacktrace manually. We may want to clean this up to only dump
> > > > for interesting signals.
> > > >
> > > > The first step here would be to figure out what is issuing the SIGTERMs,
> > > > can you check out the auth logs of the machine?
> > > >
> > > > On Thu, Jun 13, 2013 at 6:36 PM, Brenden Matthews <[email protected]> wrote:
> > > >
> > > > > Okay, this problem is very strange so I'm attaching the entire slave log.
> > > > > This is happening often enough to be a nuisance. You can see it's
> > > > > crashing (apparently due to SIGTERM? from what? I don't know).
> > > > >
> > > > > On Thu, Jun 13, 2013 at 4:37 PM, Benjamin Mahler <[email protected]> wrote:
> > > > >
> > > > > > Good idea, we do the same here at twitter. Adding core dumps would be
> > > > > > very useful for some signals, but there won't be a core dump for SIGTERM:
> > > > > > http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7
> > > > > >
> > > > > > On Thu, Jun 13, 2013 at 3:29 PM, Brenden Matthews <[email protected]> wrote:
> > > > > >
> > > > > > > Hmm I didn't even notice that it was SIGTERM. It seemed to stop rather
> > > > > > > spontaneously, and then proceeded to crash several times after (though I
> > > > > > > omitted that part of the log). The slaves are started with runit, which
> > > > > > > (afaik) doesn't send SIGTERMs randomly. It may have been something else.
> > > > > > >
> > > > > > > In the mean time, I've done 'sysctl kernel.core_uses_pid=1 && ulimit -c
> > > > > > > unlimited' to catch the core dumps.
> > > > > > >
> > > > > > > On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler <[email protected]> wrote:
> > > > > > >
> > > > > > > > The logs you provided show that a SIGTERM was received by the slave. Do
> > > > > > > > you have something watching and sometimes killing your slaves? Did someone
> > > > > > > > issue a kill on the process?
> > > > > > > >
> > > > > > > > On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hey guys,
> > > > > > > > >
> > > > > > > > > I was wondering how you typically debug slave crashes. I frequently have
> > > > > > > > > slaves crash, and I've had limited luck with collecting core dumps because
> > > > > > > > > of the frequency with which it occurs (it usually crashes repeatedly, and
> > > > > > > > > the cores get overwritten). I figured it would be quicker to just ask how
> > > > > > > > > you've been doing it in the past, rather than trying to reinvent the wheel.
> > > > > > > > >
> > > > > > > > > Here's a sample from the slave log that shows a crash:
> > > > > > > > >
> > > > > > > > > > I0613 00:11:18.625746 1200 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 1024 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:18.629904 1200 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 2621440000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:18.630421 1200 cgroups_isolator.cpp:1028] Started listening for OOM events for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:18.632015 1200 cgroups_isolator.cpp:560] Forked executor at = 6789
> > > > > > > > > > Fetching resources into '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f'
> > > > > > > > > > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > > Downloading resource from 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > > HDFS command: hadoop fs -copyToLocal 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > > I0613 00:11:26.225594 1200 slave.cpp:1412] Got registration for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:26.226132 1197 cgroups_isolator.cpp:667] Changing cgroup controls for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25; mem=21200; disk=45056; ports=[31000-31000, 31001-31001]
> > > > > > > > > > I0613 00:11:26.226450 1200 slave.cpp:1527] Flushing queued task Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:26.227196 1197 cgroups_isolator.cpp:939] Updated 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:26.227856 1197 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 9472 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:26.228564 1197 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 22229811200 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:27.729706 1210 status_update_manager.cpp:290] Received status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with checkpoint=false
> > > > > > > > > > I0613 00:11:27.730159 1210 status_update_manager.cpp:450] Creating StatusUpdate stream for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:27.730520 1210 status_update_manager.cpp:336] Forwarding status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to [email protected]:5050
> > > > > > > > > > I0613 00:11:27.731010 1211 slave.cpp:1821] Sending acknowledgement for status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to executor(1)@10.17.178.97:58109
> > > > > > > > > > I0613 00:11:27.735405 1211 status_update_manager.cpp:360] Received status update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:47.578543 1202 slave.cpp:2514] Current usage 16.54%. Max allowed age: 5.142519016183785days
> > > > > > > > > > I0613 00:12:47.579776 1202 slave.cpp:2514] Current usage 16.90%. Max allowed age: 5.117054416194364days
> > > > > > > > > > I0613 00:13:47.581866 1202 slave.cpp:2514] Current usage 17.29%. Max allowed age: 5.089524142974595days
> > > > > > > > > > I0613 00:14:47.585180 1209 slave.cpp:2514] Current usage 17.61%. Max allowed age: 5.067278719463831days
> > > > > > > > > > I0613 00:15:47.585908 1206 slave.cpp:2514] Current usage 17.87%. Max allowed age: 5.049103921517002days
> > > > > > > > > > I0613 00:16:47.588693 1207 slave.cpp:2514] Current usage 17.97%. Max allowed age: 5.041750330234468days
> > > > > > > > > > I0613 00:17:28.162240 1212 slave.cpp:1896] [email protected]:5050 exited
> > > > > > > > > > W0613 00:17:28.162505 1212 slave.cpp:1899] Master disconnected! Waiting for a new master to be elected
> > > > > > > > > > I0613 00:17:38.006042 1201 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 1 registered masters
> > > > > > > > > > I0613 00:17:38.008129 1201 detector.cpp:467] Master detector (slave(1)@10.17.178.97:5051) got new master pid: [email protected]:5050
> > > > > > > > > > I0613 00:17:38.008599 1201 slave.cpp:537] New master detected at [email protected]:5050
> > > > > > > > > > I0613 00:17:38.009098 1198 status_update_manager.cpp:155] New master detected at [email protected]:5050
> > > > > > > > > > I0613 00:17:39.059538 1203 slave.cpp:633] Re-registered with master [email protected]:5050
> > > > > > > > > > I0613 00:17:39.059859 1203 slave.cpp:1294] Updating framework 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@10.17.184.87:57804
> > > > > > > > > > I0613 00:17:46.057116 1202 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 2 registered masters
> > > > > > > > > > I0613 00:17:47.590699 1200 slave.cpp:2514] Current usage 18.06%. Max allowed age: 5.036055129997118days
> > > > > > > > > > I0613 00:18:47.592268 1201 slave.cpp:2514] Current usage 17.98%. Max allowed age: 5.041545235716157days
> > > > > > > > > > I0613 00:19:47.596873 1204 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049504905524051days
> > > > > > > > > > I0613 00:20:47.597520 1208 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049908279608947days
> > > > > > > > > > I0613 00:21:47.598794 1206 slave.cpp:2514] Current usage 17.55%. Max allowed age: 5.071801618813565days
> > > > > > > > > > I0613 00:22:47.599805 1202 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070852822503368days
> > > > > > > > > > I0613 00:23:47.604342 1199 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920390650185days
> > > > > > > > > > I0613 00:24:47.605106 1203 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920003070000days
> > > > > > > > > > *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you are using GNU date ***
> > > > > > > > > > PC: @ 0x7f53c240dd84 __pthread_cond_wait
> > > > > > > > > > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID 1033; stack trace: ***
> > > > > > > > > > @ 0x7f53c2411cb0 (unknown)
> > > > > > > > > > @ 0x7f53c240dd84 __pthread_cond_wait
> > > > > > > > > > @ 0x7f53c3088f03 (unknown)
> > > > > > > > > > @ 0x7f53c308961f (unknown)
> > > > > > > > > > @ 0x40c75a (unknown)
> > > > > > > > > > @ 0x7f53c206476d (unknown)
> > > > > > > > > > @ 0x40d511 (unknown)
> > > > > > > > > > I0613 00:25:26.600497 9946 main.cpp:119] Creating "cgroups" isolator
> > > > > > > > > > I0613 00:25:26.622987 9946 main.cpp:127] Build: 2013-05-09 22:53:54 by
> > > > > > > > > > I0613 00:25:26.623016 9946 main.cpp:128] Starting Mesos slave
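
PS: a footnote on the "dumps a stacktrace upon receiving _any_ signal" behaviour quoted above, since it confused me at first: a handler along these lines (a rough sketch, not Mesos's actual handler) is enough to make a routine SIGTERM print the kind of trace that shows up in the log, so it looks like a crash even when it isn't one.

    /* Sketch: print a stack trace for any signal this handler is
     * installed for, then re-raise with the default action. */
    #include <execinfo.h>
    #include <signal.h>
    #include <unistd.h>

    static void traceHandler(int sig)
    {
      void* frames[64];
      int n = backtrace(frames, 64);
      /* backtrace_symbols_fd writes straight to the fd, avoiding malloc. */
      backtrace_symbols_fd(frames, n, STDERR_FILENO);
      /* Restore the default action so the process still terminates. */
      signal(sig, SIG_DFL);
      raise(sig);
    }

    int main(void)
    {
      signal(SIGTERM, traceHandler);
      pause();  /* Send SIGTERM to this PID to see the trace. */
      return 0;
    }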
