I can see how this is misleading =/. The slave is not crashing; we recently
installed a signal handler that dumps a stack trace upon receiving _any_
signal. These SIGTERMs therefore look like crashes, but that's actually just
us dumping a stack trace manually. We may want to clean this up so that we
only dump for interesting signals.
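For illustration only (this is not the actual Mesos handler, and all names
here are hypothetical): restricting the dump to crash signals could look
roughly like the sketch below, which uses glibc's backtrace() and leaves
SIGTERM at its default disposition so an ordinary kill no longer looks like
a crash.

```cpp
// Hypothetical sketch: install a stack-trace-dumping handler only for
// signals that indicate a genuine crash. SIGTERM is deliberately left
// at its default disposition.
#include <csignal>
#include <cstdio>
#include <execinfo.h>  // backtrace(), backtrace_symbols_fd(): glibc-specific

namespace {

void dumpStackTrace(int signum) {
  void* frames[64];
  int depth = backtrace(frames, 64);
  fprintf(stderr, "*** Signal %d received; stack trace: ***\n", signum);
  backtrace_symbols_fd(frames, depth, 2);  // write symbolized frames to stderr

  // Restore the default handler and re-raise, so the process still dies
  // (and can leave a core dump where the default action produces one).
  signal(signum, SIG_DFL);
  raise(signum);
}

}  // namespace

void installFatalSignalHandlers() {
  // Only "interesting" signals; notably, SIGTERM is excluded.
  const int fatal[] = {SIGSEGV, SIGABRT, SIGBUS, SIGILL, SIGFPE};
  for (int s : fatal) {
    signal(s, dumpStackTrace);
  }
}
```

Since the handler re-raises with the default disposition, signals like
SIGSEGV and SIGABRT still terminate the process after the trace is printed.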

The first step here would be to figure out what is issuing the SIGTERMs.
Can you check the auth logs of the machine?
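For reference, a few places worth looking (the log path is a
distro-dependent assumption: /var/log/auth.log on Debian/Ubuntu,
/var/log/secure on RHEL/CentOS):

```shell
# Hypothetical sketch: narrow down who or what sent the SIGTERM.
AUTH_LOG=${AUTH_LOG:-/var/log/auth.log}

# Kills issued interactively through sudo are recorded in the auth log:
grep -i 'kill' "$AUTH_LOG" || true

# Sessions opened around the time of the SIGTERM (00:25 on Jun 13 here):
grep 'Jun 13 00:2' "$AUTH_LOG" || true

# The kernel OOM killer sends SIGKILL rather than SIGTERM, but it's worth
# ruling out; its activity shows up in the kernel ring buffer:
dmesg | grep -i 'kill' || true
```

If nothing turns up there, process supervisors (runit, monit, etc.) and
cron jobs are the next suspects, since the stack trace shows the signal
came from another local PID.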


On Thu, Jun 13, 2013 at 6:36 PM, Brenden Matthews <
[email protected]> wrote:

> Okay, this problem is very strange, so I'm attaching the entire slave log.
> This is happening often enough to be a nuisance. You can see it's crashing
> (apparently due to SIGTERM? From what? I don't know).
>
>
> On Thu, Jun 13, 2013 at 4:37 PM, Benjamin Mahler <
> [email protected]> wrote:
>
>> Good idea, we do the same here at twitter. Adding core dumps would be very
>> useful for some signals, but there won't be a core dump for SIGTERM:
>> http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7
>>
>>
>> On Thu, Jun 13, 2013 at 3:29 PM, Brenden Matthews <
>> [email protected]> wrote:
>>
>> > Hmm, I didn't even notice that it was SIGTERM. It seemed to stop rather
>> > spontaneously, and then proceeded to crash several times afterwards
>> > (though I omitted that part of the log). The slaves are started with
>> > runit, which (afaik) doesn't send SIGTERMs randomly. It may have been
>> > something else.
>> >
>> > In the meantime, I've done 'sysctl kernel.core_uses_pid=1 && ulimit -c
>> > unlimited' to catch the core dumps.
>> >
>> > On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler
>> > <[email protected]>wrote:
>> >
>> > > The logs you provided show that a SIGTERM was received by the slave.
>> > > Do you have something watching and sometimes killing your slaves? Did
>> > > someone issue a kill on the process?
>> > >
>> > >
>> > > On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews <
>> > > [email protected]> wrote:
>> > >
>> > > > Hey guys,
>> > > >
>> > > > I was wondering how you typically debug slave crashes. I frequently
>> > > > have slaves crash, and I've had limited luck with collecting core
>> > > > dumps because of the frequency with which it occurs (it usually
>> > > > crashes repeatedly, and the cores get overwritten). I figured it
>> > > > would be quicker to just ask how you've been doing it in the past,
>> > > > rather than trying to reinvent the wheel.
>> > > >
>> > > > Here's a sample from the slave log that shows a crash:
>> > > >
>> > > > > I0613 00:11:18.625746  1200 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 1024 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:18.629904  1200 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 2621440000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:18.630421  1200 cgroups_isolator.cpp:1028] Started listening for OOM events for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:18.632015  1200 cgroups_isolator.cpp:560] Forked executor at = 6789
>> > > > > Fetching resources into '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f'
>> > > > > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>> > > > > Downloading resource from 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>> > > > > HDFS command: hadoop fs -copyToLocal 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>> > > > > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>> > > > > I0613 00:11:26.225594  1200 slave.cpp:1412] Got registration for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:26.226132  1197 cgroups_isolator.cpp:667] Changing cgroup controls for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25; mem=21200; disk=45056; ports=[31000-31000, 31001-31001]
>> > > > > I0613 00:11:26.226450  1200 slave.cpp:1527] Flushing queued task Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:26.227196  1197 cgroups_isolator.cpp:939] Updated 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:26.227856  1197 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 9472 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:26.228564  1197 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 22229811200 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:27.729706  1210 status_update_manager.cpp:290] Received status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with checkpoint=false
>> > > > > I0613 00:11:27.730159  1210 status_update_manager.cpp:450] Creating StatusUpdate stream for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:27.730520  1210 status_update_manager.cpp:336] Forwarding status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to [email protected]:5050
>> > > > > I0613 00:11:27.731010  1211 slave.cpp:1821] Sending acknowledgement for status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to executor(1)@10.17.178.97:58109
>> > > > > I0613 00:11:27.735405  1211 status_update_manager.cpp:360] Received status update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>> > > > > I0613 00:11:47.578543  1202 slave.cpp:2514] Current usage 16.54%. Max allowed age: 5.142519016183785days
>> > > > > I0613 00:12:47.579776  1202 slave.cpp:2514] Current usage 16.90%. Max allowed age: 5.117054416194364days
>> > > > > I0613 00:13:47.581866  1202 slave.cpp:2514] Current usage 17.29%. Max allowed age: 5.089524142974595days
>> > > > > I0613 00:14:47.585180  1209 slave.cpp:2514] Current usage 17.61%. Max allowed age: 5.067278719463831days
>> > > > > I0613 00:15:47.585908  1206 slave.cpp:2514] Current usage 17.87%. Max allowed age: 5.049103921517002days
>> > > > > I0613 00:16:47.588693  1207 slave.cpp:2514] Current usage 17.97%. Max allowed age: 5.041750330234468days
>> > > > > I0613 00:17:28.162240  1212 slave.cpp:1896] [email protected]:5050 exited
>> > > > > W0613 00:17:28.162505  1212 slave.cpp:1899] Master disconnected! Waiting for a new master to be elected
>> > > > > I0613 00:17:38.006042  1201 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051)  found 1 registered masters
>> > > > > I0613 00:17:38.008129  1201 detector.cpp:467] Master detector (slave(1)@10.17.178.97:5051)  got new master pid: [email protected]:5050
>> > > > > I0613 00:17:38.008599  1201 slave.cpp:537] New master detected at [email protected]:5050
>> > > > > I0613 00:17:38.009098  1198 status_update_manager.cpp:155] New master detected at [email protected]:5050
>> > > > > I0613 00:17:39.059538  1203 slave.cpp:633] Re-registered with master [email protected]:5050
>> > > > > I0613 00:17:39.059859  1203 slave.cpp:1294] Updating framework 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@10.17.184.87:57804
>> > > > > I0613 00:17:46.057116  1202 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051)  found 2 registered masters
>> > > > > I0613 00:17:47.590699  1200 slave.cpp:2514] Current usage 18.06%. Max allowed age: 5.036055129997118days
>> > > > > I0613 00:18:47.592268  1201 slave.cpp:2514] Current usage 17.98%. Max allowed age: 5.041545235716157days
>> > > > > I0613 00:19:47.596873  1204 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049504905524051days
>> > > > > I0613 00:20:47.597520  1208 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049908279608947days
>> > > > > I0613 00:21:47.598794  1206 slave.cpp:2514] Current usage 17.55%. Max allowed age: 5.071801618813565days
>> > > > > I0613 00:22:47.599805  1202 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070852822503368days
>> > > > > I0613 00:23:47.604342  1199 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920390650185days
>> > > > > I0613 00:24:47.605106  1203 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920003070000days
>> > > > > *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you are using GNU date ***
>> > > > > PC: @     0x7f53c240dd84 __pthread_cond_wait
>> > > > > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID 1033; stack trace: ***
>> > > > >     @     0x7f53c2411cb0 (unknown)
>> > > > >     @     0x7f53c240dd84 __pthread_cond_wait
>> > > > >     @     0x7f53c3088f03 (unknown)
>> > > > >     @     0x7f53c308961f (unknown)
>> > > > >     @           0x40c75a (unknown)
>> > > > >     @     0x7f53c206476d (unknown)
>> > > > >     @           0x40d511 (unknown)
>> > > > > I0613 00:25:26.600497  9946 main.cpp:119] Creating "cgroups" isolator
>> > > > > I0613 00:25:26.622987  9946 main.cpp:127] Build: 2013-05-09 22:53:54 by
>> > > > > I0613 00:25:26.623016  9946 main.cpp:128] Starting Mesos slave
>> > > >
>> > >
>> >
>>
>
>
