Hmm I didn't even notice that it was SIGTERM.  It seemed to stop rather
spontaneously, and then proceeded to crash several times after (though I
omitted that part of the log).  The slaves are started with runit, which
(afaik) doesn't send SIGTERMs randomly. It may have been something else.

In the mean time, I've done 'sysctl kernel.core_uses_pid=1 && ulimit -c
unlimited' to catch the core dumps.

On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler
<[email protected]>wrote:

> The logs you provided show that a SIGTERM was received by the slave. Do you
> have something watching and sometimes killing your slaves? Did someone
> issue a kill on the process?
>
>
> On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews <
> [email protected]> wrote:
>
> > Hey guys,
> >
> > I was wondering how you typically debug slave crashes.  I frequently have
> > slaves crash, and I've had limited luck with collecting core dumps
> because
> > of the frequency with which it occurs (it usually crashes repeatedly, and
> > the cores get overwritten).  I figured it would be quicker to just ask
> how
> > you've been doing it in the past, rather than trying to reinvent the
> wheel.
> >
> > Here's a sample from the slave log that shows a crash:
> >
> > I0613 00:11:18.625746  1200 cgroups_isolator.cpp:864] Updated
> 'cpu.shares'
> > > to 1024 for executor executor_Task_Tracker_129 of framework 20
> > > 1306122129-1707151626-5050-5724-0000
> > > I0613 00:11:18.629904  1200 cgroups_isolator.cpp:1002] Updated
> > > 'memory.limit_in_bytes' to 2621440000 for executor
> > executor_Task_Tracker_1
> > > 29 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:18.630421  1200 cgroups_isolator.cpp:1028] Started
> listening
> > > for OOM events for executor executor_Task_Tracker_129 of framewo
> > > rk 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:18.632015  1200 cgroups_isolator.cpp:560] Forked executor
> at
> > =
> > > 6789
> > > Fetching resources into
> > >
> >
> '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f'
> > > Fetching resource
> > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > Downloading resource from
> > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > HDFS command: hadoop fs -copyToLocal
> > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > I0613 00:11:26.225594  1200 slave.cpp:1412] Got registration for
> executor
> > > 'executor_Task_Tracker_129' of framework
> > > 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:26.226132  1197 cgroups_isolator.cpp:667] Changing cgroup
> > > controls for executor executor_Task_Tracker_129 of framework
> > > 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25;
> > mem=21200;
> > > disk=45056; ports=[31000-31000, 31001-31001]
> > > I0613 00:11:26.226450  1200 slave.cpp:1527] Flushing queued task
> > > Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework
> > > 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:26.227196  1197 cgroups_isolator.cpp:939] Updated
> > > 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for
> > executor
> > > executor_Task_Tracker_129 of framework
> > > 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:26.227856  1197 cgroups_isolator.cpp:864] Updated
> > 'cpu.shares'
> > > to 9472 for executor executor_Task_Tracker_129 of framework
> > > 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:26.228564  1197 cgroups_isolator.cpp:1002] Updated
> > > 'memory.limit_in_bytes' to 22229811200 for executor
> > > executor_Task_Tracker_129 of framework
> > > 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:27.729706  1210 status_update_manager.cpp:290] Received
> > status
> > > update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for
> task
> > > Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> with
> > > checkpoint=false
> > > I0613 00:11:27.730159  1210 status_update_manager.cpp:450] Creating
> > > StatusUpdate stream for task Task_Tracker_129 of framework
> > > 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:27.730520  1210 status_update_manager.cpp:336] Forwarding
> > > status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab)
> > for
> > > task Task_Tracker_129 of framework
> 201306122129-1707151626-5050-5724-0000
> > > to [email protected]:5050
> > > I0613 00:11:27.731010  1211 slave.cpp:1821] Sending acknowledgement for
> > > status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab)
> > for
> > > task Task_Tracker_129 of framework
> 201306122129-1707151626-5050-5724-0000
> > > to executor(1)@10.17.178.97:58109
> > > I0613 00:11:27.735405  1211 status_update_manager.cpp:360] Received
> > status
> > > update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task
> > > Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:47.578543  1202 slave.cpp:2514] Current usage 16.54%. Max
> > > allowed age: 5.142519016183785days
> > > I0613 00:12:47.579776  1202 slave.cpp:2514] Current usage 16.90%. Max
> > > allowed age: 5.117054416194364days
> > > I0613 00:13:47.581866  1202 slave.cpp:2514] Current usage 17.29%. Max
> > > allowed age: 5.089524142974595days
> > > I0613 00:14:47.585180  1209 slave.cpp:2514] Current usage 17.61%. Max
> > > allowed age: 5.067278719463831days
> > > I0613 00:15:47.585908  1206 slave.cpp:2514] Current usage 17.87%. Max
> > > allowed age: 5.049103921517002days
> > > I0613 00:16:47.588693  1207 slave.cpp:2514] Current usage 17.97%. Max
> > > allowed age: 5.041750330234468days
> > > I0613 00:17:28.162240  1212 slave.cpp:1896] [email protected]:5050
> > exited
> > > W0613 00:17:28.162505  1212 slave.cpp:1899] Master disconnected!
> Waiting
> > > for a new master to be elected
> > > I0613 00:17:38.006042  1201 detector.cpp:420] Master detector
> (slave(1)@
> > > 10.17.178.97:5051)  found 1 registered masters
> > > I0613 00:17:38.008129  1201 detector.cpp:467] Master detector
> (slave(1)@
> > > 10.17.178.97:5051)  got new master pid: [email protected]:5050
> > > I0613 00:17:38.008599  1201 slave.cpp:537] New master detected at
> > > [email protected]:5050
> > > I0613 00:17:38.009098  1198 status_update_manager.cpp:155] New master
> > > detected at [email protected]:5050
> > > I0613 00:17:39.059538  1203 slave.cpp:633] Re-registered with master
> > > [email protected]:5050
> > > I0613 00:17:39.059859  1203 slave.cpp:1294] Updating framework
> > > 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@
> > > 10.17.184.87:57804
> > > I0613 00:17:46.057116  1202 detector.cpp:420] Master detector
> (slave(1)@
> > > 10.17.178.97:5051)  found 2 registered masters
> > > I0613 00:17:47.590699  1200 slave.cpp:2514] Current usage 18.06%. Max
> > > allowed age: 5.036055129997118days
> > > I0613 00:18:47.592268  1201 slave.cpp:2514] Current usage 17.98%. Max
> > > allowed age: 5.041545235716157days
> > > I0613 00:19:47.596873  1204 slave.cpp:2514] Current usage 17.86%. Max
> > > allowed age: 5.049504905524051days
> > > I0613 00:20:47.597520  1208 slave.cpp:2514] Current usage 17.86%. Max
> > > allowed age: 5.049908279608947days
> > > I0613 00:21:47.598794  1206 slave.cpp:2514] Current usage 17.55%. Max
> > > allowed age: 5.071801618813565days
> > > I0613 00:22:47.599805  1202 slave.cpp:2514] Current usage 17.56%. Max
> > > allowed age: 5.070852822503368days
> > > I0613 00:23:47.604342  1199 slave.cpp:2514] Current usage 17.56%. Max
> > > allowed age: 5.070920390650185days
> > > I0613 00:24:47.605106  1203 slave.cpp:2514] Current usage 17.56%. Max
> > > allowed age: 5.070920003070000days
> > > *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you
> > are
> > > using GNU date ***
> > > PC: @     0x7f53c240dd84 __pthread_cond_wait
> > > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID
> > > 1033; stack trace: ***
> > >     @     0x7f53c2411cb0 (unknown)
> > >     @     0x7f53c240dd84 __pthread_cond_wait
> > >     @     0x7f53c3088f03 (unknown)
> > >     @     0x7f53c308961f (unknown)
> > >     @           0x40c75a (unknown)
> > >     @     0x7f53c206476d (unknown)
> > >     @           0x40d511 (unknown)
> > > I0613 00:25:26.600497  9946 main.cpp:119] Creating "cgroups" isolator
> > > I0613 00:25:26.622987  9946 main.cpp:127] Build: 2013-05-09 22:53:54 by
> > > I0613 00:25:26.623016  9946 main.cpp:128] Starting Mesos slave
> >
>

Reply via email to