Hmm, I didn't even notice that it was a SIGTERM. The slave seemed to stop rather spontaneously, and then proceeded to crash several more times afterwards (though I omitted that part of the log). The slaves are started with runit, which (AFAIK) doesn't send SIGTERMs at random. It may have been something else.
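Since the signal was easy to miss, here is a minimal sketch for pulling the relevant fields out of the two "***" lines in the quoted log: the signal name, the PID that received it, the PID that sent it, and the abort time. The regular expressions and the function names (`parse_signal_line`, `parse_abort_time`) are my own and hypothetical; the patterns are inferred only from the lines in this one log, so treat them as an assumption rather than a documented format.

```python
import re
from datetime import datetime, timezone

# Patterns inferred from the two "***" lines in the quoted slave log.
# They are an assumption based on that sample, not a documented format.
SIGNAL_RE = re.compile(
    r"\*\*\* (?P<signal>SIG[A-Z0-9]+) \(@0x[0-9a-fA-F]+\) "
    r"received by PID (?P<pid>\d+) \(TID 0x[0-9a-fA-F]+\) "
    r"from PID (?P<sender>\d+)"
)
ABORT_RE = re.compile(r"\*\*\* Aborted at (?P<ts>\d+) \(unix time\)")


def parse_signal_line(line):
    """Return signal name, receiving PID and sending PID, or None if no match."""
    m = SIGNAL_RE.search(line)
    if m is None:
        return None
    return {
        "signal": m.group("signal"),
        "pid": int(m.group("pid")),
        "sender": int(m.group("sender")),
    }


def parse_abort_time(line):
    """Return the abort time as an aware UTC datetime, or None if no match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    return datetime.fromtimestamp(int(m.group("ts")), tz=timezone.utc)


# The two lines below are copied from the quoted log.
SIGNAL_LINE = (
    "*** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) "
    "from PID 1033; stack trace: ***"
)
ABORT_LINE = (
    '*** Aborted at 1371083126 (unix time) try "date -d @1371083126" '
    "if you are using GNU date ***"
)

if __name__ == "__main__":
    print(parse_signal_line(SIGNAL_LINE))
    print(parse_abort_time(ABORT_LINE).isoformat())
```

With the sender PID in hand, something like `ps -o pid,ppid,args -p 1033` on the slave host should show which process it is, provided that process is still running.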
In the meantime, I've run 'sysctl kernel.core_uses_pid=1 && ulimit -c unlimited' to catch the core dumps.

On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler <[email protected]> wrote:

> The logs you provided show that a SIGTERM was received by the slave. Do you have something watching and sometimes killing your slaves? Did someone issue a kill on the process?
>
> On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews <[email protected]> wrote:
>
> > Hey guys,
> >
> > I was wondering how you typically debug slave crashes. I frequently have slaves crash, and I've had limited luck with collecting core dumps because of the frequency with which it occurs (it usually crashes repeatedly, and the cores get overwritten). I figured it would be quicker to just ask how you've been doing it in the past, rather than trying to reinvent the wheel.
> >
> > Here's a sample from the slave log that shows a crash:
> >
> > > I0613 00:11:18.625746 1200 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 1024 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:18.629904 1200 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 2621440000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:18.630421 1200 cgroups_isolator.cpp:1028] Started listening for OOM events for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:18.632015 1200 cgroups_isolator.cpp:560] Forked executor at = 6789
> > > Fetching resources into '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f'
> > > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > Downloading resource from 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > HDFS command: hadoop fs -copyToLocal 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > I0613 00:11:26.225594 1200 slave.cpp:1412] Got registration for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:26.226132 1197 cgroups_isolator.cpp:667] Changing cgroup controls for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25; mem=21200; disk=45056; ports=[31000-31000, 31001-31001]
> > > I0613 00:11:26.226450 1200 slave.cpp:1527] Flushing queued task Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:26.227196 1197 cgroups_isolator.cpp:939] Updated 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:26.227856 1197 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 9472 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:26.228564 1197 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 22229811200 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:27.729706 1210 status_update_manager.cpp:290] Received status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with checkpoint=false
> > > I0613 00:11:27.730159 1210 status_update_manager.cpp:450] Creating StatusUpdate stream for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:27.730520 1210 status_update_manager.cpp:336] Forwarding status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to [email protected]:5050
> > > I0613 00:11:27.731010 1211 slave.cpp:1821] Sending acknowledgement for status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to executor(1)@10.17.178.97:58109
> > > I0613 00:11:27.735405 1211 status_update_manager.cpp:360] Received status update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > I0613 00:11:47.578543 1202 slave.cpp:2514] Current usage 16.54%. Max allowed age: 5.142519016183785days
> > > I0613 00:12:47.579776 1202 slave.cpp:2514] Current usage 16.90%. Max allowed age: 5.117054416194364days
> > > I0613 00:13:47.581866 1202 slave.cpp:2514] Current usage 17.29%. Max allowed age: 5.089524142974595days
> > > I0613 00:14:47.585180 1209 slave.cpp:2514] Current usage 17.61%. Max allowed age: 5.067278719463831days
> > > I0613 00:15:47.585908 1206 slave.cpp:2514] Current usage 17.87%. Max allowed age: 5.049103921517002days
> > > I0613 00:16:47.588693 1207 slave.cpp:2514] Current usage 17.97%. Max allowed age: 5.041750330234468days
> > > I0613 00:17:28.162240 1212 slave.cpp:1896] [email protected]:5050 exited
> > > W0613 00:17:28.162505 1212 slave.cpp:1899] Master disconnected! Waiting for a new master to be elected
> > > I0613 00:17:38.006042 1201 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 1 registered masters
> > > I0613 00:17:38.008129 1201 detector.cpp:467] Master detector (slave(1)@10.17.178.97:5051) got new master pid: [email protected]:5050
> > > I0613 00:17:38.008599 1201 slave.cpp:537] New master detected at [email protected]:5050
> > > I0613 00:17:38.009098 1198 status_update_manager.cpp:155] New master detected at [email protected]:5050
> > > I0613 00:17:39.059538 1203 slave.cpp:633] Re-registered with master [email protected]:5050
> > > I0613 00:17:39.059859 1203 slave.cpp:1294] Updating framework 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@10.17.184.87:57804
> > > I0613 00:17:46.057116 1202 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 2 registered masters
> > > I0613 00:17:47.590699 1200 slave.cpp:2514] Current usage 18.06%. Max allowed age: 5.036055129997118days
> > > I0613 00:18:47.592268 1201 slave.cpp:2514] Current usage 17.98%. Max allowed age: 5.041545235716157days
> > > I0613 00:19:47.596873 1204 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049504905524051days
> > > I0613 00:20:47.597520 1208 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049908279608947days
> > > I0613 00:21:47.598794 1206 slave.cpp:2514] Current usage 17.55%. Max allowed age: 5.071801618813565days
> > > I0613 00:22:47.599805 1202 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070852822503368days
> > > I0613 00:23:47.604342 1199 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920390650185days
> > > I0613 00:24:47.605106 1203 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920003070000days
> > > *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you are using GNU date ***
> > > PC: @ 0x7f53c240dd84 __pthread_cond_wait
> > > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID 1033; stack trace: ***
> > > @ 0x7f53c2411cb0 (unknown)
> > > @ 0x7f53c240dd84 __pthread_cond_wait
> > > @ 0x7f53c3088f03 (unknown)
> > > @ 0x7f53c308961f (unknown)
> > > @ 0x40c75a (unknown)
> > > @ 0x7f53c206476d (unknown)
> > > @ 0x40d511 (unknown)
> > > I0613 00:25:26.600497 9946 main.cpp:119] Creating "cgroups" isolator
> > > I0613 00:25:26.622987 9946 main.cpp:127] Build: 2013-05-09 22:53:54 by
> > > I0613 00:25:26.623016 9946 main.cpp:128] Starting Mesos slave
