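Good idea, we do the same here at Twitter. Core dumps would be very useful for signals whose default action is to dump core (SIGSEGV, SIGABRT, and so on), but there won't be a core dump for SIGTERM, because its default action is to terminate the process without dumping core: http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7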
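
To illustrate the point about SIGTERM, here is a minimal Python sketch (not anything Mesos-specific) that kills a throwaway child process with SIGTERM and with SIGABRT and reports whether the kernel flagged a core dump. The helper name `kill_child_with` is made up for this example, and it assumes a Unix system where the parent has not installed its own handlers for these signals. Whether SIGABRT actually produces a core depends on the local `ulimit -c` and core pattern settings, so only the SIGTERM result is predictable.

```python
import os
import resource
import signal
import time


def kill_child_with(sig):
    """Fork a sleeping child, send it `sig`, and report how it died."""
    pid = os.fork()
    if pid == 0:
        # Child: make sure the default action applies, then wait to be killed.
        signal.signal(sig, signal.SIG_DFL)
        time.sleep(60)
        os._exit(0)
    os.kill(pid, sig)
    _, status = os.waitpid(pid, 0)
    signaled = os.WIFSIGNALED(status)
    return {
        "killed_by": signal.Signals(os.WTERMSIG(status)).name if signaled else None,
        "core_dumped": bool(signaled and os.WCOREDUMP(status)),
    }


if __name__ == "__main__":
    # The soft RLIMIT_CORE value is what `ulimit -c` reports in a shell.
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    print("core size limit (soft):", "unlimited" if soft == resource.RLIM_INFINITY else soft)
    for sig in (signal.SIGTERM, signal.SIGABRT):
        print(sig.name, kill_child_with(sig))
```

Running it before and after `ulimit -c unlimited` is a quick way to check that the limit is actually in effect for the environment the slave is started from.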

On Thu, Jun 13, 2013 at 3:29 PM, Brenden Matthews <[email protected]> wrote:

> Hmm I didn't even notice that it was SIGTERM. It seemed to stop rather
> spontaneously, and then proceeded to crash several times after (though I
> omitted that part of the log). The slaves are started with runit, which
> (afaik) doesn't send SIGTERMs randomly. It may have been something else.
>
> In the mean time, I've done 'sysctl kernel.core_uses_pid=1 && ulimit -c
> unlimited' to catch the core dumps.
>
> On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler
> <[email protected]> wrote:
>
> > The logs you provided show that a SIGTERM was received by the slave. Do you
> > have something watching and sometimes killing your slaves? Did someone
> > issue a kill on the process?
> >
> > On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews <
> > [email protected]> wrote:
> >
> > > Hey guys,
> > >
> > > I was wondering how you typically debug slave crashes. I frequently have
> > > slaves crash, and I've had limited luck with collecting core dumps because
> > > of the frequency with which it occurs (it usually crashes repeatedly, and
> > > the cores get overwritten). I figured it would be quicker to just ask how
> > > you've been doing it in the past, rather than trying to reinvent the wheel.
> > >
> > > Here's a sample from the slave log that shows a crash:
> > >
> > > > I0613 00:11:18.625746 1200 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 1024 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:18.629904 1200 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 2621440000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:18.630421 1200 cgroups_isolator.cpp:1028] Started listening for OOM events for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:18.632015 1200 cgroups_isolator.cpp:560] Forked executor at = 6789
> > > > Fetching resources into '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f'
> > > > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > Downloading resource from 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > HDFS command: hadoop fs -copyToLocal 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > I0613 00:11:26.225594 1200 slave.cpp:1412] Got registration for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:26.226132 1197 cgroups_isolator.cpp:667] Changing cgroup controls for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25; mem=21200; disk=45056; ports=[31000-31000, 31001-31001]
> > > > I0613 00:11:26.226450 1200 slave.cpp:1527] Flushing queued task Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:26.227196 1197 cgroups_isolator.cpp:939] Updated 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:26.227856 1197 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 9472 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:26.228564 1197 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 22229811200 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:27.729706 1210 status_update_manager.cpp:290] Received status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with checkpoint=false
> > > > I0613 00:11:27.730159 1210 status_update_manager.cpp:450] Creating StatusUpdate stream for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:27.730520 1210 status_update_manager.cpp:336] Forwarding status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to [email protected]:5050
> > > > I0613 00:11:27.731010 1211 slave.cpp:1821] Sending acknowledgement for status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to executor(1)@10.17.178.97:58109
> > > > I0613 00:11:27.735405 1211 status_update_manager.cpp:360] Received status update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > I0613 00:11:47.578543 1202 slave.cpp:2514] Current usage 16.54%. Max allowed age: 5.142519016183785days
> > > > I0613 00:12:47.579776 1202 slave.cpp:2514] Current usage 16.90%. Max allowed age: 5.117054416194364days
> > > > I0613 00:13:47.581866 1202 slave.cpp:2514] Current usage 17.29%. Max allowed age: 5.089524142974595days
> > > > I0613 00:14:47.585180 1209 slave.cpp:2514] Current usage 17.61%. Max allowed age: 5.067278719463831days
> > > > I0613 00:15:47.585908 1206 slave.cpp:2514] Current usage 17.87%. Max allowed age: 5.049103921517002days
> > > > I0613 00:16:47.588693 1207 slave.cpp:2514] Current usage 17.97%. Max allowed age: 5.041750330234468days
> > > > I0613 00:17:28.162240 1212 slave.cpp:1896] [email protected]:5050 exited
> > > > W0613 00:17:28.162505 1212 slave.cpp:1899] Master disconnected! Waiting for a new master to be elected
> > > > I0613 00:17:38.006042 1201 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 1 registered masters
> > > > I0613 00:17:38.008129 1201 detector.cpp:467] Master detector (slave(1)@10.17.178.97:5051) got new master pid: [email protected]:5050
> > > > I0613 00:17:38.008599 1201 slave.cpp:537] New master detected at [email protected]:5050
> > > > I0613 00:17:38.009098 1198 status_update_manager.cpp:155] New master detected at [email protected]:5050
> > > > I0613 00:17:39.059538 1203 slave.cpp:633] Re-registered with master [email protected]:5050
> > > > I0613 00:17:39.059859 1203 slave.cpp:1294] Updating framework 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@10.17.184.87:57804
> > > > I0613 00:17:46.057116 1202 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 2 registered masters
> > > > I0613 00:17:47.590699 1200 slave.cpp:2514] Current usage 18.06%. Max allowed age: 5.036055129997118days
> > > > I0613 00:18:47.592268 1201 slave.cpp:2514] Current usage 17.98%. Max allowed age: 5.041545235716157days
> > > > I0613 00:19:47.596873 1204 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049504905524051days
> > > > I0613 00:20:47.597520 1208 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049908279608947days
> > > > I0613 00:21:47.598794 1206 slave.cpp:2514] Current usage 17.55%. Max allowed age: 5.071801618813565days
> > > > I0613 00:22:47.599805 1202 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070852822503368days
> > > > I0613 00:23:47.604342 1199 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920390650185days
> > > > I0613 00:24:47.605106 1203 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920003070000days
> > > > *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you are using GNU date ***
> > > > PC: @ 0x7f53c240dd84 __pthread_cond_wait
> > > > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID 1033; stack trace: ***
> > > >     @ 0x7f53c2411cb0 (unknown)
> > > >     @ 0x7f53c240dd84 __pthread_cond_wait
> > > >     @ 0x7f53c3088f03 (unknown)
> > > >     @ 0x7f53c308961f (unknown)
> > > >     @ 0x40c75a (unknown)
> > > >     @ 0x7f53c206476d (unknown)
> > > >     @ 0x40d511 (unknown)
> > > > I0613 00:25:26.600497 9946 main.cpp:119] Creating "cgroups" isolator
> > > > I0613 00:25:26.622987 9946 main.cpp:127] Build: 2013-05-09 22:53:54 by
> > > > I0613 00:25:26.623016 9946 main.cpp:128] Starting Mesos slave
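
Since the failure line in the quoted log names both the signal and the sending PID, a small Python sketch like the one below could pull those fields out of a slave log so the sender can be looked up. The function name `parse_signal_line` and the regular expression are assumptions written against the single line shown in this thread, not a documented log format, and the "from PID" part is treated as optional in case some builds omit it.

```python
import re

# Pattern written against the "*** SIGTERM ... ***" line in the log above.
SIGNAL_LINE = re.compile(
    r"\*\*\* (?P<signal>SIG[A-Z0-9]+) \(@0x[0-9a-fA-F]+\) "
    r"received by PID (?P<pid>\d+) \(TID 0x[0-9a-fA-F]+\)"
    r"(?: from PID (?P<sender>\d+))?"
)


def parse_signal_line(line):
    """Return signal name, receiving PID, and sender PID, or None if no match."""
    match = SIGNAL_LINE.search(line)
    if match is None:
        return None
    sender = match.group("sender")
    return {
        "signal": match.group("signal"),
        "pid": int(match.group("pid")),
        "sender_pid": int(sender) if sender is not None else None,
    }


if __name__ == "__main__":
    sample = (
        "*** SIGTERM (@0x409) received by PID 1187 "
        "(TID 0x7f53c37b8740) from PID 1033; stack trace: ***"
    )
    print(parse_signal_line(sample))
```

For the line in this thread the sender is PID 1033, so checking what that process is (for example with `ps -p 1033` while it is still running) would show what sent the SIGTERM.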
