If we get a SIGTERM, the slave will terminate, and we added code to dump a stack trace on the way out, so that part is expected.
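For illustration, the dump-a-stack-trace-on-signal behavior can be sketched in Python. This is a hedged sketch only: the slave's actual handler is C++ inside Mesos, and the name `install_trace_on_signal` is made up for this example.

```python
import os
import signal
import traceback

def install_trace_on_signal(signum):
    """Install a handler that prints the current stack trace when
    `signum` arrives -- a rough sketch of the behavior described
    above (a dump on receipt of a signal, without terminating here).

    Hypothetical helper; not a Mesos API."""
    def handler(received, frame):
        # Format the stack of the interrupted frame, much like the
        # "*** SIGTERM ... stack trace: ***" lines in the slave log.
        trace = "".join(traceback.format_stack(frame))
        print("*** signal %d received; stack trace: ***" % received)
        print(trace)
    signal.signal(signum, handler)

if __name__ == "__main__":
    # Using SIGUSR1 rather than SIGTERM so the demo doesn't look
    # like a real termination request.
    install_trace_on_signal(signal.SIGUSR1)
    os.kill(os.getpid(), signal.SIGUSR1)  # triggers the dump
```

As the thread notes, a handler like this fires on _any_ signal it is registered for, which is why a routine SIGTERM can look like a crash in the logs.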
I do see some failures as well:

F0614 01:00:59.424217 29323 cgroups_isolator.cpp:1152] Failed to destroy cgroup mesos/framework_201306131743-1707151626-5050-13253-0001_executor_executor_Task_Tracker_70_tag_ab6a1b15-95f9-4a32-afd2-1e4b8ffb3b9b: Failed to kill tasks in nested cgroups: Collect failed: Failed to send Killed to process 27546: No such process

I suspect this is due to the kill code as well. The bug is in cgroups::kill, which currently assumes the cgroup is frozen! We'll have to relax that in your change (I'll comment there).

The fact that the slave never correctly registered is also interesting; glad to help debug that further!

On Thu, Jun 13, 2013 at 8:37 PM, Brenden Matthews <[email protected]> wrote:

> PS: I think it was crashing due to the cgroup killTasks() change I made.
> Maybe it was because of SIGCONT? I'll have to investigate this further.
>
> On Thu, Jun 13, 2013 at 8:10 PM, Brenden Matthews <[email protected]> wrote:
>
> > But shortly after the first 'incident', it says:
> >
> > I0614 01:00:59.320652 29256 main.cpp:119] Creating "cgroups" isolator
> > I0614 01:00:59.321116 29256 main.cpp:127] Build: 2013-05-01 18:49:40 by
> > I0614 01:00:59.321148 29256 main.cpp:128] Starting Mesos slave
> >
> > This looks like some kind of crash to me, unless I'm missing something.
> > I should also mention that after this happened the slave never correctly
> > registered with the master. I needed to manually restart it first.
> >
> > On Thu, Jun 13, 2013 at 7:37 PM, Benjamin Mahler <[email protected]> wrote:
> >
> > > I can see how this is misleading =/. The slave is not crashing; we've
> > > recently installed a signal handler that dumps a stack trace upon
> > > receiving _any_ signal. So these SIGTERMs look like crashes, but
> > > that's actually us dumping a stack trace manually. We may want to
> > > clean this up to only dump for interesting signals.
> > > The first step here would be to figure out what is issuing the
> > > SIGTERMs. Can you check out the auth logs of the machine?
> > >
> > > On Thu, Jun 13, 2013 at 6:36 PM, Brenden Matthews <[email protected]> wrote:
> > >
> > > > Okay, this problem is very strange, so I'm attaching the entire
> > > > slave log. This is happening often enough to be a nuisance. You can
> > > > see it's crashing (apparently due to SIGTERM? from what? I don't
> > > > know).
> > > >
> > > > On Thu, Jun 13, 2013 at 4:37 PM, Benjamin Mahler <[email protected]> wrote:
> > > >
> > > > > Good idea, we do the same here at twitter. Adding core dumps would
> > > > > be very useful for some signals, but there won't be a core dump
> > > > > for SIGTERM: http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7
> > > > >
> > > > > On Thu, Jun 13, 2013 at 3:29 PM, Brenden Matthews <[email protected]> wrote:
> > > > >
> > > > > > Hmm, I didn't even notice that it was SIGTERM. It seemed to stop
> > > > > > rather spontaneously, and then proceeded to crash several times
> > > > > > after (though I omitted that part of the log). The slaves are
> > > > > > started with runit, which (afaik) doesn't send SIGTERMs
> > > > > > randomly. It may have been something else.
> > > > > >
> > > > > > In the meantime, I've done 'sysctl kernel.core_uses_pid=1 &&
> > > > > > ulimit -c unlimited' to catch the core dumps.
> > > > > >
> > > > > > On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler <[email protected]> wrote:
> > > > > >
> > > > > > > The logs you provided show that a SIGTERM was received by the
> > > > > > > slave. Do you have something watching and sometimes killing
> > > > > > > your slaves? Did someone issue a kill on the process?
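As an aside, the `ulimit -c unlimited` step Brenden mentions can also be done from inside a process via `setrlimit`, the call `ulimit` wraps. A hedged Python sketch (illustrative only; the slave itself is C++, and `enable_core_dumps` is a made-up name):

```python
import resource

def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit -- the
    in-process equivalent of `ulimit -c unlimited`, bounded by the
    hard limit an unprivileged process is allowed to reach.

    The kernel.core_uses_pid sysctl, which appends the pid to core
    file names so repeated crashes don't overwrite each other's
    cores, still has to be set outside the process.

    Hypothetical helper; not a Mesos API."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return hard
```

Raising only the soft limit matters here: an unprivileged process may move its soft limit anywhere up to the hard limit, but cannot raise the hard limit itself.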
> > > > > > >
> > > > > > > On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hey guys,
> > > > > > > >
> > > > > > > > I was wondering how you typically debug slave crashes. I
> > > > > > > > frequently have slaves crash, and I've had limited luck with
> > > > > > > > collecting core dumps because of the frequency with which it
> > > > > > > > occurs (it usually crashes repeatedly, and the cores get
> > > > > > > > overwritten). I figured it would be quicker to just ask how
> > > > > > > > you've been doing it in the past, rather than trying to
> > > > > > > > reinvent the wheel.
> > > > > > > >
> > > > > > > > Here's a sample from the slave log that shows a crash:
> > > > > > > >
> > > > > > > > > I0613 00:11:18.625746 1200 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 1024 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:18.629904 1200 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 2621440000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:18.630421 1200 cgroups_isolator.cpp:1028] Started listening for OOM events for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:18.632015 1200 cgroups_isolator.cpp:560] Forked executor at 6789
> > > > > > > > > Fetching resources into '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f'
> > > > > > > > > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > Downloading resource from 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > HDFS command: hadoop fs -copyToLocal 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > I0613 00:11:26.225594 1200 slave.cpp:1412] Got registration for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:26.226132 1197 cgroups_isolator.cpp:667] Changing cgroup controls for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25; mem=21200; disk=45056; ports=[31000-31000, 31001-31001]
> > > > > > > > > I0613 00:11:26.226450 1200 slave.cpp:1527] Flushing queued task Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:26.227196 1197 cgroups_isolator.cpp:939] Updated 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:26.227856 1197 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 9472 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:26.228564 1197 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 22229811200 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:27.729706 1210 status_update_manager.cpp:290] Received status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with checkpoint=false
> > > > > > > > > I0613 00:11:27.730159 1210 status_update_manager.cpp:450] Creating StatusUpdate stream for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:27.730520 1210 status_update_manager.cpp:336] Forwarding status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to [email protected]:5050
> > > > > > > > > I0613 00:11:27.731010 1211 slave.cpp:1821] Sending acknowledgement for status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to executor(1)@10.17.178.97:58109
> > > > > > > > > I0613 00:11:27.735405 1211 status_update_manager.cpp:360] Received status update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > I0613 00:11:47.578543 1202 slave.cpp:2514] Current usage 16.54%. Max allowed age: 5.142519016183785days
> > > > > > > > > I0613 00:12:47.579776 1202 slave.cpp:2514] Current usage 16.90%. Max allowed age: 5.117054416194364days
> > > > > > > > > I0613 00:13:47.581866 1202 slave.cpp:2514] Current usage 17.29%. Max allowed age: 5.089524142974595days
> > > > > > > > > I0613 00:14:47.585180 1209 slave.cpp:2514] Current usage 17.61%. Max allowed age: 5.067278719463831days
> > > > > > > > > I0613 00:15:47.585908 1206 slave.cpp:2514] Current usage 17.87%. Max allowed age: 5.049103921517002days
> > > > > > > > > I0613 00:16:47.588693 1207 slave.cpp:2514] Current usage 17.97%. Max allowed age: 5.041750330234468days
> > > > > > > > > I0613 00:17:28.162240 1212 slave.cpp:1896] [email protected]:5050 exited
> > > > > > > > > W0613 00:17:28.162505 1212 slave.cpp:1899] Master disconnected! Waiting for a new master to be elected
> > > > > > > > > I0613 00:17:38.006042 1201 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 1 registered masters
> > > > > > > > > I0613 00:17:38.008129 1201 detector.cpp:467] Master detector (slave(1)@10.17.178.97:5051) got new master pid: [email protected]:5050
> > > > > > > > > I0613 00:17:38.008599 1201 slave.cpp:537] New master detected at [email protected]:5050
> > > > > > > > > I0613 00:17:38.009098 1198 status_update_manager.cpp:155] New master detected at [email protected]:5050
> > > > > > > > > I0613 00:17:39.059538 1203 slave.cpp:633] Re-registered with master [email protected]:5050
> > > > > > > > > I0613 00:17:39.059859 1203 slave.cpp:1294] Updating framework 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@10.17.184.87:57804
> > > > > > > > > I0613 00:17:46.057116 1202 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 2 registered masters
> > > > > > > > > I0613 00:17:47.590699 1200 slave.cpp:2514] Current usage 18.06%. Max allowed age: 5.036055129997118days
> > > > > > > > > I0613 00:18:47.592268 1201 slave.cpp:2514] Current usage 17.98%. Max allowed age: 5.041545235716157days
> > > > > > > > > I0613 00:19:47.596873 1204 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049504905524051days
> > > > > > > > > I0613 00:20:47.597520 1208 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049908279608947days
> > > > > > > > > I0613 00:21:47.598794 1206 slave.cpp:2514] Current usage 17.55%. Max allowed age: 5.071801618813565days
> > > > > > > > > I0613 00:22:47.599805 1202 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070852822503368days
> > > > > > > > > I0613 00:23:47.604342 1199 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920390650185days
> > > > > > > > > I0613 00:24:47.605106 1203 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920003070000days
> > > > > > > > > *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you are using GNU date ***
> > > > > > > > > PC: @ 0x7f53c240dd84 __pthread_cond_wait
> > > > > > > > > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID 1033; stack trace: ***
> > > > > > > > > @ 0x7f53c2411cb0 (unknown)
> > > > > > > > > @ 0x7f53c240dd84 __pthread_cond_wait
> > > > > > > > > @ 0x7f53c3088f03 (unknown)
> > > > > > > > > @ 0x7f53c308961f (unknown)
> > > > > > > > > @ 0x40c75a (unknown)
> > > > > > > > > @ 0x7f53c206476d (unknown)
> > > > > > > > > @ 0x40d511 (unknown)
> > > > > > > > > I0613 00:25:26.600497 9946 main.cpp:119] Creating "cgroups" isolator
> > > > > > > > > I0613 00:25:26.622987 9946 main.cpp:127] Build: 2013-05-09 22:53:54 by
> > > > > > > > > I0613 00:25:26.623016 9946 main.cpp:128] Starting Mesos slave
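For reference, the cgroups::kill problem described at the top of the thread comes down to a race: if the cgroup is not frozen first, tasks can exit (or fork) while the kill loop iterates, so a strict kill loop reports "No such process" as a failure. A minimal, hypothetical sketch of the relaxed behavior in Python (not Mesos code, which is C++; `kill_pids` is a made-up name):

```python
import os
import signal

def kill_pids(pids, signum=signal.SIGKILL):
    """Send `signum` to each pid, tolerating processes that are
    already gone. ESRCH ("No such process") is treated as success
    rather than an error, since an already-exited task is exactly
    the end state we want -- the strict version of this check is
    what produced the 'Failed to send Killed to process 27546:
    No such process' failure quoted above.

    Returns a list of (pid, error) pairs for genuine failures."""
    failed = []
    for pid in pids:
        try:
            os.kill(pid, signum)
        except ProcessLookupError:
            # Process already exited between listing and killing.
            continue
        except OSError as e:
            failed.append((pid, e))
    return failed
```

Freezing the cgroup before killing avoids the race entirely (no task can exit or fork while frozen); tolerating ESRCH is the fallback when the cgroup may be thawed, as discussed above.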
