Unfortunately I didn't see your comment in Review Board until after I came to the same conclusion myself :)
Anyway, ignoring ESRCH seems to be the trick, but I'll need further testing.
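
For reference, here's roughly what I mean by "ignoring ESRCH" (just a sketch, not the actual cgroups::kill code; the helper name and the tasks-file argument are made up):

    // Hypothetical helper, not the real cgroups::kill: signal every PID
    // listed in a cgroup's 'tasks' file and treat ESRCH as success, since
    // a process can exit on its own between the read and the kill when
    // the cgroup isn't frozen.
    #include <errno.h>
    #include <signal.h>
    #include <sys/types.h>

    #include <fstream>
    #include <string>

    bool killCgroupTasks(const std::string& tasksFile, int sig)
    {
      std::ifstream in(tasksFile.c_str());
      if (!in.is_open()) {
        return false;  // The cgroup may already be gone.
      }

      bool ok = true;
      pid_t pid;
      while (in >> pid) {
        if (::kill(pid, sig) == -1 && errno != ESRCH) {
          ok = false;  // Real failure (e.g. EPERM); ESRCH just means the
                       // process is already dead, which is what we wanted.
        }
      }
      return ok;
    }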

On Fri, Jun 14, 2013 at 10:07 AM, Benjamin Mahler <[email protected]> wrote:

> If we get a SIGTERM, the slave will terminate, and we added code to dump a
> stack trace on the way out, so that part is expected.
>
> I do see some failures as well:
>
> F0614 01:00:59.424217 29323 cgroups_isolator.cpp:1152] Failed to destroy cgroup mesos/framework_201306131743-1707151626-5050-13253-0001_executor_executor_Task_Tracker_70_tag_ab6a1b15-95f9-4a32-afd2-1e4b8ffb3b9b: Failed to kill tasks in nested cgroups: Collect failed: Failed to send Killed to process 27546: No such process
>
> I would suspect this is due to the kill code as well, the bug is in
> cgroups::kill, it currently assumes the cgroup is frozen! We'll have to
> relax that in your change (I'll comment there).
>
> The fact that the slave never correctly registered is also interesting,
> glad to help debug that further!
>
> On Thu, Jun 13, 2013 at 8:37 PM, Brenden Matthews <[email protected]> wrote:
>
> > PS: I think it was crashing due to the cgroup killTasks() change I made.
> > Maybe it was because of SIGCONT? I'll have to investigate this further.
> >
> > On Thu, Jun 13, 2013 at 8:10 PM, Brenden Matthews <[email protected]> wrote:
> >
> > > But shortly after the first 'incident', it says:
> > >
> > > I0614 01:00:59.320652 29256 main.cpp:119] Creating "cgroups" isolator
> > > I0614 01:00:59.321116 29256 main.cpp:127] Build: 2013-05-01 18:49:40 by
> > > I0614 01:00:59.321148 29256 main.cpp:128] Starting Mesos slave
> > >
> > > This looks like some kind of crash to me, unless I'm missing something. I
> > > should also mention that after this happened the slave never correctly
> > > registered with the master. I needed to manually restart it first.
> > >
> > > On Thu, Jun 13, 2013 at 7:37 PM, Benjamin Mahler <[email protected]> wrote:
> > >
> > > > I can see how this is misleading =/. The slave is not crashing, we've
> > > > recently installed a signal handler that dumps a stacktrace upon receiving
> > > > _any_ signal. So these SIGTERMs look like crashes, but that's actually us
> > > > dumping a stacktrace manually. We may want to clean this up to only dump
> > > > for interesting signals.
> > > >
> > > > The first step here would be to figure out what is issuing the SIGTERMs,
> > > > can you check out the auth logs of the machine?
> > > >
> > > > On Thu, Jun 13, 2013 at 6:36 PM, Brenden Matthews <[email protected]> wrote:
> > > >
> > > > > Okay, this problem is very strange so I'm attaching the entire slave log.
> > > > > This is happening often enough to be a nuisance. You can see it's
> > > > > crashing (apparently due to SIGTERM? from what? I don't know).
> > > > >
> > > > > On Thu, Jun 13, 2013 at 4:37 PM, Benjamin Mahler <[email protected]> wrote:
> > > > >
> > > > > > Good idea, we do the same here at twitter. Adding core dumps would be
> > > > > > very useful for some signals, but there won't be a core dump for SIGTERM:
> > > > > > http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7
> > > > > >
> > > > > > On Thu, Jun 13, 2013 at 3:29 PM, Brenden Matthews <[email protected]> wrote:
> > > > > >
> > > > > > > Hmm I didn't even notice that it was SIGTERM. It seemed to stop rather
> > > > > > > spontaneously, and then proceeded to crash several times after (though I
> > > > > > > omitted that part of the log). The slaves are started with runit, which
> > > > > > > (afaik) doesn't send SIGTERMs randomly. It may have been something else.
> > > > > > >
> > > > > > > In the mean time, I've done 'sysctl kernel.core_uses_pid=1 && ulimit -c
> > > > > > > unlimited' to catch the core dumps.
> > > > > > >
> > > > > > > On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler <[email protected]> wrote:
> > > > > > >
> > > > > > > > The logs you provided show that a SIGTERM was received by the slave. Do
> > > > > > > > you have something watching and sometimes killing your slaves? Did someone
> > > > > > > > issue a kill on the process?
> > > > > > > >
> > > > > > > > On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hey guys,
> > > > > > > > >
> > > > > > > > > I was wondering how you typically debug slave crashes. I frequently have
> > > > > > > > > slaves crash, and I've had limited luck with collecting core dumps because
> > > > > > > > > of the frequency with which it occurs (it usually crashes repeatedly, and
> > > > > > > > > the cores get overwritten). I figured it would be quicker to just ask how
> > > > > > > > > you've been doing it in the past, rather than trying to reinvent the wheel.
> > > > > > > > >
> > > > > > > > > Here's a sample from the slave log that shows a crash:
> > > > > > > > >
> > > > > > > > > > I0613 00:11:18.625746 1200 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 1024 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:18.629904 1200 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 2621440000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:18.630421 1200 cgroups_isolator.cpp:1028] Started listening for OOM events for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:18.632015 1200 cgroups_isolator.cpp:560] Forked executor at = 6789
> > > > > > > > > > Fetching resources into '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f'
> > > > > > > > > > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > > Downloading resource from 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > > HDFS command: hadoop fs -copyToLocal 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > > > > > > > I0613 00:11:26.225594 1200 slave.cpp:1412] Got registration for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:26.226132 1197 cgroups_isolator.cpp:667] Changing cgroup controls for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25; mem=21200; disk=45056; ports=[31000-31000, 31001-31001]
> > > > > > > > > > I0613 00:11:26.226450 1200 slave.cpp:1527] Flushing queued task Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:26.227196 1197 cgroups_isolator.cpp:939] Updated 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:26.227856 1197 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 9472 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:26.228564 1197 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 22229811200 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:27.729706 1210 status_update_manager.cpp:290] Received status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with checkpoint=false
> > > > > > > > > > I0613 00:11:27.730159 1210 status_update_manager.cpp:450] Creating StatusUpdate stream for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:27.730520 1210 status_update_manager.cpp:336] Forwarding status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to [email protected]:5050
> > > > > > > > > > I0613 00:11:27.731010 1211 slave.cpp:1821] Sending acknowledgement for status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to executor(1)@10.17.178.97:58109
> > > > > > > > > > I0613 00:11:27.735405 1211 status_update_manager.cpp:360] Received status update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
> > > > > > > > > > I0613 00:11:47.578543 1202 slave.cpp:2514] Current usage 16.54%. Max allowed age: 5.142519016183785days
> > > > > > > > > > I0613 00:12:47.579776 1202 slave.cpp:2514] Current usage 16.90%. Max allowed age: 5.117054416194364days
> > > > > > > > > > I0613 00:13:47.581866 1202 slave.cpp:2514] Current usage 17.29%. Max allowed age: 5.089524142974595days
> > > > > > > > > > I0613 00:14:47.585180 1209 slave.cpp:2514] Current usage 17.61%. Max allowed age: 5.067278719463831days
> > > > > > > > > > I0613 00:15:47.585908 1206 slave.cpp:2514] Current usage 17.87%. Max allowed age: 5.049103921517002days
> > > > > > > > > > I0613 00:16:47.588693 1207 slave.cpp:2514] Current usage 17.97%. Max allowed age: 5.041750330234468days
> > > > > > > > > > I0613 00:17:28.162240 1212 slave.cpp:1896] [email protected]:5050 exited
> > > > > > > > > > W0613 00:17:28.162505 1212 slave.cpp:1899] Master disconnected! Waiting for a new master to be elected
> > > > > > > > > > I0613 00:17:38.006042 1201 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 1 registered masters
> > > > > > > > > > I0613 00:17:38.008129 1201 detector.cpp:467] Master detector (slave(1)@10.17.178.97:5051) got new master pid: [email protected]:5050
> > > > > > > > > > I0613 00:17:38.008599 1201 slave.cpp:537] New master detected at [email protected]:5050
> > > > > > > > > > I0613 00:17:38.009098 1198 status_update_manager.cpp:155] New master detected at [email protected]:5050
> > > > > > > > > > I0613 00:17:39.059538 1203 slave.cpp:633] Re-registered with master [email protected]:5050
> > > > > > > > > > I0613 00:17:39.059859 1203 slave.cpp:1294] Updating framework 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@10.17.184.87:57804
> > > > > > > > > > I0613 00:17:46.057116 1202 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 2 registered masters
> > > > > > > > > > I0613 00:17:47.590699 1200 slave.cpp:2514] Current usage 18.06%. Max allowed age: 5.036055129997118days
> > > > > > > > > > I0613 00:18:47.592268 1201 slave.cpp:2514] Current usage 17.98%. Max allowed age: 5.041545235716157days
> > > > > > > > > > I0613 00:19:47.596873 1204 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049504905524051days
> > > > > > > > > > I0613 00:20:47.597520 1208 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049908279608947days
> > > > > > > > > > I0613 00:21:47.598794 1206 slave.cpp:2514] Current usage 17.55%. Max allowed age: 5.071801618813565days
> > > > > > > > > > I0613 00:22:47.599805 1202 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070852822503368days
> > > > > > > > > > I0613 00:23:47.604342 1199 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920390650185days
> > > > > > > > > > I0613 00:24:47.605106 1203 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920003070000days
> > > > > > > > > > *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you are using GNU date ***
> > > > > > > > > > PC: @ 0x7f53c240dd84 __pthread_cond_wait
> > > > > > > > > > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID 1033; stack trace: ***
> > > > > > > > > > @ 0x7f53c2411cb0 (unknown)
> > > > > > > > > > @ 0x7f53c240dd84 __pthread_cond_wait
> > > > > > > > > > @ 0x7f53c3088f03 (unknown)
> > > > > > > > > > @ 0x7f53c308961f (unknown)
> > > > > > > > > > @ 0x40c75a (unknown)
> > > > > > > > > > @ 0x7f53c206476d (unknown)
> > > > > > > > > > @ 0x40d511 (unknown)
> > > > > > > > > > I0613 00:25:26.600497 9946 main.cpp:119] Creating "cgroups" isolator
> > > > > > > > > > I0613 00:25:26.622987 9946 main.cpp:127] Build: 2013-05-09 22:53:54 by
> > > > > > > > > > I0613 00:25:26.623016 9946 main.cpp:128] Starting Mesos slave
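
PS: a footnote on the "dumps a stacktrace upon receiving _any_ signal" behaviour quoted above, since it confused me at first: a handler along these lines (a rough sketch, not Mesos's actual handler) is enough to make a routine SIGTERM print the kind of trace that shows up in the log, so it looks like a crash even when it isn't one.

    /* Sketch: print a stack trace for any signal this handler is
     * installed for, then re-raise with the default action. */
    #include <execinfo.h>
    #include <signal.h>
    #include <unistd.h>

    static void traceHandler(int sig)
    {
      void* frames[64];
      int n = backtrace(frames, 64);
      /* backtrace_symbols_fd writes straight to the fd, avoiding malloc. */
      backtrace_symbols_fd(frames, n, STDERR_FILENO);
      /* Restore the default action so the process still terminates. */
      signal(sig, SIG_DFL);
      raise(sig);
    }

    int main(void)
    {
      signal(SIGTERM, traceHandler);
      pause();  /* Send SIGTERM to this PID to see the trace. */
      return 0;
    }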
