But shortly after the first 'incident', it says:

I0614 01:00:59.320652 29256 main.cpp:119] Creating "cgroups" isolator
I0614 01:00:59.321116 29256 main.cpp:127] Build: 2013-05-01 18:49:40 by
I0614 01:00:59.321148 29256 main.cpp:128] Starting Mesos slave
This looks like some kind of crash to me, unless I'm missing something. I
should also mention that after this happened the slave never correctly
registered with the master. I needed to manually restart it first.

On Thu, Jun 13, 2013 at 7:37 PM, Benjamin Mahler <[email protected]> wrote:

> I can see how this is misleading =/. The slave is not crashing, we've
> recently installed a signal handler that dumps a stacktrace upon receiving
> _any_ signal. So these SIGTERMs look like crashes, but that's actually us
> dumping a stacktrace manually. We may want to clean this up to only dump
> for interesting signals.
>
> The first step here would be to figure out what is issuing the SIGTERMs,
> can you check out the auth logs of the machine?
>
> On Thu, Jun 13, 2013 at 6:36 PM, Brenden Matthews <[email protected]> wrote:
>
>> Okay, this problem is very strange so I'm attaching the entire slave log.
>> This is happening often enough to be a nuisance. You can see it's
>> crashing (apparently due to SIGTERM? from what? I don't know).
>>
>> On Thu, Jun 13, 2013 at 4:37 PM, Benjamin Mahler <[email protected]> wrote:
>>
>>> Good idea, we do the same here at twitter. Adding core dumps would be
>>> very useful for some signals, but there won't be a core dump for SIGTERM:
>>> http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7
>>>
>>> On Thu, Jun 13, 2013 at 3:29 PM, Brenden Matthews <[email protected]> wrote:
>>>
>>>> Hmm I didn't even notice that it was SIGTERM. It seemed to stop rather
>>>> spontaneously, and then proceeded to crash several times after (though I
>>>> omitted that part of the log). The slaves are started with runit, which
>>>> (afaik) doesn't send SIGTERMs randomly. It may have been something else.
>>>>
>>>> In the mean time, I've done 'sysctl kernel.core_uses_pid=1 && ulimit -c
>>>> unlimited' to catch the core dumps.
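[Aside: the 'ulimit -c unlimited' step above can also be done from inside the
process itself via setrlimit(2). A minimal sketch, assuming a POSIX host; the
function name `enable_core_dumps` is made up for illustration, and note that
SIGTERM still produces no core regardless of this limit, since its default
action is to terminate without dumping.]

```cpp
#include <sys/resource.h>

// Raise the soft core-file size limit up to the hard limit: the in-process
// equivalent of `ulimit -c unlimited` (when the hard limit is unlimited).
// Returns true on success.
bool enable_core_dumps() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_CORE, &rl) != 0) {
    return false;  // could not read the current limit
  }
  rl.rlim_cur = rl.rlim_max;  // soft limit may be raised as far as the hard limit
  return setrlimit(RLIMIT_CORE, &rl) == 0;
}
```

[The kernel.core_uses_pid sysctl still has to be set outside the process, as
done above, so that repeated crashes don't overwrite one another's cores.]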
>>>>
>>>> On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler <[email protected]> wrote:
>>>>
>>>>> The logs you provided show that a SIGTERM was received by the slave.
>>>>> Do you have something watching and sometimes killing your slaves? Did
>>>>> someone issue a kill on the process?
>>>>>
>>>>> On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews <[email protected]> wrote:
>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> I was wondering how you typically debug slave crashes. I frequently
>>>>>> have slaves crash, and I've had limited luck with collecting core
>>>>>> dumps because of the frequency with which it occurs (it usually
>>>>>> crashes repeatedly, and the cores get overwritten). I figured it
>>>>>> would be quicker to just ask how you've been doing it in the past,
>>>>>> rather than trying to reinvent the wheel.
>>>>>>
>>>>>> Here's a sample from the slave log that shows a crash:
>>>>>>
>>>>>>> I0613 00:11:18.625746 1200 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 1024 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:18.629904 1200 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 2621440000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:18.630421 1200 cgroups_isolator.cpp:1028] Started listening for OOM events for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:18.632015 1200 cgroups_isolator.cpp:560] Forked executor at = 6789
>>>>>>> Fetching resources into '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f'
>>>>>>> Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>>>>>>> Downloading resource from 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>>>>>>> HDFS command: hadoop fs -copyToLocal 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>>>>>>> Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>>>>>>> I0613 00:11:26.225594 1200 slave.cpp:1412] Got registration for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:26.226132 1197 cgroups_isolator.cpp:667] Changing cgroup controls for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25; mem=21200; disk=45056; ports=[31000-31000, 31001-31001]
>>>>>>> I0613 00:11:26.226450 1200 slave.cpp:1527] Flushing queued task Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:26.227196 1197 cgroups_isolator.cpp:939] Updated 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:26.227856 1197 cgroups_isolator.cpp:864] Updated 'cpu.shares' to 9472 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:26.228564 1197 cgroups_isolator.cpp:1002] Updated 'memory.limit_in_bytes' to 22229811200 for executor executor_Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:27.729706 1210 status_update_manager.cpp:290] Received status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with checkpoint=false
>>>>>>> I0613 00:11:27.730159 1210 status_update_manager.cpp:450] Creating StatusUpdate stream for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:27.730520 1210 status_update_manager.cpp:336] Forwarding status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to [email protected]:5050
>>>>>>> I0613 00:11:27.731010 1211 slave.cpp:1821] Sending acknowledgement for status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 to executor(1)@10.17.178.97:58109
>>>>>>> I0613 00:11:27.735405 1211 status_update_manager.cpp:360] Received status update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000
>>>>>>> I0613 00:11:47.578543 1202 slave.cpp:2514] Current usage 16.54%. Max allowed age: 5.142519016183785days
>>>>>>> I0613 00:12:47.579776 1202 slave.cpp:2514] Current usage 16.90%. Max allowed age: 5.117054416194364days
>>>>>>> I0613 00:13:47.581866 1202 slave.cpp:2514] Current usage 17.29%. Max allowed age: 5.089524142974595days
>>>>>>> I0613 00:14:47.585180 1209 slave.cpp:2514] Current usage 17.61%. Max allowed age: 5.067278719463831days
>>>>>>> I0613 00:15:47.585908 1206 slave.cpp:2514] Current usage 17.87%. Max allowed age: 5.049103921517002days
>>>>>>> I0613 00:16:47.588693 1207 slave.cpp:2514] Current usage 17.97%. Max allowed age: 5.041750330234468days
>>>>>>> I0613 00:17:28.162240 1212 slave.cpp:1896] [email protected]:5050 exited
>>>>>>> W0613 00:17:28.162505 1212 slave.cpp:1899] Master disconnected! Waiting for a new master to be elected
>>>>>>> I0613 00:17:38.006042 1201 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 1 registered masters
>>>>>>> I0613 00:17:38.008129 1201 detector.cpp:467] Master detector (slave(1)@10.17.178.97:5051) got new master pid: [email protected]:5050
>>>>>>> I0613 00:17:38.008599 1201 slave.cpp:537] New master detected at [email protected]:5050
>>>>>>> I0613 00:17:38.009098 1198 status_update_manager.cpp:155] New master detected at [email protected]:5050
>>>>>>> I0613 00:17:39.059538 1203 slave.cpp:633] Re-registered with master [email protected]:5050
>>>>>>> I0613 00:17:39.059859 1203 slave.cpp:1294] Updating framework 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@10.17.184.87:57804
>>>>>>> I0613 00:17:46.057116 1202 detector.cpp:420] Master detector (slave(1)@10.17.178.97:5051) found 2 registered masters
>>>>>>> I0613 00:17:47.590699 1200 slave.cpp:2514] Current usage 18.06%. Max allowed age: 5.036055129997118days
>>>>>>> I0613 00:18:47.592268 1201 slave.cpp:2514] Current usage 17.98%. Max allowed age: 5.041545235716157days
>>>>>>> I0613 00:19:47.596873 1204 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049504905524051days
>>>>>>> I0613 00:20:47.597520 1208 slave.cpp:2514] Current usage 17.86%. Max allowed age: 5.049908279608947days
>>>>>>> I0613 00:21:47.598794 1206 slave.cpp:2514] Current usage 17.55%. Max allowed age: 5.071801618813565days
>>>>>>> I0613 00:22:47.599805 1202 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070852822503368days
>>>>>>> I0613 00:23:47.604342 1199 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920390650185days
>>>>>>> I0613 00:24:47.605106 1203 slave.cpp:2514] Current usage 17.56%. Max allowed age: 5.070920003070000days
>>>>>>> *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you are using GNU date ***
>>>>>>> PC: @ 0x7f53c240dd84 __pthread_cond_wait
>>>>>>> *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID 1033; stack trace: ***
>>>>>>> @ 0x7f53c2411cb0 (unknown)
>>>>>>> @ 0x7f53c240dd84 __pthread_cond_wait
>>>>>>> @ 0x7f53c3088f03 (unknown)
>>>>>>> @ 0x7f53c308961f (unknown)
>>>>>>> @ 0x40c75a (unknown)
>>>>>>> @ 0x7f53c206476d (unknown)
>>>>>>> @ 0x40d511 (unknown)
>>>>>>> I0613 00:25:26.600497 9946 main.cpp:119] Creating "cgroups" isolator
>>>>>>> I0613 00:25:26.622987 9946 main.cpp:127] Build: 2013-05-09 22:53:54 by
>>>>>>> I0613 00:25:26.623016 9946 main.cpp:128] Starting Mesos slave
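[Aside: the "from PID 1033" in the trace above comes from siginfo. A handler
installed with the SA_SIGINFO flag receives a siginfo_t whose si_pid field
names the process that sent the signal, which is how a dumper can report the
SIGTERM's sender. A minimal sketch, not the actual Mesos handler, with
illustrative names, assuming a POSIX host:]

```cpp
#include <csignal>
#include <unistd.h>

// Records the PID of whoever last sent us a SIGTERM.
static volatile sig_atomic_t last_sender_pid = 0;

static void term_handler(int /*sig*/, siginfo_t* info, void* /*ucontext*/) {
  // For signals sent via kill(2), si_pid is the sender's process ID --
  // the source of the "from PID 1033" in the stack trace banner above.
  last_sender_pid = info->si_pid;
}

// Install the SIGTERM handler; returns true on success.
bool install_term_handler() {
  struct sigaction sa = {};
  sa.sa_sigaction = term_handler;
  sa.sa_flags = SA_SIGINFO;  // request the extended siginfo_t argument
  sigemptyset(&sa.sa_mask);
  return sigaction(SIGTERM, &sa, nullptr) == 0;
}
```

[Once a sender PID is logged, checking it quickly with `ps -p <pid>` (or in
the auth logs, as suggested earlier in the thread) matters, since PIDs get
reused.]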
