PS: I think it was crashing due to the cgroup killTasks() change I made. Maybe it was because of SIGCONT? I'll have to investigate this further.
On Thu, Jun 13, 2013 at 8:10 PM, Brenden Matthews < [email protected]> wrote: > But shortly after the first 'incident', it says: > > I0614 01:00:59.320652 29256 main.cpp:119] Creating "cgroups" isolator > I0614 01:00:59.321116 29256 main.cpp:127] Build: 2013-05-01 18:49:40 by > I0614 01:00:59.321148 29256 main.cpp:128] Starting Mesos slave > > This looks like some kind of crash to me, unless I'm missing something. I > should also mention that after this happened the slave never correctly > registered with the master. I needed to manually restart it first. > > > On Thu, Jun 13, 2013 at 7:37 PM, Benjamin Mahler < > [email protected]> wrote: > >> I can see how this is misleading =/. The slave is not crashing, we've >> recently installed a signal handler that dumps a stacktrace upon receiving >> _any_ signal. So these SIGTERMs look like crashes, but that's actually us >> dumping a stacktrace manually. We may want to clean this up to only dump >> for interesting signals. >> >> The first step here would be to figure out what is issuing the SIGTERMs, >> can you check out the auth logs of the machine? >> >> >> On Thu, Jun 13, 2013 at 6:36 PM, Brenden Matthews < >> [email protected]> wrote: >> >> > Okay, this problem is very strange so I'm attaching the entire slave >> log. >> > This is happening often enough to be a nuisance. You can see it's >> > crashing (apparently due to SIGTERM? from what? I don't know). >> > >> > >> > On Thu, Jun 13, 2013 at 4:37 PM, Benjamin Mahler < >> > [email protected]> wrote: >> > >> >> Good idea, we do the same here at twitter. Adding core dumps would be >> very >> >> useful for some signals, but there won't be a core dump for SIGTERM: >> >> http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7 >> >> >> >> >> >> On Thu, Jun 13, 2013 at 3:29 PM, Brenden Matthews < >> >> [email protected]> wrote: >> >> >> >> > Hmm I didn't even notice that it was SIGTERM. It seemed to stop >> rather >> >> > spontaneously, and then proceeded to crash several times after >> (though I >> >> > omitted that part of the log). The slaves are started with runit, >> which >> >> > (afaik) doesn't send SIGTERMs randomly. It may have been something >> else. >> >> > >> >> > In the mean time, I've done 'sysctl kernel.core_uses_pid=1 && ulimit >> -c >> >> > unlimited' to catch the core dumps. >> >> > >> >> > On Thu, Jun 13, 2013 at 2:06 PM, Benjamin Mahler >> >> > <[email protected]>wrote: >> >> > >> >> > > The logs you provided show that a SIGTERM was received by the >> slave. >> >> Do >> >> > you >> >> > > have something watching and sometimes killing your slaves? Did >> someone >> >> > > issue a kill on the process? >> >> > > >> >> > > >> >> > > On Wed, Jun 12, 2013 at 5:41 PM, Brenden Matthews < >> >> > > [email protected]> wrote: >> >> > > >> >> > > > Hey guys, >> >> > > > >> >> > > > I was wondering how you typically debug slave crashes. I >> frequently >> >> > have >> >> > > > slaves crash, and I've had limited luck with collecting core >> dumps >> >> > > because >> >> > > > of the frequency with which it occurs (it usually crashes >> >> repeatedly, >> >> > and >> >> > > > the cores get overwritten). I figured it would be quicker to >> just >> >> ask >> >> > > how >> >> > > > you've been doing it in the past, rather than trying to reinvent >> the >> >> > > wheel. >> >> > > > >> >> > > > Here's a sample from the slave log that shows a crash: >> >> > > > >> >> > > > I0613 00:11:18.625746 1200 cgroups_isolator.cpp:864] Updated >> >> > > 'cpu.shares' >> >> > > > > to 1024 for executor executor_Task_Tracker_129 of framework 20 >> >> > > > > 1306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:18.629904 1200 cgroups_isolator.cpp:1002] Updated >> >> > > > > 'memory.limit_in_bytes' to 2621440000 for executor >> >> > > > executor_Task_Tracker_1 >> >> > > > > 29 of framework 201306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:18.630421 1200 cgroups_isolator.cpp:1028] Started >> >> > > listening >> >> > > > > for OOM events for executor executor_Task_Tracker_129 of >> framewo >> >> > > > > rk 201306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:18.632015 1200 cgroups_isolator.cpp:560] Forked >> >> executor >> >> > > at >> >> > > > = >> >> > > > > 6789 >> >> > > > > Fetching resources into >> >> > > > > >> >> > > > >> >> > > >> >> > >> >> >> '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f' >> >> > > > > Fetching resource >> >> > > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> >> > > > > Downloading resource from >> >> > > > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> >> > > > > HDFS command: hadoop fs -copyToLocal >> >> > > > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> >> > > > > './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> >> > > > > Extracting resource: tar xJf >> >> > './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> >> > > > > I0613 00:11:26.225594 1200 slave.cpp:1412] Got registration >> for >> >> > > executor >> >> > > > > 'executor_Task_Tracker_129' of framework >> >> > > > > 201306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:26.226132 1197 cgroups_isolator.cpp:667] Changing >> >> cgroup >> >> > > > > controls for executor executor_Task_Tracker_129 of framework >> >> > > > > 201306122129-1707151626-5050-5724-0000 with resources >> cpus=9.25; >> >> > > > mem=21200; >> >> > > > > disk=45056; ports=[31000-31000, 31001-31001] >> >> > > > > I0613 00:11:26.226450 1200 slave.cpp:1527] Flushing queued >> task >> >> > > > > Task_Tracker_129 for executor 'executor_Task_Tracker_129' of >> >> > framework >> >> > > > > 201306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:26.227196 1197 cgroups_isolator.cpp:939] Updated >> >> > > > > 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 >> for >> >> > > > executor >> >> > > > > executor_Task_Tracker_129 of framework >> >> > > > > 201306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:26.227856 1197 cgroups_isolator.cpp:864] Updated >> >> > > > 'cpu.shares' >> >> > > > > to 9472 for executor executor_Task_Tracker_129 of framework >> >> > > > > 201306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:26.228564 1197 cgroups_isolator.cpp:1002] Updated >> >> > > > > 'memory.limit_in_bytes' to 22229811200 for executor >> >> > > > > executor_Task_Tracker_129 of framework >> >> > > > > 201306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:27.729706 1210 status_update_manager.cpp:290] >> >> Received >> >> > > > status >> >> > > > > update TASK_RUNNING (UUID: >> e984eac5-a9d7-4546-a7e9-444f5f018cab) >> >> for >> >> > > task >> >> > > > > Task_Tracker_129 of framework >> >> 201306122129-1707151626-5050-5724-0000 >> >> > > with >> >> > > > > checkpoint=false >> >> > > > > I0613 00:11:27.730159 1210 status_update_manager.cpp:450] >> >> Creating >> >> > > > > StatusUpdate stream for task Task_Tracker_129 of framework >> >> > > > > 201306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:27.730520 1210 status_update_manager.cpp:336] >> >> Forwarding >> >> > > > > status update TASK_RUNNING (UUID: >> >> > e984eac5-a9d7-4546-a7e9-444f5f018cab) >> >> > > > for >> >> > > > > task Task_Tracker_129 of framework >> >> > > 201306122129-1707151626-5050-5724-0000 >> >> > > > > to [email protected]:5050 >> >> > > > > I0613 00:11:27.731010 1211 slave.cpp:1821] Sending >> >> acknowledgement >> >> > for >> >> > > > > status update TASK_RUNNING (UUID: >> >> > e984eac5-a9d7-4546-a7e9-444f5f018cab) >> >> > > > for >> >> > > > > task Task_Tracker_129 of framework >> >> > > 201306122129-1707151626-5050-5724-0000 >> >> > > > > to executor(1)@10.17.178.97:58109 >> >> > > > > I0613 00:11:27.735405 1211 status_update_manager.cpp:360] >> >> Received >> >> > > > status >> >> > > > > update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for >> >> task >> >> > > > > Task_Tracker_129 of framework >> >> 201306122129-1707151626-5050-5724-0000 >> >> > > > > I0613 00:11:47.578543 1202 slave.cpp:2514] Current usage >> 16.54%. >> >> Max >> >> > > > > allowed age: 5.142519016183785days >> >> > > > > I0613 00:12:47.579776 1202 slave.cpp:2514] Current usage >> 16.90%. >> >> Max >> >> > > > > allowed age: 5.117054416194364days >> >> > > > > I0613 00:13:47.581866 1202 slave.cpp:2514] Current usage >> 17.29%. >> >> Max >> >> > > > > allowed age: 5.089524142974595days >> >> > > > > I0613 00:14:47.585180 1209 slave.cpp:2514] Current usage >> 17.61%. >> >> Max >> >> > > > > allowed age: 5.067278719463831days >> >> > > > > I0613 00:15:47.585908 1206 slave.cpp:2514] Current usage >> 17.87%. >> >> Max >> >> > > > > allowed age: 5.049103921517002days >> >> > > > > I0613 00:16:47.588693 1207 slave.cpp:2514] Current usage >> 17.97%. >> >> Max >> >> > > > > allowed age: 5.041750330234468days >> >> > > > > I0613 00:17:28.162240 1212 slave.cpp:1896] >> >> > [email protected]:5050 >> >> > > > exited >> >> > > > > W0613 00:17:28.162505 1212 slave.cpp:1899] Master >> disconnected! >> >> > > Waiting >> >> > > > > for a new master to be elected >> >> > > > > I0613 00:17:38.006042 1201 detector.cpp:420] Master detector >> >> > > (slave(1)@ >> >> > > > > 10.17.178.97:5051) found 1 registered masters >> >> > > > > I0613 00:17:38.008129 1201 detector.cpp:467] Master detector >> >> > > (slave(1)@ >> >> > > > > 10.17.178.97:5051) got new master pid: >> [email protected]:5050 >> >> > > > > I0613 00:17:38.008599 1201 slave.cpp:537] New master detected >> at >> >> > > > > [email protected]:5050 >> >> > > > > I0613 00:17:38.009098 1198 status_update_manager.cpp:155] New >> >> master >> >> > > > > detected at [email protected]:5050 >> >> > > > > I0613 00:17:39.059538 1203 slave.cpp:633] Re-registered with >> >> master >> >> > > > > [email protected]:5050 >> >> > > > > I0613 00:17:39.059859 1203 slave.cpp:1294] Updating framework >> >> > > > > 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@ >> >> > > > > 10.17.184.87:57804 >> >> > > > > I0613 00:17:46.057116 1202 detector.cpp:420] Master detector >> >> > > (slave(1)@ >> >> > > > > 10.17.178.97:5051) found 2 registered masters >> >> > > > > I0613 00:17:47.590699 1200 slave.cpp:2514] Current usage >> 18.06%. >> >> Max >> >> > > > > allowed age: 5.036055129997118days >> >> > > > > I0613 00:18:47.592268 1201 slave.cpp:2514] Current usage >> 17.98%. >> >> Max >> >> > > > > allowed age: 5.041545235716157days >> >> > > > > I0613 00:19:47.596873 1204 slave.cpp:2514] Current usage >> 17.86%. >> >> Max >> >> > > > > allowed age: 5.049504905524051days >> >> > > > > I0613 00:20:47.597520 1208 slave.cpp:2514] Current usage >> 17.86%. >> >> Max >> >> > > > > allowed age: 5.049908279608947days >> >> > > > > I0613 00:21:47.598794 1206 slave.cpp:2514] Current usage >> 17.55%. >> >> Max >> >> > > > > allowed age: 5.071801618813565days >> >> > > > > I0613 00:22:47.599805 1202 slave.cpp:2514] Current usage >> 17.56%. >> >> Max >> >> > > > > allowed age: 5.070852822503368days >> >> > > > > I0613 00:23:47.604342 1199 slave.cpp:2514] Current usage >> 17.56%. >> >> Max >> >> > > > > allowed age: 5.070920390650185days >> >> > > > > I0613 00:24:47.605106 1203 slave.cpp:2514] Current usage >> 17.56%. >> >> Max >> >> > > > > allowed age: 5.070920003070000days >> >> > > > > *** Aborted at 1371083126 (unix time) try "date -d >> @1371083126" if >> >> > you >> >> > > > are >> >> > > > > using GNU date *** >> >> > > > > PC: @ 0x7f53c240dd84 __pthread_cond_wait >> >> > > > > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) >> >> from >> >> > PID >> >> > > > > 1033; stack trace: *** >> >> > > > > @ 0x7f53c2411cb0 (unknown) >> >> > > > > @ 0x7f53c240dd84 __pthread_cond_wait >> >> > > > > @ 0x7f53c3088f03 (unknown) >> >> > > > > @ 0x7f53c308961f (unknown) >> >> > > > > @ 0x40c75a (unknown) >> >> > > > > @ 0x7f53c206476d (unknown) >> >> > > > > @ 0x40d511 (unknown) >> >> > > > > I0613 00:25:26.600497 9946 main.cpp:119] Creating "cgroups" >> >> isolator >> >> > > > > I0613 00:25:26.622987 9946 main.cpp:127] Build: 2013-05-09 >> >> 22:53:54 >> >> > by >> >> > > > > I0613 00:25:26.623016 9946 main.cpp:128] Starting Mesos slave >> >> > > > >> >> > > >> >> > >> >> >> > >> > >> > >
