Please follow up with your notes & corrections, but take off iaep@ and sysad...@gnu.org from Cc to avoid spamming them.
== Incident Timeline ==

[Thu Jan 28 05:03] OOM killer kicks in, killing a bunch of processes
[Thu Jan 28 05:29] OLE Nepal notifies sysad...@sugarlabs.org and bernie.code...@gmail.com of an outage.
[Thu Jan 28 07:42] OOM killer kicks in again
[Thu Jan 28 08:45] Scg notices the outage and pings me via Hangouts
[Thu Jan 28 09:30] I wake up and see scg's ping
[Thu Jan 28 09:47] I respond to OLE, cc'ing all other sysadmins
[Thu Jan 28 12:17] Quidam reboots sunjammer

== Root causes ==

Unknown OOM condition, likely caused by apache serving some query-of-death:

Jan 28 03:07:25 sunjammer kernel: [88262817.489410] apache2 invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
Jan 28 03:07:26 sunjammer kernel: [88262817.489428] apache2 cpuset=/ mems_allowed=0
[...]
Jan 28 03:09:52 sunjammer kernel: [88262818.691465] Out of memory: Kill process 32000 (apache2) score 8 or sacrifice child
Jan 28 03:09:52 sunjammer kernel: [88262818.691473] Killed process 32000 (apache2) total-vm:571328kB, anon-rss:52460kB, file-rss:65036kB
[...keeps going on like this for hours...]
Jan 28 07:42:12 sunjammer kernel: [88279272.739371] apache2 invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 28 07:42:12 sunjammer kernel: [88279272.739390] apache2 cpuset=/ mems_allowed=0
Jan 28 07:42:12 sunjammer kernel: [88279272.739397] Pid: 4835, comm: apache2 Tainted: G D 3.0.0-32-virtual #51~lucid1-Ubuntu

== What went wrong ==

- The primary sysadmin contact sysad...@sugarlabs.org was non-functional
- We couldn't contact the FSF sysadmins promptly
- It took us several hours to get the machine back online
- sunjammer was still up, but too unresponsive to ssh in

== What worked ==

- Scg noticed the outage quickly and responded
- OLE reached me via gmail -> develer.com forwarder (pure luck, I usually don't check my personal email before leaving for work)
- sunjammer stayed up continuously for over 1000 days
- sunjammer still boots correctly... at least now we know :-)
- Communication between us kept working via side-channels
- The Linux OOM killer did its job ;-)

== Action Items ==

- Continue moving web services to Docker containers *WITH HARD MEMORY BOUNDS*
- Ask FSF to (re-)enable XEN console for sunjammer
- Ask FSF for an on-call contact
- (maybe) Move monitoring to a smaller container
- Publish phone/email emergency contacts that page core sysadmins independently of all SL infrastructure.
- (maybe) Disable swap to prevent excessive I/O from slowing down sunjammer to the point of timing out ssh connections
- Work with FSF sysadmins to figure out why I/O is so slow on sunjammer. A simple "sync" can take several seconds even though there isn't much disk activity.

--
Bernie Innocenti
Sugar Labs Infrastructure Team
http://wiki.sugarlabs.org/go/Infrastructure_Team

_______________________________________________
IAEP -- It's An Education Project (not a laptop project!)
IAEP@lists.sugarlabs.org
http://lists.sugarlabs.org/listinfo/iaep
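
P.S. For anyone digging through the kernel log to follow up on the root cause, here is a minimal sketch of how the "Killed process" lines quoted under Root causes could be tallied per command name. The function name summarize_oom_kills and the regular expression are illustrative assumptions, not an existing tool, and the pattern only covers the line format shown in the excerpt above; other kernel versions may print different fields.

```python
import re
from collections import defaultdict

# Matches lines shaped like the one quoted under "Root causes", e.g.:
#   ... Killed process 32000 (apache2) total-vm:571328kB, anon-rss:52460kB, file-rss:65036kB
# Assumption: only this exact field layout is handled.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, "
    r"anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def summarize_oom_kills(lines):
    """Tally OOM kills per command name.

    Returns {command: {"kills": n, "anon_rss_kb": sum, "file_rss_kb": sum}}.
    Lines that are not "Killed process" records are skipped.
    """
    summary = defaultdict(
        lambda: {"kills": 0, "anon_rss_kb": 0, "file_rss_kb": 0}
    )
    for line in lines:
        match = KILLED_RE.search(line)
        if match is None:
            continue  # e.g. "invoked oom-killer" or "[...]" lines
        entry = summary[match.group("comm")]
        entry["kills"] += 1
        entry["anon_rss_kb"] += int(match.group("anon_rss"))
        entry["file_rss_kb"] += int(match.group("file_rss"))
    return dict(summary)
```

The function accepts any iterable of lines, so it could be fed an open file handle for a saved copy of the kernel log. Where that log lives on sunjammer is something I have not checked, so treat any path as an assumption.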