lists.gluster.org outage

Date: 2018-03-11

Participating people:
- misc
- amye

Summary:

The disk on supercolony.gluster.org was full, so mailman was down.
https://bugzilla.redhat.com/show_bug.cgi?id=1554176

Impact:
- gluster lists users

Root cause:

The single partition of the server was full. This was most likely due to
our WAF (mod_security) logs taking a ton of space (around 1.5G for each
week), caused by an uptick in bot scanning and by the old wordpress blog
being scanned.

The old wordpress blog wasn't supposed to be exposed anymore, but the
configuration for it was still there since ansible never removes files.
Requests to the bare IP are served by the first vhost encountered, which
was "blog.gluster.org". As a result, requests to the /xmlrpc.php url were
triggering alerts on the WAF, and mod_security is rather verbose.

Resolution:
- yum cache was cleaned to get back 600M as an emergency measure
  # yum clean all
- logs from mod_security were compressed using gzip, going from ~1.5G each
  to 40M
  # for i in /var/log/httpd/modsec_audit.log-2* ; do gzip $i ; done
- blog.gluster.org vhost config was removed
  # rm -f /etc/httpd/conf/blog.gluster.org* ; service httpd restart

Lessons learned:
- what went well:
  - a bug was filed
  - the root cause was quickly identified and fixed
- where we were lucky:
  - misc was awake and connected on internal irc on the weekend night
- what went badly:
  - no monitoring
  - bad partition setup
  - bad cleanup of httpd configuration

Timeline (in UTC):
- 2018-03-12 01:11 amye pings misc on internal irc and internal channel
  with https://bugzilla.redhat.com/show_bug.cgi?id=1554176
- 2018-03-12 01:13 misc diagnoses the issue as "disk full"
- 2018-03-12 01:17 misc frees 600M while waiting on du -sh to finish
- 2018-03-12 01:22 misc pinpoints the issue on the WAF and compresses the
  logs for further examination
- 2018-03-12 01:24 misc notices the wordpress exposure issue and removes
  the vhost from the config
- 2018-03-12

Potential improvements to make:
- we need to install better monitoring
- the pattern of having 1 big server for everything should be changed, as
  it leads to problems on cleanup, and the lack of separation means we have
  a single domain of failure (so an issue on a legacy system impacts the
  prod system)
  - split the duties of supercolony across separate VMs
  - move it to the cage
- httpd logs should be rotated _and_ compressed
- people shouldn't work on weekends
- reconsider mod_security usage on that server

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
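
The "no monitoring" item above could be covered, at its simplest, by a
periodic disk usage check. What follows is only a minimal sketch, not
what was deployed: the function name over_threshold and the 90% limit are
hypothetical, and it assumes POSIX `df -P` output with mount points that
contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): print every mount point
# whose usage is at or above the given percentage.
# It reads `df -P` style output on stdin, so it can be exercised with
# canned input instead of the real disks.
over_threshold() {
    threshold="$1"
    awk -v limit="$threshold" 'NR > 1 {
        pct = $5              # "Capacity" column, e.g. "98%"
        sub(/%/, "", pct)     # strip the percent sign
        if (pct + 0 >= limit) print $6 " " pct "%"
    }'
}

# Example use from cron; the limit of 90 is an assumption:
# df -P | over_threshold 90
```

Run from cron, any output could be mailed to the admins, which would have
flagged the partition before it was completely full.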
