[Gluster-infra] Gluster mailman outage postmortem

Michael Scherer Mon, 12 Mar 2018 06:01:54 -0700

lists.gluster.org outage

Date: 2018-03-11


Participating people:
 - misc
 - amye

Summary:
The disk on supercolony.gluster.org was full, so mailman was down.
https://bugzilla.redhat.com/show_bug.cgi?id=1554176


Impact:
- users of the Gluster mailing lists

Root cause:

The single partition of the server was full. 

This was most likely due to our WAF (mod_security) logs taking a lot of
space (around 1.5G per week), caused by an uptick in bots scanning the
server and hitting the old wordpress blog. The old wordpress blog wasn't
supposed to be exposed anymore, but its configuration was still there
since ansible never removes files, and requests to the bare IP are
served by the first vhost encountered, which was "blog.gluster.org". As
a result, requests to the /xmlrpc.php URL were triggering alerts on the
WAF, and mod_sec is quite verbose.
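
To illustrate the vhost behaviour described above: with name-based
virtual hosts, Apache httpd serves any request whose Host header matches
no ServerName or ServerAlias (a bare IP, for instance) from the first
vhost it loaded for that address and port. A minimal sketch of an
explicit catch-all follows. The file name and the ServerName are
hypothetical and are not part of the actual supercolony setup.

```apache
# Hypothetical file: /etc/httpd/conf.d/00-default.conf
# The name is an assumption, chosen so that it sorts (and is loaded)
# before any other vhost file pulled in by a wildcard Include.
<VirtualHost *:80>
    # Placeholder name; it only needs to differ from every real site.
    ServerName default.invalid
    # Answer 404 to everything instead of exposing a real site.
    Redirect 404 /
</VirtualHost>
```

With something like this in place, removing or adding a real vhost
would no longer change which site answers on the bare IP.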

Resolution:

- yum cache was cleaned to get back 600M as an emergency measure

  # yum clean all

- logs from mod_sec were compressed using gzip, going from ~1.5G each
to 40M.

  # for i in /var/log/httpd/modsec_audit.log-2* ; do gzip "$i" ; done

- blog.gluster.org vhost config was removed

  # rm -f /etc/httpd/conf/blog.gluster.org* ; service httpd restart 
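
The manual gzip step could be made permanent by letting logrotate
compress rotated logs. The fragment below is only a sketch, not the
configuration deployed on supercolony: the file path, the rotation
count and the reload command are assumptions modelled on a stock
RHEL/CentOS httpd logrotate file, with the compress directive added.

```
# Hypothetical /etc/logrotate.d/httpd (values are assumptions)
/var/log/httpd/*log {
    weekly
    rotate 8
    missingok
    notifempty
    sharedscripts
    # gzip rotated logs, but leave the most recent one uncompressed
    # in case httpd is still writing to it
    compress
    delaycompress
    postrotate
        /sbin/service httpd reload > /dev/null 2>/dev/null || true
    endscript
}
```

Going by the figures above (~1.5G down to 40M per weekly log), keeping
a few weeks of compressed audit logs would cost far less than a single
uncompressed one.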

Lessons learned:

- what went well:
  - a bug was filed
  - the root cause was quickly identified and fixed

- where we were lucky:
  - misc was awake and connected to internal irc on a weekend night

- what went badly:
  - no monitoring
  - bad partition setup
  - bad cleanup of httpd configuration

Timeline (in UTC)

- 2018-03-12  01:11  amye pings misc on internal irc and the internal
channel with https://bugzilla.redhat.com/show_bug.cgi?id=1554176
- 2018-03-12  01:13  misc diagnoses the issue as "disk full"
- 2018-03-12  01:17  misc frees 600M while waiting for du -sh to finish
- 2018-03-12  01:22  misc pinpoints the issue to the WAF and compresses
the logs for further examination
- 2018-03-12  01:24  misc notices the wordpress exposure issue and
removes the vhost from the config
- 2018-03-12 


Potential improvement to make:

- we need to install better monitoring

- the pattern of having 1 big server for everything should be changed,
as this leads to problems with cleanup, and the lack of separation means
we have a single failure domain (so an issue on a legacy system impacts
the prod system).
  - split the duties of supercolony across separate VMs
  - move it to the cage

- httpd logs should be rotated _and_ compressed.

- people shouldn't work on weekends

- reconsider mod_security usage on that server
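
As a starting point for the monitoring item above, here is a minimal
sketch of a disk usage check that could run from cron until a real
monitoring system is in place. The 90% threshold and the function names
are assumptions made for illustration; nothing like this existed on the
server at the time of the outage.

```shell
#!/bin/sh
# Hypothetical alert threshold in percent (not from the original report).
THRESHOLD=${THRESHOLD:-90}

# Read "df -P" output on stdin and print the use percentage of the
# first filesystem listed, without the trailing % sign.
pct_from_df() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage $1 (percent) is strictly below threshold $2.
below_threshold() {
    [ "$1" -lt "$2" ]
}

# Report on the filesystem holding path $1; return 1 if it is too full.
check_disk() {
    pct=$(df -P "$1" | pct_from_df)
    if below_threshold "$pct" "$THRESHOLD"; then
        echo "OK: $1 is ${pct}% full"
    else
        echo "WARNING: $1 is ${pct}% full (threshold ${THRESHOLD}%)"
        return 1
    fi
}

# Example run against the root filesystem. The "|| true" only keeps this
# demo from aborting under "set -e"; a real check would propagate the
# exit status so that cron or a monitoring agent can act on it.
check_disk / || true
```

On a server with a single partition, as was the case here, checking /
is enough; with a split layout, /var/log would deserve its own check.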

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
