On Tuesday, 20 November 2018 at 15:39 +0100, Michael Scherer wrote:
> Date: 20 November 2018
>
> Participating people:
> - misc
> - obnox
>
> Summary:
>
> Our automated certificate renewal system failed to renew the
> docs.gluster.org certificate, resulting in an expired certificate
> for around 6 hours. Our monitoring system did not detect the
> problem.
>
> Impact:
>
> Some people would have had to accept an insecure certificate to
> read the website.
>
> Root cause:
>
> On the monitoring side, it seems that "something" broke alerting.
> Upon restart and testing, it seems to be working fine now. However,
> the default configuration does not seem to verify that the
> certificate is about to expire; it only checks that port 443 is
> open and that an SSL session can be negotiated.
>
> On the certificate renewal side, everything is covered by Ansible,
> and we do an automated run every night. A manual run didn't show
> any error, so my analysis points toward a failure of the
> automation. Looking at ant-queen, our deploy server, it seems that
> an issue on 2 internal builders (builder1 and builder31) created a
> deadlock when Ansible tried to connect, and for some reason it
> didn't time out. In turn, this resulted in several processes
> waiting on those 2 servers.
>
> Looking at the graph, we can see the problem started around 1 week
> ago:
>
> https://munin.gluster.org/munin/int.rht.gluster.org/builder1.int.rht.gluster.org/users.html
>
> Since our system only triggers a renewal if the certificate is
> going to expire within 1 week, this resulted in the process not
> trying to renew for more than 1 week, and so the certificate
> expired.
>
> A quick look at builder1 and 31 shows that the issue is likely due
> to regression testing. The command 'df' is blocked on builder1, and
> that's usually a sign that something went wrong with the test
> suite. A look at the existing processes hints at the gd2 test
> suite, since etcd2 is still running, and glusterfsd processes too.
>
> Resolution:
> - misc ran the process manually, and the certificate got renewed
> - misc restarted nagios and alerting started to work
> - misc went on a process cleaning spree, unlocking an achievement
>   on Steam by stopping 70 of them in 1 command
>
> What went well:
> - people contacted us
> - only 1 certificate was impacted
>
> When we were lucky:
> - this only impacted docs.gluster.org, and a user workaround
>   existed
>
> What went bad:
> - supervision didn't page anyone
>
> Timeline (in UTC):
>
> 05:00 the certificate expires
> 09:30 misc decides to go to the office
> 09:50 misc arrives at the train station and gets on the train, then
>       connects to irc just in case
> 10:01 obnox pings misc on irc
> 10:02 misc says crap, takes a look, confirms the issue
> 10:05 misc connects to ant-queen, runs the deploy script after
>       checking that the 2 proxies are ok
> 10:07 misc sees that the certificate got renewed, inspects
>       ant-queen, and sees a bunch of processes blocked on 2 servers
> 10:08 entering a tunnel, misc declares the issue fixed and will
>       take another look once in the office
>
> Potential improvements to make:
> - our supervision should check certificate validity (should be
>   easy; a sketch follows below)
> - our supervision should also verify that we do not have something
>   weird on ant-queen (less easy)
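For the certificate validity item, a minimal sketch of such a check
with the stock check_http plugin from monitoring-plugins could look
like the following; the command name, the "generic-service" template
and the 14-day threshold are illustrative, not taken from our actual
configuration:

  # hypothetical Nagios objects: check_http's -C option makes the
  # check go WARNING when the certificate expires in fewer than the
  # given number of days, instead of only testing that an SSL session
  # can be opened on port 443
  define command {
      command_name    check_cert_expiry
      command_line    $USER1$/check_http -H $HOSTADDRESS$ --ssl -C 14
  }

  define service {
      use                     generic-service
      host_name               docs.gluster.org
      service_description     certificate validity
      check_command           check_cert_expiry
  }

With something like that in place, nagios would have started warning
well before the certificate actually expired, instead of staying
silent.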
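For the Ansible hangs, independently of fixing the builders, one
possible mitigation is to make connection attempts give up instead of
piling up on ant-queen. A sketch of the relevant knob, assuming the
standard ansible.cfg on the deploy host (the 30 second value is
arbitrary):

  # hypothetical ansible.cfg excerpt
  [defaults]
  # give up on a connection attempt after 30 seconds instead of
  # hanging forever on a wedged builder; can also be set per run via
  # the ANSIBLE_TIMEOUT environment variable
  timeout = 30

SSH-level keepalives (ServerAliveInterval / ServerAliveCountMax passed
through ssh_args) would additionally let already-established sessions
die when a builder stops responding, at the cost of overriding
Ansible's default ssh_args.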
As I got paged because the load on ant-queen was too high, I think the
ant-queen part is done.

> - whatever caused nagios to fail should be investigated, and
> mitigated

So nagios has been failing with out-of-memory errors since the 6th of
November. While this didn't result in a complete outage, it was likely
broken enough to create some issues. A look at the graph shows that,
starting around the 15th, nagios didn't check anything, since there
was no traffic:

https://munin.gluster.org/munin/int.rht.gluster.org/nagios.int.rht.gluster.org/fw_packets.html

The memory graph shows what looks like a memory leak:

https://munin.gluster.org/munin/int.rht.gluster.org/nagios.int.rht.gluster.org/memory.html

It seems to have started around the beginning of week 43, so around
the 29th of October. The only change regarding packages is tzdata.
However, it could also be a side effect of this commit:

https://github.com/gluster/gluster.org_ansible_configuration/commit/01c7e7120ea1cac27aa6d0cbcdf3da726a59c67c

I am going to investigate whether it can be reverted and see.

> - whatever caused ansible to fail should be investigated, and
> mitigated
> - our gd2 test suite should clean itself in a more reliable way

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
_______________________________________________ Gluster-infra mailing list [email protected] https://lists.gluster.org/mailman/listinfo/gluster-infra
