Hi,

So, as of yesterday, the docs.gluster.org certificate had expired.

Date: 2018-06-11

Participating people:
- misc
- nigel

Summary:

The docs.gluster.org certificate expired because an automation error
prevented its renewal. Since the certificate was issued by Let's Encrypt,
it expires after 3 months and should have been renewed yesterday, but
wasn't.

Impact:
- people had to accept an expired certificate for docs.gluster.org

Root cause:

So, the investigation showed multiple root causes.

docs.gluster.org was using the new proxy system (see
http://lists.gluster.org/pipermail/gluster-infra/2018-February/004284.html ).

The automation was using an ansible module (openssl_certificate) that
does not seem to take certificate renewal into account, at least not by
default. As the module was already fixed in the past (see
https://github.com/ansible/ansible/commits/devel/lib/ansible/modules/crypto/openssl_certificate.py ),
this was an oversight on my side after reading the code. However, that's
also one of the reasons for not migrating too much yet and seeing how
renewal works first. And since this requires an ACME server, this was
quite hard to add to the ansible CI in the first place.

Short answer: it doesn't work, and I will submit a patch upstream for
that.

On top of that, a few bugs were found in ansible that caused problems:

- openssl_certificate uses acme-tiny, and by default acme-tiny didn't
  download the intermediate certificate. A quick read of the acme-tiny
  help showed a --chain option, so a patch was sent to ansible for that:
  https://github.com/ansible/ansible/pull/35144

  It turns out that the --chain option was a downstream patch in the
  package, which got removed (not deprecated) in the latest rpm version:
  https://src.fedoraproject.org/rpms/acme-tiny/c/ecd867acdf5380ade6874c160e8a00ce14d3f8ba?branch=master

  So, as an immediate consequence, the deployment of a new certificate
  failed with an error.

- openssl_certificate does not properly handle an acme-tiny failure,
  since it seems to create a file nonetheless and consider it OK.
  I didn't investigate that much, but that's also something to be fixed
  upstream.

- nginx does not detect the creation of a new certificate, so an explicit
  restart/reload needs to be added to our playbooks when certificates are
  renewed (once that's done in the module, of course). This doesn't happen
  on the initial creation, so it wasn't seen earlier.

Resolution:
- screaming to relieve the existential crisis upon realising that the
  breakage waited for Monday to happen, on my day back at work
- renewed the certificate semi-manually in the meantime
- pushed
  https://github.com/gluster/gluster.org_ansible_configuration/commit/fb8655c8d07948d2362d5e9213de001399bde06e
  as a workaround for now
- opened https://github.com/ansible/ansible/issues/41396 to get stuff
  fixed upstream

What went well:
- only the docs website was impacted, and it was noticed quite fast
- it could be worked around by users

When we were lucky:
- I was back in the office, awake enough, and my phone wasn't out of
  battery.

What went bad:
- still no supervision for that on the gluster side
- no one seems to have notified us
- we can't really count on Fedora policy being applied (either there or
  on EPEL), which is kinda making me sad

Timeline (in UTC):

11 June 2018 certificate expires

11h39: nigelb pings misc on IRC. Misc is out for lunch, so he does not
get the message.

11h44: nigelb pings misc on Telegram. Misc rushes back to his desk, puts
on proper music[1] and starts to look at it while sipping his coffee.

11h53: the issue is diagnosed: acme-tiny does not renew the certificate
for some reason. A first fix is tried: remove the certificate and restart
the ansible deployment (2 to 3 minutes each, kinda anticlimactic). It
fails, with an error related to --chain. Misc remembers this was
something that had been fixed before.

11h56: another attempt is made with the ansible devel branch, assuming
some bugfixes weren't backported to the stable branch.

11h59: it still fails. A quick hack is done to push

12h02: acme does not deploy anything, because the files were already
there.
So the certificates are removed again, and the deployment is restarted.

12h04: the certificates are created. Nginx didn't get restarted, so a
restart is done on proxy01 and proxy02. Stuff is back.

Potential improvements to make:
- fix all the stuff that needs fixing
- cancel Monday for the rest of the week

[1] in this case, this was the Hacknet OST. Great game, I recommend it.

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
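
Illustration for the "still no supervision" item under "What went bad":
a minimal sketch of what an expiry check could look like, using only the
Python standard library. This is not something the gluster infra runs.
The function names, the 30-day threshold and the OK/WARNING/CRITICAL
wording are all assumptions made for the example.

```python
import socket
import ssl
from datetime import datetime, timezone

# Assumption: warn 30 days ahead. Not a value taken from the report.
RENEW_THRESHOLD_DAYS = 30


def days_left(not_after: str, now: datetime) -> float:
    """Days between `now` and `not_after`.

    `not_after` uses the format returned by getpeercert(), for example
    "Jun 10 12:00:00 2018 GMT". `now` must be timezone-aware.
    A negative result means the certificate has already expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def needs_renewal(not_after: str, now: datetime,
                  threshold_days: int = RENEW_THRESHOLD_DAYS) -> bool:
    """True when fewer than `threshold_days` days of validity remain."""
    return days_left(not_after, now) < threshold_days


def fetch_not_after(host: str, port: int = 443,
                    timeout: float = 10.0) -> str:
    """Return the notAfter field of the certificate served by host:port.

    Verification stays enabled, so an expired or otherwise invalid
    certificate raises ssl.SSLCertVerificationError during the handshake.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str) -> str:
    """One status line per host, in a hypothetical monitoring format."""
    try:
        not_after = fetch_not_after(host)
    except ssl.SSLCertVerificationError as err:
        return f"CRITICAL {host}: {err.verify_message}"
    now = datetime.now(timezone.utc)
    state = "WARNING" if needs_renewal(not_after, now) else "OK"
    return f"{state} {host}: {days_left(not_after, now):.1f} days left"
```

Running something like print(check("docs.gluster.org")) from cron or a
monitoring system would flag a certificate well before it expires,
whatever tool is supposed to renew it. Network errors (DNS, refused
connection) are deliberately left uncaught in this sketch.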
_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra