Hi, here is yet another postmortem (because I feel like it).
Build.gluster.org not sending mail

Date: 2016-04-27

Participating people:
- misc

Summary: the IP address used by the Jenkins server of Gluster (build.gluster.org) ended up on a DNS blacklist. This prevented it from sending mail to the mailing list server (supercolony.gluster.org), among others.

Impact:
- new releases were not announced to the maintainers and Amye
- mail notifications might not have been received

Root cause: not found at the moment.
Investigation showed that the IP address was listed in SBL, which pulled it from CBL (another blacklist). However, none of the mail sent by Jenkins seems to have triggered the listing. Looking further, we found that the IP address assigned to build.g.o (66.187.224.203) is different from the one used for outgoing connections (66.187.224.184). This is caused by an asymmetric NAT setup on the firewall in the DC. Since the listing could have been caused by malware on any server behind that shared IP address, infosec was notified of the problem.

Resolution: the immediate fix was to remove SBL from the list of blacklists used by supercolony, which was done by a commit on that file (https://github.com/gluster/gluster.org_salt_pillar/blob/master/smtp_blacklist.sls). Mail from Jenkins should therefore be delivered again.

Lessons learned:
- what went well:
  - someone noticed that mail was not being received and notified the admins
- where we were lucky:
  - not much critical mail traffic comes from Jenkins
  - it failed during EMEA business hours, with misc "idle" and watching IRC while on PTO
- what went bad:
  - we do not have proper monitoring for this kind of issue
  - there are no details on how the server was added to the list

Timeline (in UTC):
14 May 2016
- 09:53 first message in the log about being on the blacklist
15 May 2016
- 12:36 ndevos pings misc on IRC (#gluster-dev) about the problem
- 12:38 misc finds that 66.187.224.184 is in sbl-xbl.spamhaus.org
- 12:43 misc finds that the underlying listing comes from CBL, with "It was last detected at 2016-05-15 00:00 GMT"
- 12:51 misc remembers that the IP is shared, so it is normal not to find anything on the Jenkins server
- 13:00 infosec is notified (INC0401121)
- 13:02 commit d552601 removes the DNS blacklist from supercolony
- 20:00 infosec investigates and reports that there is no sensor on that link
16 May 2016
- 06:30 postmortem is sent

Potential improvements:
- add monitoring for this:
  - check the logs for errors
  - add it to monitoring (either on the Gluster side once we have it, or on the IT side)
- whitelist the Gluster server IPs in Postfix
- get a separate IP for the server
- proper infosec monitoring, like the others

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
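The blacklist lookup from the timeline (12:38) and the first improvement item (monitoring) can be combined into one small check. Below is a minimal sketch in Python, assuming it runs from cron or a Nagios-style agent on a host that resolves through a local resolver. It follows the usual DNSBL convention (RFC 5782): reverse the IPv4 octets, append the blacklist zone, and treat any A record in the answer as "listed". The script, its function names and the exit codes are hypothetical illustrations, not part of the Gluster infrastructure.

```python
import socket

# Hypothetical values for illustration: the outgoing NAT address from this
# postmortem and the Spamhaus zone queried during the investigation.
OUTGOING_IP = "66.187.224.184"
DNSBL_ZONE = "sbl-xbl.spamhaus.org"


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and 0 <= int(o) <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with an A record for the IP.

    Assumption: in this sketch any resolution failure (socket.gaierror) is
    treated as "not listed"; a real check should tell NXDOMAIN apart from
    resolver errors, and should inspect the returned 127.0.0.x code.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True


def main() -> int:
    """Hypothetical monitoring entry point with Nagios-style exit codes."""
    if is_listed(OUTGOING_IP, DNSBL_ZONE):
        print(f"CRITICAL: {OUTGOING_IP} is listed in {DNSBL_ZONE}")
        return 2
    print(f"OK: {OUTGOING_IP} is not listed in {DNSBL_ZONE}")
    return 0
```

Wired into cron or a monitoring agent as `sys.exit(main())`, a listing of the shared outgoing address would raise an alert instead of waiting for someone to notice missing mail. Note that the address to check is the outgoing one (66.187.224.184), not the one assigned to build.g.o.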
_______________________________________________ Gluster-infra mailing list [email protected] http://www.gluster.org/mailman/listinfo/gluster-infra
