Date: 14 June 2019

Participating:
- misc
- deepshika

Summary:
--------

People started to report HTTP issues on review.gluster.org while our monitoring stayed silent (monitoring kept spamming me during the night about the download server being almost full following 1719388, so I know it was working). A quick investigation showed this was due to the DNS record returning 2 entries, which resulted in round robin between the wrong server and the right one.

Timeline (in UTC):
-------------------

- 2019-05-08: misc goes on vacation
- 2019-05-24: RH IT contacts misc (and others), saying that mails with a return address of "rev...@review.gluster.org" clog the queues of their SMTP servers. Folks receive the mails, the server says "this person no longer works here" and tries to send a bounce back, this doesn't work, and it fills the queue of the MX. As a few people left RH in the last 6 months, and some were likely getting all notifications, this did create a problem for them. Postfix is heavily I/O bound (all communication between the dozens of daemons is done using queues on disk, synced for reliability), so filling the queue heavily impacts the operation of a MX, slowing it down, resulting in bigger queues, etc.
- 2019-05-27: Deepshika and Duck try to fix this, not understanding why the email is not working, or how it was supposed to work (spoiler, it was never supposed to work). They conclude with "too weird, we need to wait".
- 2019-06-12: misc is back from vacation, sees his mails, prioritizes them and explains that rev...@review.gluster.org wasn't supposed to be working, which is why people found nothing, and that this was just the default setting of Gerrit.
- 2019-06-13: misc decides to set up a MX for the review.gluster.org domain to drop all incoming emails, solving the bounce issue for IT. See the ansible repo commits for how that's done.
  - 16:57: misc adds a MX record for review.gluster.org pointing to the supercolony IP address, after adding the code to route the whole domain to /dev/null, then waits a bit to see that nothing broke and goes home (assuming monitoring would scream during the evening if anything happened). The diff for the DNS change is shown later[1].
  - 23:00: seeing that monitoring didn't scream, misc decides to go to bed and sleep.
- 2019-06-14: folks start to report an outage as India folks start their day
  - 04:17: bug 1720453 is opened
  - 05:31: Deepshika correctly diagnoses the DNS issue, sees that it was related to the last change, and tries to contact misc on Telegram
  - 07:50: misc wakes up, sees his phone blinking, answers the messages
  - 08:10: misc checks various things, reaches the same conclusion as Deepshika, proposes a workaround
  - 08:13: after squinting hard at the diff, misc finally finds something that could be the cause
  - 08:14: a commit is pushed (again, see the end)
  - 08:15: the DNS record is verified, and seems to be fixed
  - 08:50: coffee is poured in a mug in misc's flat, and this post mortem is written

Impact:
-------

review.gluster.org was only randomly reachable for some people for a few hours. I suspect the cage wasn't affected thanks to DNS caching, but some jobs might have been affected.

The gluster.org top domain might have been impacted too, but I am not sure how (the MX was in place, DNS too, and we do not use gluster.org directly anywhere, plus I think there is some fallback and cache), and nobody reported anything (and the monitoring didn't scream either).

Root cause:
-----------

The DNS entry was wrong: it returned 2 IP addresses while it should have returned a single one. But the exact behavior was (IMHO) quite subtle, as people will see now.

The initial DNS diff was this:

--- a/prod/external-default/gluster.org
+++ b/prod/external-default/gluster.org
@@ -1,6 +1,6 @@
 $TTL 300
 @ IN SOA ns1.redhat.com. noc.redhat.com. (
-               2019040301 ; Serial
+               2019061301 ; Serial
                3600 ; Refresh
                1800 ; Retry
                604800 ; Expire
@@ -12,6 +12,7 @@ $TTL 300
                IN NS ns3.redhat.com.
 ;
                IN MX 10 mx2.gluster.org.
+review         IN MX 10 mx2.gluster.org.
 ;build         IN MX 10 mx1.gluster.org.
@@ -34,7 +35,6 @@ lists IN CNAME supercolony.rht
 git            IN CNAME gerrit.rht
 patches        IN CNAME gerrit.rht
-review         IN CNAME gerrit.rht
 gerrit         IN CNAME gerrit.rht
 gerrit-new.rht IN CNAME gerrit.rht
@@ -60,6 +60,8 @@ _kerberos-master._udp SRV 0 0 88 freeipa.gluster.org.
 _kerberos-master._tcp SRV 0 0 88 freeipa.gluster.org.
 postgresql.rht IN A 8.43.85.170
+review         IN A 8.43.85.171
 gerrit.rht     IN A 8.43.85.171
 ; testVM for the switch to nftable
 chrono.rht     IN A 8.43.85.172

At first look, any sysadmin will likely say this seems correct: converting review to an A record (because MX and CNAME can't coexist, and I couldn't push that due to the zone syntax check on commit) and adding a MX record.

I assume the reader does not see what is wrong with this one (no more than I did when I wrote it yesterday and checked my change), and to be fair, what is wrong is not visible in the diff.

The fix was this (edited for readability):

--- a/prod/external-default/gluster.org
+++ b/prod/external-default/gluster.org
@@ -1,6 +1,6 @@
 $TTL 300
 @ IN SOA ns1.redhat.com. noc.redhat.com. (
-               2019061301 ; Serial
+               2019061401 ; Serial
                3600 ; Refresh
                1800 ; Retry
                604800 ; Expire
@@ -10,18 +10,19 @@ $TTL 300
                IN NS ns1.redhat.com.
                IN NS ns2.redhat.com.
                IN NS ns3.redhat.com.
+               IN A 8.43.85.176
 ;
                IN MX 10 mx2.gluster.org.
 review         IN MX 10 mx2.gluster.org.
-               IN A 8.43.85.176
 ; RH DC
 mx2            IN A 8.43.85.176

It turns out that, contrary to what I believed, the zone file format is not one where each line is fully separate and where order does not matter (there is $ORIGIN, etc).
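
As a rough illustration of why order matters, here is a minimal Python sketch of the owner-name rule from the master file format (RFC 1035, section 5.1): a record line that begins with whitespace is owned by the last stated owner. This is not how BIND parses zones. The helper name resolve_owners is hypothetical, the parsing is heavily simplified (one record per line, no parentheses, no $ORIGIN handling), and the leading "@" in the sample data is an assumption added so the snippets stand alone (in the real file the owner comes from the SOA line above).

```python
# Minimal sketch of the "blank owner inherits the previous owner" rule.
# Assumptions of this sketch: one record per line, no parentheses,
# no $ORIGIN/$INCLUDE handling. resolve_owners is a hypothetical helper.

def resolve_owners(zone_text, origin):
    """Return a list of (owner, record) pairs with owners made explicit."""
    records = []
    last_owner = origin
    for raw in zone_text.splitlines():
        line = raw.split(";", 1)[0].rstrip()   # strip comments
        if not line.strip() or line.startswith("$"):
            continue                            # skip blanks and directives
        fields = line.split()
        if line[0] in " \t":
            owner = last_owner                  # blank owner: inherit
        else:
            name = fields.pop(0)
            if name == "@":
                owner = origin
            elif name.endswith("."):
                owner = name                    # already fully qualified
            else:
                owner = name + "." + origin     # relative to the origin
        last_owner = owner
        records.append((owner, " ".join(fields)))
    return records


# Simplified excerpts; the "@" on the first line is added for this sketch.
BEFORE = """\
@       IN NS ns3.redhat.com.
        IN MX 10 mx2.gluster.org.
        IN A 8.43.85.176
mx2     IN A 8.43.85.176
"""

AFTER = """\
@       IN NS ns3.redhat.com.
        IN MX 10 mx2.gluster.org.
review  IN MX 10 mx2.gluster.org.
        IN A 8.43.85.176
mx2     IN A 8.43.85.176
"""

for label, text in (("before", BEFORE), ("after", AFTER)):
    print(label)
    for owner, record in resolve_owners(text, "gluster.org."):
        print("  %-22s %s" % (owner, record))
```

With BEFORE, the nameless "IN A 8.43.85.176" line is owned by gluster.org.; with AFTER, the very same line ends up owned by review.gluster.org., although the line itself did not change. The same walk could probably back a lint that warns when a nameless record follows a record whose owner is not the apex.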

When you add a record and give no name (the first word on the line), it doesn't use the domain name (that's the role of "@" or $ORIGIN), but inherits the name of the previous record (see https://en.wikipedia.org/wiki/Zone_file).

So far, this had the same effect for the gluster.org zone file, because every record without an explicit name (the first field) was at the start, and the first record is the domain name. But it all changed once I added the MX, because this went from (edited to remove spaces and comments, and to make the issue obvious and visible):

               IN NS ns3.redhat.com.
               IN MX 10 mx2.gluster.org.
               IN A 8.43.85.176
mx2            IN A 8.43.85.176

to:

               IN NS ns3.redhat.com.
               IN MX 10 mx2.gluster.org.
review         IN MX 10 mx2.gluster.org.
               IN A 8.43.85.176
mx2            IN A 8.43.85.176

Which, using that presentation and indentation, kinda hints that there is a problem. I always thought that the indentation was mostly cosmetic, and that the format (unlike Python) does not require it. Turns out there is more to it.

The first commit placed the MX record in the wrong place, which changed the meaning of the following line (the one that was outside of the diff). This resulted in review.gluster.org having a 2nd A record (for 8.43.85.176, supercolony), stealing the one of the apex domain (or top or naked domain).

That is the exact same issue as https://lists.gluster.org/pipermail/gluster-infra/2018-August/004905.html (the DNS one). Except that back then, I never found the problem.

Resolution:
-----------

- DNS got fixed

What went well:
---------------

- not much, I was just lucky to find the issue. It was the 2nd time I looked, and last time I didn't find it. I guess what went well is that it didn't go worse.

When we were lucky:
-------------------

- I didn't oversleep too much, and wasn't more jetlagged from vacation[2]
- the issue was found quickly, which is close to a miracle given I had just woken up, and that I didn't find it back in August 2018.

What went bad:
--------------

- monitoring didn't alert about anything.
  Given DNS propagation, it should have alerted me during the evening if something had happened, or so I thought.
- DNS automated verification didn't catch that, because the zone was valid.
- manual verification didn't yield an error. Not sure why this worked from my side of the world every time.

To do:
------

- contact the Holy See to get that certified as a "miracle". I am not a morning person.
- try to understand why monitoring failed to see that something failed.
- now that we fixed the issue, go back to the change from August that caused the issue the first time and apply it again (routing build.gluster.org to /dev/null). That work was already ongoing yesterday, not pushed because it was late.

Notes
-----

[1] yes, this post mortem follows the Chekhov's gun principle.
[2] yes, that's not much from a luck perspective. But I did manage to sleep around 16h after taking the plane last week, and it took me a while to adjust.

-- 
Michael Scherer
Sysadmin, Community Infrastructure
_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra