Re: [dns-operations] anchors.atlas.ripe.net/ripe.net - DNSSEC bogus due expiration
> On 3 Nov 2023, at 02:18, Viktor Dukhovni wrote:
>
> On Thu, Nov 02, 2023 at 09:34:17AM +0100, Stephane Bortzmeyer wrote:
>
>>> Specifically, in the case of signed zones, monitoring MUST also include
>>> regular checks of the remaining expiration time of at least the core
>>> zone apex records (DNSKEY, SOA and NS), and ideally the whole zone, both
>>> on the primary server and the secondaries.
>>
>> Indeed. If you use Nagios or compatible (such as Icinga), I recommend
>> this plugin for signatures monitoring:
>>
>> http://dns.measurement-factory.com/tools/nagios-plugins/check_zone_rrsig_expiration.html
>
> I wonder whether the widely used authoritative servers could do more
> to help?
>
> For example, BIND loads zone data into memory. It should be able to
> know the time of the soonest signature expiration for a zone, or at
> least (if not loading the whole zone into memory) the soonest expiration
> time of recently queried records.

When you let named perform the signing it does just that. The RRSIGs are
in a heap. We look at the earliest expiration and figure out when it is
due to be re-signed (this could be in the past if the server was offline
for a while). We set a timer. When that timer expires we re-sign that
RRset plus several more, along with an updated SOA record, re-adding them
to the heap. We then set a timer for the next batch. If the primary has
been down too long and they have all expired, the entire zone will be
signed this way when the primary starts up.

> There could be a new "rndc" protocol verb that asks the nameserver for a
> list of all the zones where the soonest expiration time is below some
> threshold, or asks about a particular zone.
>
> Of course in that case the monitoring agent would be in a "privileged"
> position to query the nameserver's internal control plane, rather than
> having to send queries through "the front door".
>
> Both kinds of monitoring are likely important, but more visibility via
> the control plane may be able to offer a precise/timely view.
>
> - Check each nameserver's control plane.
> - Check as much of the zone as possible.
> - Check each nameserver VIP over each supported protocol
>   (UDP, TCP, DoT, DoQ, ...)
> - From multiple vantage points if possible/applicable when
>   service is on anycast IPs.
>
> Perhaps, through OARC, support development of monitoring plugins that
> many operators can use?
>
> If, after all the past incidents, minor and not so minor, operators
> still aren't doing adequate monitoring, perhaps we (the software
> and standards developers) haven't given them adequate tools?
>
> --
> Viktor.

--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org

___
dns-operations mailing list
dns-operations@lists.dns-oarc.net
https://lists.dns-oarc.net/mailman/listinfo/dns-operations
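The re-signing loop described in the reply above (RRSIGs kept in a heap ordered by expiration, a timer for the earliest one, re-signing in batches) can be illustrated with a small Python sketch. This is not BIND's code: the names SignedRRset, resign_due and run_resign_timer, the refresh_margin and batch_size parameters, and the way a "new signature" is modelled as a new expiration time are all assumptions made for illustration, and the SOA update is left out.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class SignedRRset:
    """Hypothetical stand-in for a signed RRset; ordered by expiration only."""
    expiration: int                      # RRSIG expiration, Unix seconds
    name: str = field(compare=False)
    rrtype: str = field(compare=False)


def resign_due(rrset, refresh_margin):
    """Time at which the RRset should be re-signed (may lie in the past)."""
    return rrset.expiration - refresh_margin


def run_resign_timer(heap, now, lifetime, refresh_margin, batch_size):
    """If the earliest RRset is due, re-sign it plus the next few in the heap.

    Returns (resigned, next_timer). next_timer is the time for the next
    batch, or None if the heap is empty.
    """
    resigned = []
    if heap and resign_due(heap[0], refresh_margin) <= now:
        while heap and len(resigned) < batch_size:
            rrset = heapq.heappop(heap)
            rrset.expiration = now + lifetime   # stand-in for a fresh RRSIG
            resigned.append(rrset)
        # Re-add only after the batch is complete, so that no RRset is
        # signed twice in the same batch.
        for rrset in resigned:
            heapq.heappush(heap, rrset)
    next_timer = resign_due(heap[0], refresh_margin) if heap else None
    return resigned, next_timer
```

Calling run_resign_timer repeatedly with now set to the returned next_timer walks through the whole zone; if the server was offline long enough that every entry is overdue, each call fires immediately and the zone ends up fully re-signed, which matches the start-up behaviour described above.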
Re: [dns-operations] anchors.atlas.ripe.net/ripe.net - DNSSEC bogus due expiration
On Thu, Nov 02, 2023 at 09:34:17AM +0100, Stephane Bortzmeyer wrote:

> > Specifically, in the case of signed zones, monitoring MUST also include
> > regular checks of the remaining expiration time of at least the core
> > zone apex records (DNSKEY, SOA and NS), and ideally the whole zone, both
> > on the primary server and the secondaries.
>
> Indeed. If you use Nagios or compatible (such as Icinga), I recommend
> this plugin for signatures monitoring:
>
> http://dns.measurement-factory.com/tools/nagios-plugins/check_zone_rrsig_expiration.html

I wonder whether the widely used authoritative servers could do more
to help?

For example, BIND loads zone data into memory. It should be able to
know the time of the soonest signature expiration for a zone, or at
least (if not loading the whole zone into memory) the soonest expiration
time of recently queried records.

There could be a new "rndc" protocol verb that asks the nameserver for a
list of all the zones where the soonest expiration time is below some
threshold, or asks about a particular zone.

Of course in that case the monitoring agent would be in a "privileged"
position to query the nameserver's internal control plane, rather than
having to send queries through "the front door".

Both kinds of monitoring are likely important, but more visibility via
the control plane may be able to offer a precise/timely view.

- Check each nameserver's control plane.
- Check as much of the zone as possible.
- Check each nameserver VIP over each supported protocol
  (UDP, TCP, DoT, DoQ, ...)
- From multiple vantage points if possible/applicable when
  service is on anycast IPs.

Perhaps, through OARC, support development of monitoring plugins that
many operators can use?

If, after all the past incidents, minor and not so minor, operators
still aren't doing adequate monitoring, perhaps we (the software
and standards developers) haven't given them adequate tools?

--
Viktor.
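As a rough illustration of the "check as much of the zone as possible" item above, here is a Python sketch that scans zone text (for example saved AXFR or dig output) and reports every RRSIG whose remaining validity is below a threshold. It assumes each record is on one line in the form "owner TTL class type rdata" and that the expiration field uses the YYYYMMDDHHmmSS presentation format; the function names and the sample records mentioned afterwards are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def rrsig_expirations(zone_text):
    """Yield (owner, type_covered, expiration) for each RRSIG line.

    Assumes "owner TTL class RRSIG covered alg labels origttl expiration ..."
    with the expiration written as YYYYMMDDHHmmSS (UTC).
    """
    for line in zone_text.splitlines():
        fields = line.split()
        # Require RRSIG in the type position, so that an NSEC type bitmap
        # that merely lists "RRSIG" is not mistaken for a signature.
        if len(fields) < 9 or fields[3] != "RRSIG":
            continue
        expiration = datetime.strptime(fields[8], "%Y%m%d%H%M%S")
        yield fields[0], fields[4], expiration.replace(tzinfo=timezone.utc)


def expiring_soon(zone_text, now, threshold):
    """Return sorted (remaining, owner, type_covered) below the threshold."""
    return sorted(
        (expiration - now, owner, covered)
        for owner, covered, expiration in rrsig_expirations(zone_text)
        if expiration - now < threshold
    )
```

A monitoring job could run expiring_soon against every nameserver's copy of the zone with, say, threshold=timedelta(days=3) and alert when the returned list is not empty. Checking every RRSIG rather than only the apex records is what would catch the case where only changed records keep getting fresh signatures.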
[dns-operations] post-mortem for ripe.net DNSSEC problem on 1 November 2023
Dear colleagues,

Please find below the post mortem for the DNSSEC problem that caused most
of RIPE NCC's services to become unavailable yesterday. Please reach out
if you have any questions or feedback.

Thanks,

Paul de Weerd
Manager Global Information Infrastructure team
RIPE NCC

Summary

On 1 November, from 10:45 to 12:15 UTC, most names in the ripe.net zone
were bogus due to expired DNSSEC signatures being served. This rendered
most of the RIPE NCC’s services unreachable.

After investigating the issue, we found a typo in a change to our zone
where a record had a TTL that was longer (864,000 seconds instead of
86,400) than the refresh interval for RRSIGs (seven days). This caused
our signer to stop refreshing signatures and only sign changes to the
zone.

We are talking to the vendor of our DNSSEC signing solution about this
case to see what can be improved on that end, have implemented a
pre-commit check to prevent TTLs longer than a day in the ripe.net zone
and are looking at improving monitoring for stale signatures to spot
issues like this before they cause problems.

Impact

DNSSEC signatures in the ripe.net zone are valid for 14 days, with our
signers configured to re-sign them after half that time (seven days). On
1 November at 10:45 UTC the signatures on several records in the ripe.net
zone expired. These records had last been signed on 18 October and were
due to be re-signed on the 25th. However, due to a problem with the TTL
on one record, our signer stopped re-signing records in the zone on 25
October. This resulted in the expiry of 11,026 out of 11,389 records on
1 November. New or changed records were still properly signed (363 of
them), which meant that our monitoring, which checks the signature
validity of the SOA record at the zone apex, missed this issue.

Because our internal resolvers are configured for DNSSEC validation, the
impact was rather immediate for staff, as many internal services broke
due to this issue.
After first dismissing some alternative causes, we quickly found the
problem was with expired signatures in the ripe.net zone, so we turned
our attention to our signers. At the same time, we temporarily disabled
DNSSEC validation on our internal resolvers so we could more easily
access our own systems while troubleshooting.

Resolution

While debugging, we found that the rrsig-refresh option that we
configured to seven days (half the value of the rrsig-lifetime option of
14 days) was likely involved. The logs showed:

  info: [ripe.net.] DNSSEC, signing zone
  error: [ripe.net.] DNSSEC, rrsig-refresh too low to prevent expired RRSIGs in resolver caches
  info: [ripe.net.] DNSSEC, next signing at 2023-10-25T10:02:02+
  error: [ripe.net.] zone event 're-sign' failed (invalid parameter)

At 12:14 UTC we removed that option from our configuration and we could
sign the zone again. The freshly signed zone was pushed out and went live
a little bit later, which meant that at 12:15 UTC our services were
available again for most users. Unfortunately, some users kept seeing
problems for several hours after we restored the signatures.

Root cause

After further investigation we found that the change that triggered this
problem introduced a record in the ripe.net zone with a TTL of 864,000
(ten days). Because this TTL is longer than our rrsig-refresh
configuration, this could lead to cases where a resolver’s cache contains
the record with an expired signature. The signer software rightfully
complained about this. We were surprised to find it then stopped
refreshing signatures for all records in the zone that didn’t change.

Future steps

During the incident and the aftermath we identified a few changes that we
want to make to improve the resiliency of our setup and allow us to find
cases like these before they become problems.
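The relationship between record TTLs and the signature refresh interval described under "Root cause" lends itself to a simple pre-commit style check. The Python sketch below is only an illustration and not RIPE NCC's actual pipeline: the function name ttl_problems, the (owner, ttl) input format, and the constants are assumptions, with the one-day limit and the seven-day refresh interval taken from the post mortem.

```python
MAX_TTL = 86_400              # one day: the pre-commit limit mentioned above
RRSIG_REFRESH = 7 * 86_400    # seven days: the configured rrsig-refresh


def ttl_problems(records, max_ttl=MAX_TTL, rrsig_refresh=RRSIG_REFRESH):
    """Return a list of messages for (owner, ttl) pairs with risky TTLs.

    A signature is refreshed once less than rrsig_refresh seconds of
    validity remain, so a served RRSIG has at least that much validity
    left. A record cached for longer than that can outlive its signature.
    """
    problems = []
    for owner, ttl in records:
        if ttl > rrsig_refresh:
            overshoot = ttl - rrsig_refresh
            problems.append(
                f"{owner}: TTL {ttl} exceeds rrsig-refresh {rrsig_refresh}; "
                f"a cached RRSIG could be expired for up to {overshoot} s"
            )
        elif ttl > max_ttl:
            problems.append(f"{owner}: TTL {ttl} exceeds limit {max_ttl}")
    return problems
```

With the values from the incident, a TTL of 864,000 seconds against a 604,800 second refresh interval leaves a window of up to 259,200 seconds (three days) in which a resolver could hold a record whose signature has already expired. A commit hook could reject the change whenever ttl_problems returns a non-empty list.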
Our current RRSIG freshness monitoring did not catch this case, because
the records we monitor still had valid and recent signatures, so we are
considering what we can do to cover this situation. We have also improved
our zone-editing pipeline to catch typos or misconfigurations for TTL
values.

Next to that, the problem also affected our ability to communicate
internally, as our internal chat system was unresolvable too. We have
some means of out-of-band communication, but will review how we can
improve that.

Additionally, while the status.ripe.net website is hosted on separate
infrastructure, the fact that it is also in the ripe.net domain meant
that it was just as unreachable as our other services. We will evaluate
this approach and see how we can improve on it.

Timeline (times in UTC)

25 October
08:52 a record was added to the ripe.net zone with a TTL of 10 days
08:53 knot incrementally signs ripe.net successfully
09:02 knot fails to sign the ripe.net zone for the first time

1 November
10:45 ripe.net signatures expire and many records go bogus
11:27 DNSSEC validation on internal resolvers was
[dns-operations] [ra...@psg.com: swedish dns zone enumerator]
A domain crawler (nothing catastrophic, just for information).

--- Begin Message ---
i have blocked a zone enumerator, though i guess they will be a whack-a-mole

others have reported them as well

/home/randy> sudo tcpdump -pni vtnet0 -c 10 port 53 and net 193.235.141
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vtnet0, link-type EN10MB (Ethernet), capture size 262144 bytes
22:42:39.516849 IP 193.235.141.90.32768 > 666.42.7.11.53: 14 NS? 33j4h.org.al. (30)
22:42:39.517640 IP 193.235.141.17.32768 > 666.42.7.11.53: 14 NS? 33m6d.xn--mgbayh7gpa. (38)
22:42:39.519169 IP 193.235.141.17.32768 > 666.42.7.11.53: 14 NS? 33lxd.tn. (26)
22:42:39.520064 IP 193.235.141.171.32768 > 666.42.7.11.53: 14 NS? 33md6.jo. (26)
22:42:39.521081 IP 193.235.141.247.32768 > 666.42.7.11.53: 14 NS? 33lxd.lb. (26)
22:42:39.523981 IP 193.235.141.162.32768 > 666.42.7.11.53: 14 NS? 33pd2.az. (26)
22:42:39.525043 IP 193.235.141.60.32768 > 666.42.7.11.53: 14 NS? 33nc5.com.al. (30)
22:42:39.526185 IP 193.235.141.209.32768 > 666.42.7.11.53: 14 NS? 33nc5.sz. (26)
22:42:39.527931 IP 193.235.141.150.32768 > 666.42.7.11.53: 14 NS? 33q5p.com.al. (30)
22:42:39.529516 IP 193.235.141.210.32768 > 666.42.7.11.53: 14 NS? 33qbq.com.al. (30)
10 packets captured
124 packets received by filter
0 packets dropped by kernel

inetnum:        193.235.141.0 - 193.235.141.255
netname:        domaincrawler-hosting
descr:          domaincrawler hosting
org:            ORG-ABUS1196-RIPE
country:        SE
admin-c:        VIJE1-RIPE
tech-c:         VIJE1-RIPE
status:         ASSIGNED PA
notify:         c+1...@resilans.se
mnt-by:         RESILANS-MNT
mnt-routes:     ETTNET-LIR
created:        2008-04-03T11:21:00Z
last-modified:  2017-04-10T12:47:06Z
source:         RIPE

randy
--- End Message ---
Re: [dns-operations] anchors.atlas.ripe.net/ripe.net - DNSSEC bogus due expiration
On Wed, Nov 01, 2023 at 12:18:42PM -0400, Viktor Dukhovni wrote a message of 67 lines which said:

> Specifically, in the case of signed zones, monitoring MUST also include
> regular checks of the remaining expiration time of at least the core
> zone apex records (DNSKEY, SOA and NS), and ideally the whole zone, both
> on the primary server and the secondaries.

Indeed. If you use Nagios or compatible (such as Icinga), I recommend
this plugin for signatures monitoring:

http://dns.measurement-factory.com/tools/nagios-plugins/check_zone_rrsig_expiration.html

(If you use Debian, it is in the package monitoring-plugins-contrib.)