Re: [dns-operations] anchors.atlas.ripe.net/ripe.net - DNSSEC bogus due expiration

2023-11-02 Thread Mark Andrews



> On 3 Nov 2023, at 02:18, Viktor Dukhovni wrote:
> 
> On Thu, Nov 02, 2023 at 09:34:17AM +0100, Stephane Bortzmeyer wrote:
> 
>>> Specifically, in the case of signed zones, monitoring MUST also include
>>> regular checks of the remaining expiration time of at least the core
>>> zone apex records (DNSKEY, SOA and NS), and ideally the whole zone, both
>>> on the primary server and the secondaries.
>> 
>> Indeed. If you use Nagios or compatible (such as Icinga), I recommend
>> this plugin for signatures monitoring:
>> 
>> http://dns.measurement-factory.com/tools/nagios-plugins/check_zone_rrsig_expiration.html
> 
> I wonder whether the widely authoritative resolvers could do more to
> help?
> 
> For example, BIND loads zone data into memory.  It should be able to
> know the time of the soonest signature expiration for a zone, or at
> least (if not loading the whole zone into memory) the soonest expiration
> time of recently queried records.

When you let named perform the signing it does just that.  The RRSIGs are
in a heap.  We look at the earliest expiration and figure out when it is
due to be re-signed (could be in the past if the server was offline for a
while).  We set a timer.  When that timer expires we re-sign that RRset plus
several more, along with an updated SOA record, re-adding them to the heap.
We set a timer for the next batch.  If the primary has been down too long
and they have all expired, the entire zone will be signed this way when the
primary starts up.
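
The heap-and-timer scheme described above can be sketched roughly as follows. This is a toy model in Python, not named's code: the class name, the batch size and the way lifetime and refresh are passed in are all assumptions made for illustration, and the SOA update is left out.

```python
import heapq

class ResignQueue:
    """Toy model of a re-sign schedule: a min-heap keyed on RRSIG expiration."""

    def __init__(self, lifetime, refresh, batch=3):
        self.lifetime = lifetime   # validity given to each new signature (seconds)
        self.refresh = refresh     # re-sign this long before expiration (seconds)
        self.batch = batch         # RRsets re-signed per timer event (assumed value)
        self.heap = []             # entries are (expiration, rrset_name)

    def add(self, name, signed_at):
        """Record a signature made at `signed_at` (seconds since some epoch)."""
        heapq.heappush(self.heap, (signed_at + self.lifetime, name))

    def next_timer(self):
        """Time at which the earliest-expiring RRset is due; may be in the past."""
        return self.heap[0][0] - self.refresh if self.heap else None

    def run_due(self, now):
        """If the timer has fired, re-sign up to `batch` RRsets and return their names."""
        done = []
        if self.heap and self.next_timer() <= now:
            while self.heap and len(done) < self.batch:
                _, name = heapq.heappop(self.heap)
                done.append(name)
            for name in done:
                self.add(name, now)   # the fresh signature goes back on the heap
        return done
```

Because the timer is derived from the heap's smallest entry, a server that was offline simply finds the timer already in the past at startup and works through the heap batch by batch.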

> There could be a new "rndc" protocol verb that asks the nameserver for a
> list of all the zones where the soonest expiration time is below some
> threshold, or asks about a particular zone.
> 
> Of course in that case the monitoring agent would be in a "privileged"
> position to query the nameserver's internal control plane, rather than
> having to send queries through "the front door".
> 
> Both kinds of monitoring are likely important, but more visibility via
> the control plane may be able to offer a precise/timely view.
> 
> - Check each nameserver's control plane.
> - Check as much of the zone as possible.
> - Check each nameserver VIP over each supported protocol
>   (UDP, TCP, DoT, DoQ, ...)
> - From multiple vantage points if possible/applicable when
>   service is on anycast IPs.
> 
> Perhaps OARC could support development of monitoring plugins that
> many operators can use?
> 
> If, after all the past incidents (minor and not so minor), operators
> still aren't doing adequate monitoring, perhaps we (the software
> and standards developers) haven't given them adequate tools?
> 
> -- 
>Viktor.
> ___
> dns-operations mailing list
> dns-operations@lists.dns-oarc.net
> https://lists.dns-oarc.net/mailman/listinfo/dns-operations

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742  INTERNET: ma...@isc.org


___
dns-operations mailing list
dns-operations@lists.dns-oarc.net
https://lists.dns-oarc.net/mailman/listinfo/dns-operations


Re: [dns-operations] anchors.atlas.ripe.net/ripe.net - DNSSEC bogus due expiration

2023-11-02 Thread Viktor Dukhovni
On Thu, Nov 02, 2023 at 09:34:17AM +0100, Stephane Bortzmeyer wrote:

> > Specifically, in the case of signed zones, monitoring MUST also include
> > regular checks of the remaining expiration time of at least the core
> > zone apex records (DNSKEY, SOA and NS), and ideally the whole zone, both
> > on the primary server and the secondaries.
> 
> Indeed. If you use Nagios or compatible (such as Icinga), I recommend
> this plugin for signatures monitoring:
> 
> http://dns.measurement-factory.com/tools/nagios-plugins/check_zone_rrsig_expiration.html

I wonder whether the widely authoritative resolvers could do more to
help?

For example, BIND loads zone data into memory.  It should be able to
know the time of the soonest signature expiration for a zone, or at
least (if not loading the whole zone into memory) the soonest expiration
time of recently queried records.

There could be a new "rndc" protocol verb that asks the nameserver for a
list of all the zones where the soonest expiration time is below some
threshold, or asks about a particular zone.
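
As a rough illustration of what such a query could return, here is a minimal Python sketch. No such control-channel verb exists in the text above; the function name and the input dictionary (zone name mapped to its earliest RRSIG expiration) are hypothetical, standing in for data the nameserver would have to export itself.

```python
from datetime import datetime, timedelta, timezone

def zones_expiring_soon(soonest_expiry, threshold, now=None):
    """Return (zone, remaining) pairs whose earliest RRSIG expires within `threshold`.

    soonest_expiry: dict mapping zone name -> datetime of the earliest RRSIG
    expiration in that zone (hypothetical input, see the note above).
    """
    now = now or datetime.now(timezone.utc)
    hits = [(zone, expiry - now)
            for zone, expiry in soonest_expiry.items()
            if expiry - now < threshold]
    # Most urgent zone first.
    return sorted(hits, key=lambda pair: pair[1])
```

Passing a single-entry dictionary covers the "ask about a particular zone" case.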

Of course in that case the monitoring agent would be in a "privileged"
position to query the nameserver's internal control plane, rather than
having to send queries through "the front door".

Both kinds of monitoring are likely important, but more visibility via
the control plane may be able to offer a precise/timely view.

- Check each nameserver's control plane.
- Check as much of the zone as possible.
- Check each nameserver VIP over each supported protocol
  (UDP, TCP, DoT, DoQ, ...)
- From multiple vantage points if possible/applicable when
  service is on anycast IPs.

Perhaps OARC could support development of monitoring plugins that
many operators can use?

If, after all the past incidents (minor and not so minor), operators
still aren't doing adequate monitoring, perhaps we (the software
and standards developers) haven't given them adequate tools?

-- 
Viktor.


[dns-operations] post-mortem for ripe.net DNSSEC problem on 1 November 2023

2023-11-02 Thread Paul de Weerd

Dear colleagues,

Please find below the post mortem for the DNSSEC problem that caused 
most of RIPE NCC's services to become unavailable yesterday.


Please reach out if you have any questions or feedback.

Thanks,

Paul de Weerd
Manager Global Information Infrastructure team
RIPE NCC


Summary

On 1 November, from 10:45 to 12:15 UTC, most names in the ripe.net zone 
were bogus due to expired DNSSEC signatures being served. This rendered 
most of the RIPE NCC’s services unreachable. After investigating the 
issue, we found a typo in a change to our zone where a record had a TTL 
that was longer (864,000 seconds instead of 86,400) than the refresh 
interval for RRSIGs (seven days). This caused our signer to stop 
refreshing signatures and only sign changes to the zone. We are talking 
to the vendor of our DNSSEC signing solution about this case to see what 
can be improved on that end, have implemented a pre-commit check to 
prevent TTLs longer than a day in the ripe.net zone and are looking at 
improving monitoring for stale signatures to spot issues like this 
before they cause problems.



Impact

DNSSEC signatures in the ripe.net zone are valid for 14 days, with our 
signers configured to re-sign them after half that time (seven days). On 
1 November at 10:45 UTC the signature on several records in the ripe.net 
zone expired. These records had last been signed on 18 October and were 
due to be re-signed on the 25th. However, due to a problem with the TTL 
on one record, our signer stopped re-signing records in the zone on 25 
October. This resulted in the expiry of 11,026 out of 11,389 records on 
1 November. New or changed records were still properly signed (363 of 
them), which meant that our monitoring, which checks the signature 
validity of the SOA record at the zone apex, missed this issue.
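
To illustrate why an apex-only check can miss this, here is a minimal Python sketch that looks at the expiration field of every RRSIG it is given rather than just the SOA's. It assumes the records are available as presentation-format text lines with an explicit TTL and class (for example from a zone transfer dumped to text); the function names are illustrative, not part of any existing tool.

```python
from datetime import datetime, timezone

def rrsig_expiration(line):
    """Parse one RRSIG line in presentation format.

    Assumed layout: 'owner TTL IN RRSIG covered alg labels origttl
    expiration inception keytag signer signature', with the expiration
    written as YYYYMMDDHHmmSS in UTC.
    """
    fields = line.split()
    i = fields.index("RRSIG")
    covered, expiration = fields[i + 1], fields[i + 5]
    when = datetime.strptime(expiration, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return fields[0], covered, when

def stale(lines, now, min_remaining):
    """Return (owner, covered type) for every RRSIG with too little validity left."""
    out = []
    for line in lines:
        owner, covered, when = rrsig_expiration(line)
        if when - now < min_remaining:
            out.append((owner, covered))
    return out
```

With a freshly signed SOA and an old signature elsewhere in the zone, only the latter is reported, which is the case a SOA-only check cannot see.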


Because our internal resolvers are configured for DNSSEC validation, the 
impact was rather immediate for staff, as many internal services broke 
due to this issue. After first dismissing some alternative causes, we 
quickly found the problem was with expired signatures in the ripe.net 
zone, so we turned our attention to our signers. At the same time, we 
temporarily disabled DNSSEC validation on our internal resolvers so we 
could more easily access our own systems while troubleshooting.



Resolution

While debugging, we found that the rrsig-refresh option that we 
configured to seven days (half the value of the rrsig-lifetime option of 
14 days) was likely involved. The logs showed:


info: [ripe.net.] DNSSEC, signing zone
error: [ripe.net.] DNSSEC, rrsig-refresh too low to prevent expired RRSIGs in resolver caches
info: [ripe.net.] DNSSEC, next signing at 2023-10-25T10:02:02+
error: [ripe.net.] zone event 're-sign' failed (invalid parameter)

At 12:14 UTC we removed that option from our configuration and we could 
sign the zone again. The freshly signed zone was pushed out and went 
live a little bit later, which meant that at 12:15 UTC our services were 
available again for most users. Unfortunately, some users kept seeing 
problems for several hours after we restored the signatures.



Root cause

After further investigation we found that the change that triggered this 
problem introduced a record in the ripe.net zone with a TTL of 864,000 
(ten days). Because this TTL is longer than our rrsig-refresh 
configuration, this could lead to cases where a resolver’s cache 
contains the record with an expired signature. The signer software 
rightfully complained about this. We were surprised to find it then 
stopped refreshing signatures for all records in the zone that didn’t 
change.
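
The arithmetic behind that complaint can be sketched as follows. This is a simplified model in Python, assuming a signature is replaced once its remaining validity drops to the rrsig-refresh value; the signer's actual check may take more into account than this.

```python
DAY = 86_400

def worst_case_stale_window(ttl, rrsig_refresh):
    """Seconds a cached RRset could outlive its signature, in the worst case.

    Simplified model: a resolver may cache a signature that has only
    `rrsig_refresh` seconds of validity left (just before it is replaced)
    and keep it for `ttl` seconds.
    """
    return max(0, ttl - rrsig_refresh)

# The mistyped TTL (864,000 s) against a seven-day refresh: three days at risk.
print(worst_case_stale_window(864_000, 7 * DAY) // DAY)
# The intended TTL (86,400 s): no window at all.
print(worst_case_stale_window(86_400, 7 * DAY))
```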



Future steps

During the incident and the aftermath we identified a few changes that 
we want to make to improve the resiliency of our setup and allow us to 
find cases like these before they become problems. Our current RRSIG 
freshness monitoring did not catch this case, because the records we 
monitor still had valid and recent signatures, so we are considering 
what we can do to cover this situation. We have also improved our 
zone-editing pipeline to catch typos or misconfigurations for TTL values.
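
A check of that kind could look roughly like the Python sketch below. It is not the check RIPE NCC deployed: it assumes every record sits on one line as 'owner TTL class type rdata' with a plain integer TTL, and it ignores $TTL directives, unit suffixes and multi-line records. The function name is illustrative.

```python
MAX_TTL = 86_400  # one day, the limit mentioned in the summary

def ttl_violations(zone_lines, max_ttl=MAX_TTL):
    """Return (line number, owner, ttl) for records whose explicit TTL exceeds max_ttl."""
    bad = []
    for number, line in enumerate(zone_lines, start=1):
        fields = line.split(";", 1)[0].split()   # drop trailing comments
        if len(fields) >= 4 and fields[1].isdigit() and int(fields[1]) > max_ttl:
            bad.append((number, fields[0], int(fields[1])))
    return bad
```

A pre-commit hook would run this over the changed zone file and refuse the commit when the returned list is not empty.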


Beyond that, the problem also affected our ability to communicate 
internally, as our internal chat system was unresolvable too. We have 
some means of out-of-band communication, but will review how we can 
improve that.


Additionally, while the status.ripe.net website is hosted on separate 
infrastructure, the fact that it is also in the ripe.net domain meant 
that it was just as unreachable as our other services. We will evaluate 
this approach and see how we can improve on it.



Timeline (times in UTC)

25 October
08:52 a record was added to the ripe.net zone with a TTL of 10 days
08:53 knot incrementally signs ripe.net successfully
09:02 knot fails to sign the ripe.net zone for the first time

1 November

10:45 ripe.net signatures expire and many records go bogus
11:27 DNSSEC validation on internal resolvers was 

[dns-operations] [ra...@psg.com: swedish dns zone enumerator]

2023-11-02 Thread Stephane Bortzmeyer
A domain crawler (nothing catastrophic, just for information).
--- Begin Message ---
i have blocked a zone enumerator, though i guess they will be a
whack-a-mole

others have reported them as well

/home/randy> sudo tcpdump -pni vtnet0 -c 10 port 53 and net 193.235.141
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vtnet0, link-type EN10MB (Ethernet), capture size 262144 bytes
22:42:39.516849 IP 193.235.141.90.32768 > 666.42.7.11.53: 14 NS? 33j4h.org.al. (30)
22:42:39.517640 IP 193.235.141.17.32768 > 666.42.7.11.53: 14 NS? 33m6d.xn--mgbayh7gpa. (38)
22:42:39.519169 IP 193.235.141.17.32768 > 666.42.7.11.53: 14 NS? 33lxd.tn. (26)
22:42:39.520064 IP 193.235.141.171.32768 > 666.42.7.11.53: 14 NS? 33md6.jo. (26)
22:42:39.521081 IP 193.235.141.247.32768 > 666.42.7.11.53: 14 NS? 33lxd.lb. (26)
22:42:39.523981 IP 193.235.141.162.32768 > 666.42.7.11.53: 14 NS? 33pd2.az. (26)
22:42:39.525043 IP 193.235.141.60.32768 > 666.42.7.11.53: 14 NS? 33nc5.com.al. (30)
22:42:39.526185 IP 193.235.141.209.32768 > 666.42.7.11.53: 14 NS? 33nc5.sz. (26)
22:42:39.527931 IP 193.235.141.150.32768 > 666.42.7.11.53: 14 NS? 33q5p.com.al. (30)
22:42:39.529516 IP 193.235.141.210.32768 > 666.42.7.11.53: 14 NS? 33qbq.com.al. (30)
10 packets captured
124 packets received by filter
0 packets dropped by kernel

inetnum:        193.235.141.0 - 193.235.141.255
netname:        domaincrawler-hosting
descr:          domaincrawler hosting
org:            ORG-ABUS1196-RIPE
country:        SE
admin-c:        VIJE1-RIPE
tech-c:         VIJE1-RIPE
status:         ASSIGNED PA
notify:         c+1...@resilans.se
mnt-by:         RESILANS-MNT
mnt-routes:     ETTNET-LIR
created:        2008-04-03T11:21:00Z
last-modified:  2017-04-10T12:47:06Z
source:         RIPE

randy
--- End Message ---


Re: [dns-operations] anchors.atlas.ripe.net/ripe.net - DNSSEC bogus due expiration

2023-11-02 Thread Stephane Bortzmeyer
On Wed, Nov 01, 2023 at 12:18:42PM -0400,
 Viktor Dukhovni wrote 
 a message of 67 lines which said:

> Specifically, in the case of signed zones, monitoring MUST also include
> regular checks of the remaining expiration time of at least the core
> zone apex records (DNSKEY, SOA and NS), and ideally the whole zone, both
> on the primary server and the secondaries.

Indeed. If you use Nagios or compatible (such as Icinga), I recommend
this plugin for signatures monitoring:

http://dns.measurement-factory.com/tools/nagios-plugins/check_zone_rrsig_expiration.html

(If you use Debian, it is in the package monitoring-plugins-contrib.)
