Re: [Community-Discuss] 06 April 2019 RPKI incident - Postmortem report

Ben Maddison via Community-Discuss Wed, 10 Apr 2019 07:02:25 -0700

Hi Owen,


On 2019-04-10 15:00:28+02:00 Owen DeLong wrote:


On Apr 10, 2019, at 3:57 AM, Ben Maddison via Community-Discuss 
<[email protected]> wrote:

Hi all,


On 2019-04-10 12:10:22+02:00 Noah wrote:

+1 and Ack @saul

On Wed, 10 Apr 2019, 12:57 Saul Stein, 
<[email protected]> wrote:
Agreed.

There is a bigger issue at stake here: I have yet to see any evidence that 
AFRINIC takes RPKI seriously.

Until relatively recently, this attitude may have been understandable, since 
the RPKI was largely a curiosity with almost no impact on operations.
This is no longer the case, and all of the RIRs have serious work to do to 
improve operations in this area. This is clearly the case in this region.
Absent a badly flawed implementation, there’s no serious consequence to an RPKI 
outage… It merely reverts routing back to its previous unauthenticated state.

That's not entirely true. A partial outage (where for example a single TAL 
becomes unverifiable, as in this case) may lead to a missing ROA for a prefix 
that remains covered by other ROAs issued under other TALs.

Consider ROAs:
{prefix: 2001:db8::/32, maxLength: 48, asn: 65000, tal: AFRINIC}
{prefix: 2001:db8:f00::/48, maxLength: 48, asn: 65001, tal: RIPE}

With the above, a route 2001:db8:f00::/48 via 65000_65001 will have a status 
Valid.
If the RIPE TAL fails verification, it will become Invalid.
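
The Valid-to-Invalid flip in the example above can be sketched with a simplified RFC 6811-style origin check. This is only an illustration: the `validate` and `usable_roas` helpers and the `tal` field are hypothetical names made up for this sketch, and a real relying party works on cryptographically validated ROA payloads rather than plain dictionaries.

```python
from ipaddress import ip_network

# ROAs from the example above. The "tal" key is illustrative bookkeeping that
# records which trust anchor each ROA was validated under.
ROAS = [
    {"prefix": "2001:db8::/32", "max_length": 48, "asn": 65000, "tal": "AFRINIC"},
    {"prefix": "2001:db8:f00::/48", "max_length": 48, "asn": 65001, "tal": "RIPE"},
]


def validate(route_prefix, origin_asn, roas):
    """Return 'Valid', 'Invalid' or 'NotFound' for a route, RFC 6811 style."""
    route = ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa["prefix"])
        # A ROA covers the route if the route is equal to or more specific
        # than the ROA prefix (same address family).
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa["asn"] == origin_asn and route.prefixlen <= roa["max_length"]:
            return "Valid"
    return "Invalid" if covered else "NotFound"


def usable_roas(roas, failed_tals):
    """Drop every ROA that hangs off a trust anchor that failed verification."""
    return [roa for roa in roas if roa["tal"] not in failed_tals]


route, origin = "2001:db8:f00::/48", 65001  # origin is the last AS in 65000_65001
all_tals = validate(route, origin, ROAS)
without_ripe = validate(route, origin, usable_roas(ROAS, {"RIPE"}))
print(all_tals, "->", without_ripe)
```

With both TALs usable the /48 matches the second ROA. Once the RIPE-rooted ROA drops out, the route is still covered by the /32 ROA for a different origin, so it is Invalid rather than Not Found.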

This is most certainly a corner case, but is at least theoretically possible 
given that all RIRs claim 0/0 in their root certs.

A more likely scenario is that an existing mis-origination that is being 
dropped as Invalid suddenly becomes Not Found, and wins path selection, thereby 
misdirecting traffic.

I'm not aware of any such cases on our network from this last outage, but it's 
possible that they went undetected. The likelihood of this case increases 
substantially as more operators begin to drop Invalids.

I’m not convinced that all of the RIRs have serious work to do here. I think 
some of the RIRs have stable, reliable operations in this regard. I’m not yet 
convinced that the level of stability in AfriNIC operations here is impactful, 
let alone seriously impactful to operations.

In the last issue I had, when no ROAs could be added, deleted, etc., it was 
admitted that the issue had been known about for over two weeks without anything 
being posted to the announce list or fixed! After escalation to the CEO and 
others, it was fixed in a couple of hours!

As an operator community, we need to have a serious conversation about what we 
expect from afrinic (and the other RIRs). 24x7 availability comes with a price 
tag, as everyone on this list should be all too aware.
Any availability comes with a price tag. The higher the level of availability, 
the higher the price tag.

It is quite clear however, both from recent experience and from the postmortem 
below, that the current system is unfit for purpose.
Is it? I’m unconvinced at this time…

A system, whether human or computerised, that fails to mitigate an impending 
and predictable failure, and then takes several hours to correct it, is, in my 
book, the very definition of unfit for purpose.


RPKI is serious and needs to be taken seriously. We can’t continuously be 
having issues with it. It is like customs at immigration being offline!
RPKI is operational. I’m not sure how serious it is, as I have trouble taking 
seriously a system which, at best, tells you what you need to prepend. It’s a 
nice protection from fat fingers, but, in its current state, it provides little 
to no protection beyond that for anyone but the largest operators.

Becoming bestpath in a densely-interconnected network using a forged-origin 
hijack in the face of a ROA that has all its covered prefixes in the DFZ is 
actually not trivial, and often not possible, because you lose on path-length.

This is even more true in networks that filter peers and customers by prefix, 
as you have less chance of exploiting a higher local-pref to overcome 
path-length.
As a result, the protection that OV offers is not meaningless.
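
The path-length argument above can be illustrated with a toy best-path comparison. This is a sketch under loud assumptions: `select_best` and `accept` are hypothetical helpers, only local-pref and AS-path length are modelled (real BGP has many more tie-breaks), and the ASNs, prefixes and local-pref values are made up.

```python
def accept(route, prefix_filters):
    """Apply a per-neighbour prefix filter; neighbours without one are unfiltered."""
    allowed = prefix_filters.get(route["neighbor"])
    return allowed is None or route["prefix"] in allowed


def select_best(routes, prefix_filters=None):
    """Highest local-pref wins, then the shortest AS path."""
    candidates = [r for r in routes if accept(r, prefix_filters or {})]
    return max(candidates, key=lambda r: (r["local_pref"], -len(r["as_path"])))


PREFIX = "192.0.2.0/24"
# Legitimate route learned directly from the origin AS.
legit = {"name": "legit", "prefix": PREFIX, "neighbor": 65000,
         "local_pref": 100, "as_path": (65000,)}
# Forged-origin hijack: the attacker appends the authorised origin, so the
# route passes origin validation but is one hop longer.
forged_peer = {"name": "forged", "prefix": PREFIX, "neighbor": 64511,
               "local_pref": 100, "as_path": (64511, 65000)}
# The same hijack learned from a customer session with a higher local-pref.
forged_customer = dict(forged_peer, local_pref=200)

equal_pref = select_best([legit, forged_peer])["name"]
higher_pref = select_best([legit, forged_customer])["name"]
# With a prefix filter on the customer session the forged route is never accepted.
filtered = select_best([legit, forged_customer], {64511: {"198.51.100.0/24"}})["name"]
print(equal_pref, higher_pref, filtered)
```

At equal local-pref the forged route loses on path length. It only wins when a higher local-pref applies, and prefix filtering on that session removes that opportunity.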

OV is also a prerequisite for validating the entire path against stated policy, 
various mechanisms for which are currently WIP in SIDROPS and GROW.

Nonetheless, even if one wants to take RPKI seriously, a quick review of the 
RFCs and IETF guidance on the matter shows that the worst case scenario for an 
RIR outage on ROA publication should be that routing reverts to its pre-RPKI 
unauthenticated state. It should not cause any sort of outage (except to the 
extent you might start accepting routes you previously rejected).
If you’re rejecting routes for RPKI validation failure, you should be tracking 
down the advertisers and getting those situations corrected. If you’re doing 
that, then any such outages should be somewhere between minimal and 
non-existent.
If issuing party tools are unavailable to resource holders, they will be unable 
to effect the correction.

Did any packets go the wrong way due to the AfriNIC outage? Was there any 
actual operational impact?
I suspect not. I suspect that this is a lot of handwaving about a non-issue.

I suspect some, though certainly few. The same incident at a time when more 
networks are performing OV could look very different, and I'm simply suggesting 
that we look closely at the options before that happens.

Don’t get me wrong, I’m all for making AfriNIC’s systems more resilient and 
more available, but, I think we also need to consider the actual impact of 
failures and not over-react to failures without impact.
Based on the information in the post mortem, it does not look like a systems 
failure, but purely human error. Taking the humans out of the loop on that 
monthly maintenance would involve compromising the integrity of the private key 
and thus reduce the validity of the RPKI data. As such, I’m not convinced that 
there is a problem here to solve beyond the procedural changes that AfriNIC 
says they have already implemented.

I'm not arguing for any specific change. I would point out that when there are 
humans in the system, then a human failure *is* a system failure, and we should 
be clear on what failure modes can and cannot be tolerated.

Adding rpki-discuss, so that we can continue there...

Cheers,

Ben

Owen
From: Mark Tinka [mailto:[email protected]]
Sent: 10 April 2019 08:32 AM
To: [email protected]
Subject: Re: [Community-Discuss] 06 April 2019 RPKI incident - Postmortem report

Thanks, Cedrick.

A question that is, perhaps, obvious... are you able to take the human 
component out of this? If 2 reminders were not enough to get the humans to act, 
I'm not sure the current methodology is sustainable.

Mark.
On 8/Apr/19 17:46, Cedrick Adrien Mbeyet wrote:
Dear AFRINIC community,

Find below the postmortem report on the incident that happened on 06 April 2019.

The AFRINIC RPKI engine has an offline part that has to be renewed on a monthly 
basis. The process is known and documented, and automated reminders are set. The 
system is set to send 2 reminders each month, one 15 days prior to the expiry 
date and the second one 7 days before expiry. In the second half of March, the 
monitoring system sent a reminder to perform the offline refresh but this was 
not acted upon.
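
The reminder schedule described above can be sketched as below. Only the 15-day and 7-day offsets come from the report. The escalation threshold, the `refresh_status` helper and the exact expiry timestamp (the report says "around 07:24AM UTC") are assumptions made for illustration, not a description of AFRINIC's monitoring system.

```python
from datetime import datetime, timedelta, timezone

REMINDER_DAYS_BEFORE = (15, 7)  # offsets stated in the postmortem
ESCALATE_DAYS_BEFORE = 3        # hypothetical: page someone if still not refreshed


def reminder_times(expiry):
    """Return the times at which the two reminders are due."""
    return [expiry - timedelta(days=days) for days in REMINDER_DAYS_BEFORE]


def refresh_status(now, expiry, refreshed):
    """Classify the offline refresh as 'ok', 'remind', 'escalate' or 'expired'."""
    if refreshed:
        return "ok"  # a completed refresh would also move the expiry forward
    if now >= expiry:
        return "expired"
    if now >= expiry - timedelta(days=ESCALATE_DAYS_BEFORE):
        return "escalate"
    if now >= expiry - timedelta(days=max(REMINDER_DAYS_BEFORE)):
        return "remind"
    return "ok"


# Approximate expiry time taken from the report.
EXPIRY = datetime(2019, 4, 6, 7, 24, tzinfo=timezone.utc)

for due in reminder_times(EXPIRY):
    print("reminder due:", due.isoformat())
print(refresh_status(datetime(2019, 4, 4, tzinfo=timezone.utc), EXPIRY, refreshed=False))
```

The point of the extra state is that an unanswered reminder turns into an escalation before expiry, instead of the process silently depending on someone acting on an email.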


On Saturday 06 April 2019, the Certificate Revocation List (CRL) and the manifest 
file of the AFRINIC RPKI repository expired (around 07:24AM UTC). Our monitoring 
system picked this up. The immediate action was to generate new certificates 
and a new manifest file and upload them onto the RPKI engine system.

The failure was the result of human error. No changes were made to the system, 
but we have added steps to the existing process to ensure that this 
does not happen again. We do acknowledge that it is unacceptable to have such a 
failure with critical infrastructure, and the necessary has been done in this regard.


We do apologize for the inconvenience caused and thank you for your patience in 
this regard.

--

_______________________________________________________________

Cedrick Adrien Mbeyet

Infrastructure Unit Manager, AFRINIC Ltd.

t:  +230 403 5100 / 403 5115 | f: +230 466 6758 | tt: @afrinic | w: 
www.afrinic.net

facebook.com/afrinic | flickr.com/afrinic | youtube.com/afrinicmedia

______________________________________________________





_______________________________________________
Community-Discuss mailing list
[email protected]
https://lists.afrinic.net/mailman/listinfo/community-discuss
