[top post only]

Thanks for this Job, interesting analysis.

Another question here: at what interval is data from a given RIR repository ingested / operationalized by a given network operator? Put differently, any idea how much lag there is today between a change in an RIR RPKI repository and that change becoming OV policy in _your_ routers? I'm sure this varies, but I'm not sure by how much within a given operator, or across operators.


-danny



On 2020-04-05 14:29, Job Snijders wrote:
Dear Danny, others,

On Fri, Apr 03, 2020 at 04:56:41PM -0400, Danny McPherson wrote:
I also look forward to [your] analysis of the Rostelecom incident that
occurred in the same timeframe.

I've taken a look at the incident. 2,666 VRPs disappeared around
2020-04-01T16:32Z. For the purpose of this analysis the list of affected VRPs is http://instituut.net/~job/deleted-vrps-ripe-2020-04-01-16-32.txt

Andree Toonk (BGPMon) was so kind as to compile a list of prefixes which were
wrongly originated by Rostelecom during the incident at 2020-04-01T19:27Z:
https://portal.bgpmon.net/data/12389_apr2020.txt

The above list is not the full list of prefixes affected by this leak.
The leak appears to have included route announcements that 12389
received from some customers and some peers, in addition to 'bgp
optimiser'-style more-specific hijacks. Full list is available here:
https://map.internetintel.oracle.com/api/leak_prefixes/20764_12389_1585768500.pfxs
I'm leaving the 'merely leaked, otherwise untouched' routes out of this
analysis as those are outside the scope of Origin Validation: the
fabricated routes in relation to missing RPKI VRPs are what matters
for this analysis.

If we take the intersection of Andree's list with the list of missing
VRPs, we have the IP addresses that were affected by both the RIPE NCC
RPKI Deletion incident and the Rostelecom BGP incident. These are the
following 12 prefixes (4,352 IP addresses):

    peer_count  start_time           alert_type          base_prefix       base_as  announced_prefix  src_AS  Affected_ASname  example_ASPath
    49          2020-04-01 19:30:34  more_spec_by_other  91.195.240.0/23   47846    91.195.240.0/24   12389   SEDO-AS, DE      24751 20764 12389
    12          2020-04-01 19:29:55  more_spec_by_other  62.122.168.0/21   50245    62.122.170.0/24   12389   SERVEREL-AS, NL  18356 38794 4651 4651 20764 12389
    11          2020-04-01 19:30:34  more_spec_by_other  91.203.184.0/22   41064    91.203.187.0/24   12389   SKYROCK, FR      29430 13030 20764 12389
    6           2020-04-01 19:32:12  more_spec_by_other  109.206.160.0/19  50245    109.206.164.0/23  12389   SERVEREL-AS, NL  49673 24811 20764 12389
    6           2020-04-01 19:32:12  more_spec_by_other  109.206.160.0/19  50245    109.206.174.0/23  12389   SERVEREL-AS, NL  49515 197595 20764 12389
    6           2020-04-01 19:32:12  more_spec_by_other  109.206.160.0/19  50245    109.206.178.0/23  12389   SERVEREL-AS, NL  49673 24811 20764 12389
    6           2020-04-01 19:32:12  more_spec_by_other  109.206.160.0/19  50245    109.206.168.0/23  12389   SERVEREL-AS, NL  49673 24811 20764 12389
    6           2020-04-01 19:32:12  more_spec_by_other  109.206.160.0/19  50245    109.206.180.0/23  12389   SERVEREL-AS, NL  43317 20764 12389
    5           2020-04-01 19:33:04  more_spec_by_other  109.206.160.0/19  50245    109.206.161.0/24  12389   SERVEREL-AS, NL  49515 197595 20764 12389
    5           2020-04-01 19:33:04  more_spec_by_other  109.206.160.0/19  50245    109.206.170.0/24  12389   SERVEREL-AS, NL  49673 24811 20764 12389
    5           2020-04-01 19:33:04  more_spec_by_other  109.206.160.0/19  50245    109.206.187.0/24  12389   SERVEREL-AS, NL  1126 24785 20562 20764 12389
    5           2020-04-01 19:33:04  more_spec_by_other  109.206.160.0/19  50245    109.206.166.0/24  12389   SERVEREL-AS, NL  51514 20562 20764 12389
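As a sanity check on the address count, the announced prefixes can be summed with Python's standard ipaddress module. This is only a sketch: the prefix list is copied by hand from the announced_prefix column of the table, and the variable names are illustrative.

```python
import ipaddress

# announced_prefix column of the table, copied by hand
announced = [
    "91.195.240.0/24", "62.122.170.0/24", "91.203.187.0/24",
    "109.206.164.0/23", "109.206.174.0/23", "109.206.178.0/23",
    "109.206.168.0/23", "109.206.180.0/23",
    "109.206.161.0/24", "109.206.170.0/24",
    "109.206.187.0/24", "109.206.166.0/24",
]

networks = [ipaddress.ip_network(p) for p in announced]

# The prefixes do not overlap, so a plain sum counts each address once.
total_addresses = sum(n.num_addresses for n in networks)

print(len(networks), "prefixes,", total_addresses, "addresses")
```

Three plus four /24s and five /23s give 7 * 256 + 5 * 512 addresses, which matches the figure quoted above.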

If we look at the list of ASNs which were most impacted, the top ten
seems mostly anchored to the US (thus under the ARIN TAL), and almost
all of them seem heavyweights in the cloud / CDN space.
https://portal.bgpmon.net/data/12389_apr2020_affected_asns.txt

The incorrect routing information covering the above listed prefixes
was observed by a limited number of BGPMon peers; for other affected
routes the peer_count was around 170. The RPKI incident lasted a
number of hours, but the Rostelecom routing incident lasted ten minutes
or so. (source:
https://map.internetintel.oracle.com/leaks#/id/20764_12389_1585768500)

If we assume the generation & propagation of these hijacks was the
result of operator error, I imagine the change could've been reverted
almost immediately but we'd still see a bit of sloshing for a few
minutes through the routing system. Or perhaps the 'waves' we can see in
Oracle's 3D rendering of the incident are the effects of Maximum Prefix
limits kicking in and various timers firing off at different times.

Were these prefixes just unlucky because some BGP optimiser algorithm
had chosen them for the purpose of traffic engineering? Was this the
result of sophisticated planning? In any case, I can't judge the impact
this routing incident had on the three above listed ASNs. I don't know
what the victim IPs are used for.

We have to keep in mind that a large portion of RIPE NCC's RPKI
repository, and of course the RPKI repositories of the other RIRs were
*not* affected. ISPs with 'invalid == reject' policies had a lot of RPKI
data (~134,516 VRPs) available and those VRPs did have positive effects
on the scope and reach of the hijacks. RPKI Invalid BGP announcements
don't propagate as well as Not-Found announcements.
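To illustrate why a deleted VRP matters here, below is a minimal sketch of the RFC 6811 classification. A route that is covered by at least one VRP but matched by none is Invalid, and the same route becomes Not-Found once the covering VRP disappears. The function name and the VRP tuple layout are my own, and the maxLength of 23 is an assumption for illustration, not taken from the actual deleted ROA.

```python
import ipaddress

def validate(route_prefix, origin_as, vrps):
    """Classify a route as 'valid', 'invalid' or 'not-found' (RFC 6811 style).

    vrps is a list of (prefix, max_length, asn) tuples.
    """
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp_prefix, max_length, vrp_as in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        # Skip VRPs of the other address family or that do not cover the route.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if route.prefixlen <= max_length and origin_as == vrp_as:
            return "valid"
    return "invalid" if covered else "not-found"

# Hypothetical VRP for the first prefix in the table (maxLength assumed).
vrps = [("91.195.240.0/23", 23, 47846)]

with_vrp = validate("91.195.240.0/24", 12389, vrps)   # VRP present
without_vrp = validate("91.195.240.0/24", 12389, [])  # VRP deleted
legit = validate("91.195.240.0/23", 47846, vrps)      # legitimate origin

print(with_vrp, without_vrp, legit)
```

With the VRP present, an 'invalid == reject' network drops the more-specific; with the VRP gone, the same announcement is Not-Found and is accepted like any other unsigned route.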

It appears the 'peer_count' for RPKI protected prefixes was
significantly lower (~140) than for prefixes not covered by RPKI ROAs
(~160). The 'peer_count' value can be considered a proxy metric for a
hijack's reach and impact. The RPKI Invalids in this leak propagated
through ASNs which we know have not yet deployed RPKI OV.

The above suggests to me that the unavailability of RPKI services during
routing incidents, or the lack of deployment of Origin Validation,
confirms what most of us already suspected: it is inconvenient.

RIPE NCC's service interruption appears to have affected 4,352 out of
the total of 5,945,764 misrouted IPs, and the 'peer_count' for the
illegitimate announcements was much lower (better) compared to other
prefixes.
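To put those two numbers in proportion, here is a back-of-the-envelope calculation using only the figures quoted above (variable names are illustrative):

```python
affected_ips = 4_352        # IPs hit by both the VRP deletion and the hijack
misrouted_ips = 5_945_764   # total misrouted IPs in the Rostelecom incident

share = affected_ips / misrouted_ips
print(f"{share:.4%} of the misrouted address space")
```

That works out to well under a tenth of a percent of the misrouted addresses.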

This leads me to believe this was not a deliberate plan dependent on a
process failure inside RIPE NCC; the incident's BGP data just doesn't
seem to show the incident maximally capitalised on the RPKI outage.

Kind regards,

Job

