Thank you, Job, for your excellent, detailed, and constructive analysis! Now get some rest. :-)
Definitely explains why we saw what we saw (Routinator affected, RIPE
Validator not). At this point we are moving to pivot to FORT and perhaps
rpki-client as well given the recent experiences.

Thanks again!

Tony

On Wed, Dec 2, 2020 at 4:33 PM Job Snijders <[email protected]> wrote:
> Hi all,
>
> First of all to be very clear: there was no 'APNIC outage', APNIC did
> nothing wrong. This was a 'validator outage', and locally outages like
> these can continue to be experienced at any future moment until fixed
> versions are released and deployed. Note: network operators who run FORT
> or OpenBSD rpki-client side-by-side with routinator/octorpki will have
> seen a stable VRP merged set item count on their EBGP routers. In this
> situation RPKI validator software diversity helped the Internet remain
> more stable.
>
> APNIC staff are commendable for having seen an opportunity to implement
> a workaround for this routinator 0.8.1 quirk, but APNIC is just one of
> the tens of thousands of Certificate Authorities in the RPKI ecosystem.
> In short: the observed state of December 1st, 2020 00:00 UTC is an
> expected and normal state in the RPKI ecosystem.
>
> I appreciate George for reaching out to the community to draw more
> attention to the situation, as it seems we can learn from exploring this
> situation in great detail. For many in the community RPKI is a new
> technology. Also it appears a similar issue exists in Cloudflare's
> OctoRPKI, so I notified their developers too about the problem &
> solution. Since there are implementations with a bug in the same
> equivalence class, this case is best handed over to the IETF.
>
> While keeping in mind our human perception of the concept of time
> generally is somewhat incompatible with how time works in the X.509 /
> RPKI crypto world... here are my lengthy debug notes.
> :-)
>
> TL;DR: the VRP drop is an implementation issue in some RPKI validators,
> can happen again
> solution: wait for fixed version, or run multiple different RPKI validator
> implementations side by side
> there is a bit of time pressure: this bug potentially interacts negatively
> with Juniper PR1483097.
>
> Every 20 minutes I copy all RPKI data from the Internet, run rpki-client
> [1], and store the original RPKI data files, the program's execution
> log, and the resulting VRP list as individual ZFS snapshots for
> post-mortem analysis. A copy of my data can be downloaded: it is an
> exact snapshot of all input data from that moment, to replay the event
> in various implementations.
> http://sobornost.net/~job/rpki-20201201-0001-adrian.sobornost.net.tar.gz
>
> Looking at the process log of the December 1st, 2020 run starting at
> midnight for the string 'apnic':
>
>     root@adrian:/tank/rpkirepositories/.zfs/snapshot/20201201-0001# fgrep apnic output/log
>     Dec 01 00:00:01 rpki-client: https://tal.apnic.net/apnic.cer: https schema ignored
>     Dec 01 00:00:01 rpki.apnic.net/repository: pulling from network
>     Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository: loaded from cache
>     Dec 01 00:00:03 rpki-client: rpki.apnic.net/member_repository: pulling from network
>     Dec 01 00:00:03 rpki-client: rpki.sub.apnic.net/repository: pulling from network
>     Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer: certificate has expired
>     Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/9lv88f3YSSS6iXQmzBvPX6hvnQM.cer: certificate has expired
>     Dec 01 00:00:03 rpki-client: rpki.rand.apnic.net/repo: pulling from network
>     Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/pBp2e-TKxusbiXQjNgwrQ1OsH_s.cer: certificate has expired
>     Dec 01 00:00:04 rpki-client:
>     rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZnMLuaQLNc_lmxGF9iLb0JAMbZA.cer: certificate has expired
>     Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/yZYCtJIcaINWT0smUVwdY-TPNkQ.cer: certificate has expired
>     Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/WFBPIARWFTaBikTQvkFutQVej0g.cer: certificate has expired
>     Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/QmfPXQMASo_v3yE5XQ_oJFSLE8E.cer: certificate has expired
>     Dec 01 00:00:05 rpki-client: rpki.sub.apnic.net/repository: loaded from cache
>     Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/maB2Nu64AHCDMDGWpYxBvsxoj4A.cer: certificate has expired
>     Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/d0JlIBzwsNjMdvAm-Ir2i1XpkO4.cer: certificate has expired
>     Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B3A24F201D6611E28AC8837C72FD1FF2/0I2GgcK-TUfCopBV9m5olVhGF_c.cer: certificate has expired
>     Dec 01 00:00:06 rpki-client: rpki.rand.apnic.net/repo: loaded from cache
>     Dec 01 00:00:12 rpki-client: rpki.apnic.net/member_repository: loaded from cache
>
> (At the end of the process's run it had observed 62,154 VRPs under the
> APNIC TAL. A CSV & JSON file of the validation process output with all
> VRPs from that moment is also included in the tar.gz file.)
>
> In the above log we see that a number of certificates are expired;
> according to Tom's message [2] these certificates represent APNIC
> members whose membership has been closed (for example: companies going
> out of business, or merger & acquisition). It is expected for
> organizations issuing cryptographic products to tie business events to
> validity periods in certificates.
>
> For the purpose of these notes I'll focus only on following the
> validation process towards 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer' in a manual
> fashion using command line utilities.
>
> After having pulled RPKI data from the web (which operationally speaking
> end-to-end is a multi-hour process to get the data from signer to
> validator), a number of process steps have to be performed in order to
> produce a list of Validated ROA Payloads (VRPs). None of these steps can
> be skipped, and the order is important too.
>
> A single manifest file (https://tools.ietf.org/html/rfc6486) actually is
> a bundle of a few things: a start & end date of the file listing, a list
> of filenames and sha256 hashes, an EE certificate (which also has its
> own embedded start & end date!), a serial number, and references to
> other things such as which entity signed it.
>
> The first step is to figure out whether a given manifest file is 'valid'
> (are the signatures right), 'current' (the timestamp on the
> validator's wall clock is between both the manifest's embedded start &
> end date AND the EE certificate validity dates), and the 'latest'
> (should the validator have to choose between two versions of the file,
> both valid and current, pick the one with the highest number).
>
> So at December 1st 00:00:03 UTC, the manifest's start & end date, and
> the EE certificate's start and end date were:
>
>     $ tar fxz rpki-20201201-0001-adrian.sobornost.net.tar.gz
>     $ cd 20201201-0001/data/rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2
>
>     $ ls -lahtr DmWk9f02tb1o6zySNAiXjJB6p58.mft
>     -rw-r--r-- 1 job wheel 214K Nov 30 23:01 DmWk9f02tb1o6zySNAiXjJB6p58.mft
>
> This file's mtime appears to be November 30th, 23:01
>
>     # check manifest's econtent start & end date
>     $ strings DmWk9f02tb1o6zySNAiXjJB6p58.mft | head -2
>     20201130230107Z
>     20201202230107Z
>
> December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd
> 23:01:07: check!
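
The 'current' test described above amounts to a simple interval comparison. A minimal Python sketch, assuming the two timestamps printed by strings are ASN.1 GeneralizedTime values in UTC; the function names are hypothetical and not taken from any validator:

```python
from datetime import datetime, timezone


def parse_generalized_time(value: str) -> datetime:
    """Parse a timestamp such as '20201130230107Z' into an aware UTC datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


def is_current(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """Return True when 'now' lies within the inclusive validity window."""
    return not_before <= now <= not_after


# Values copied from the manifest inspected above.
this_update = parse_generalized_time("20201130230107Z")
next_update = parse_generalized_time("20201202230107Z")

# The validator's wall clock at the time of the run.
wall_clock = datetime(2020, 12, 1, 0, 0, 3, tzinfo=timezone.utc)

manifest_is_current = is_current(wall_clock, this_update, next_update)
print("manifest current:", manifest_is_current)
```

A real validator would apply the same comparison a second time to the validity dates of the manifest's EE certificate.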
>
>     # check the manifest's embedded EE certificate start & end date:
>     $ test-mft -vp DmWk9f02tb1o6zySNAiXjJB6p58.mft | openssl x509 -text | grep -A2 Validity
>     Validity
>     Not Before: Nov 30 23:01:07 2020 GMT
>     Not After : Dec 2 23:01:07 2020 GMT
>
> December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd
> 23:01:07: check!
>
> With the dates and signatures of the manifest file checking out as 'all
> lights green', the next step is to process the manifest's file listing.
> A manifest 'file listing' is checked through two steps:
>
> - is the listed file present?
> - is the sha256 hash (in base64 format) listed on the manifest the
>   same as the sha256 hash computed by the validator using a copy of
>   the listed file?
>
>     # looking at manifest file listing:
>     $ test-mft -v DmWk9f02tb1o6zySNAiXjJB6p58.mft | grep -A1 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>     95: ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>     hash YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=
>
>     # checking whether file is present:
>     $ ls -alhtr ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>     -rw-r--r-- 1 job wheel 1.5K Nov 30 23:01 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>
>     # compute sha256 hash of the file
>     $ sha256 -b ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>     SHA256 (ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer) = YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=
>
> Indeed, the 'YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=' hash computed
> from the referenced certificate file is the same one as listed in the
> manifest file (which we inspected with test-mft)! Note that at this
> stage of the validation process the 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer'
> file has not been processed in any way other than the equivalent
> of that 'sha256' OpenBSD utility.
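
The hash comparison above can be reproduced without the OpenBSD sha256 utility. A minimal Python sketch, assuming the manifest entry's hash is available as the base64 string that test-mft prints; the helper names are hypothetical, and the demo uses placeholder bytes rather than the real .cer file:

```python
import base64
import hashlib


def sha256_base64(data: bytes) -> str:
    """Return the SHA-256 digest of 'data' encoded as base64 (like 'sha256 -b')."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


def matches_manifest_entry(data: bytes, listed_hash: str) -> bool:
    """Compare the computed hash of a file's bytes to the hash listed on the manifest."""
    return sha256_base64(data) == listed_hash


# Placeholder content standing in for a file fetched from a publication point.
file_bytes = b"placeholder certificate bytes"
listed_hash = sha256_base64(file_bytes)  # what a manifest would list for this content

print(matches_manifest_entry(file_bytes, listed_hash))         # unchanged file
print(matches_manifest_entry(file_bytes + b"x", listed_hash))  # altered file
```

To check the real file, one would read ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer in binary mode and compare against the hash shown in the test-mft output. Note that nothing here parses the file's contents, which is the point made above.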
>
> These 'jumps' from certificate to manifest to certificate using hashes &
> signatures serve multiple purposes: first, by confirming a hash matches,
> the validator does not (yet) need to attempt any file content parsing
> (which would potentially be sensitive computing operations on a file
> that is, at that point in time, unknown and potentially dangerous); and
> second, by checking the presence and hash of each file, the publication
> point's completeness and integrity are confirmed. Missing .roa files can
> result in network outages [3].
>
> At this point the manifest file has been completely processed, and the
> next step in the validation process can commence. Each and every
> referenced file is opened by the validator, embedded certificates and
> signatures are verified, and then again file contents processed (could
> be manifests, certificates, CRLs, or ROA files).
>
> Let's inspect ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer:
>
>     $ openssl x509 -in ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer -inform DER -text | grep -A2 Validity
>     Validity
>     Not Before: Oct 23 10:14:32 2019 GMT
>     Not After : Dec 1 00:00:00 2020 GMT
>
> As the validator's wall clock was December 1st 00:00:03, we can see that
> ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer expired '3 seconds ago'. Note that, as
> we observed before, the creation time of the manifest file which
> referenced this .cer file was November 30th; at that time this
> ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer certificate was valid, present, and
> current!
>
> One could say that ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer is a child of
> DmWk9f02tb1o6zySNAiXjJB6p58.mft. The ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
> file might not even be under control of the entity which generated
> DmWk9f02tb1o6zySNAiXjJB6p58.mft. A child's expiry does not result in the
> death of the parent.
> If a validator considers all referenced files on a
> manifest to be invalid, solely because *upon further inspection* a file
> contained an expired EE certificate, I'd say it is an
> 'overreaction', a simple software defect. After all, there was a valid
> current manifest which listed a hash and that hash matched the file, so
> the file became eligible for X.509 certificate validation in the first
> place!
>
> It appears that Routinator conflates two distinct steps in the
> validation process:
>
> step 1) checking the validity of a RPKI manifest
> step 2) checking the validity of a file referenced from the manifest
>         validated in step 1
>
> A valid manifest referencing a (now expired) certificate is a legitimate
> state of being. What is not valid is for the manifest listing itself to
> be expired, or the manifest's EE certificate to be expired, or its CRL
> to be expired, or its parent certificate to be expired, or for any files
> listed on the manifest to be missing, or for any sha256 hashes to be
> different than listed on the manifest. Phew.... that's a mouthful of
> conditions! We're gonna have to work in IETF to capture this in simpler
> English.
>
> Conclusion
> ==========
>
> I'm not saying validators should accept expired data, they shouldn't!
> But it is *expected* that Certificate Authorities (like LIRs, NIRs, or
> even RIRs) set the expiration dates on cryptographic objects to be
> aligned with the reality of business contracts. This is a *critical*
> feature of the RPKI and makes RPKI superior to IRR data: finally there
> /are/ expiration dates on the equivalent of 'route:' objects.
>
> A repeat of the 'December 1st' VRP drop situation can come into
> existence at any future moment under any Trust Anchor, under any
> Certificate Authority. Simply put: networks solely relying on current
> versions of octorpki or routinator are somewhat at risk when billing
> cycles end.
> Also, I do not recommend downgrading to older versions
> because of https://www.nlnetlabs.nl/projects/rpki/security-advisories/
> (which perversely is a bug that *is not* resolved with rpki software
> diversity).
>
> I suspect it is OK for network operators to choose to sit this one out
> and just wait for a fixed version, provided it can be released in a
> matter of weeks. Because of Juniper PR1483097 (which probably still
> affects many currently deployed internet routers) the complete
> disappearance of VRPs can negatively impact internet traffic forwarding
> in the default-free zone, but as mentioned before impact is avoided
> through multi-instance validator deployment combined with validator
> software diversity.
>
> There is a silver lining in all this: the most likely next occurrence
> of this type of situation is January 1st, 2021, as then all kinds of
> LIR, NIR, or RIR business contracts are likely to start or stop. This
> gives nlnetlabs and cloudflare almost a full month to figure out a fix,
> release it, and for operators to deploy it in their networks during the
> holidays. The perfect excuse to escape any unwanted Christmas dinner. ;-)
>
> I propose some of us continue discussion at [email protected] where,
> through wordsmithing in the draft-ietf-sidrops-6486bis effort, we can
> keep future RPKI implementers from walking into the same problem.
>
> Kind regards,
>
> Job
>
> [1]: https://pkgs.org/search/?q=rpki-client
> [2]: https://lists.nlnetlabs.nl/pipermail/rpki/2020-December/000238.html
> [3]: https://blog.apnic.net/2020/11/10/rpki-manifests-securely-declare-contents/
>
--
RPKI mailing list
[email protected]
https://lists.nlnetlabs.nl/mailman/listinfo/rpki
