Hi Randy,

> On 27 Oct 2021, at 19:45, Randy Bush <[email protected]> wrote:
>
>> We aim to keep this simple at an initial stage, closely monitor how
>> the environment behaves
>
> i am deeplying interested in how a CA and a PP (and RP and routers) are
> measured and monitored. in general, i am scared to death of the growing
> deployment of rpki and rov with so little, if any, measurement.
>
> so i would beg/encourage you to publish how you do this and maybe even
> think of making your tools more generally useful.

We have made a significant investment in monitoring and alerting, using
Prometheus. I will introduce the part of our monitoring relevant to the
repository content below (there is more), and we will include an update on
monitoring in our RIPE 83 presentation.

We have metrics for the Certification Authority (CA) system, monitor Relying
Party software instances, and run tools specifically for monitoring. We also run
smoke-tests (via the UI) and an end-to-end test that validates that VRPs for a
ROA created by a user become visible to RPs.

The metrics in the CA system are mostly for liveness (e.g. "job x is running
successfully"), ongoing publication, and error (rates). We do test (hosted) CA
creation/deletion in our staging environment - but not in our production
environment because we do not have the two (hosted, delegated) production LIR
accounts required.

As a liveness check for the publication server instances, we check when the
publication server last received a withdrawal and a publish, and when the most
recent notification.xml was written (both through an RP, via the serial, and
directly).
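
As a rough illustration only (not the actual monitoring code), a liveness
check of this kind could be sketched in Python as below. It reads the
session_id and serial from an RRDP notification file (RFC 8182) and flags any
signal whose last event is older than a maximum age. The function names, the
signal names, the sample XML and the one-hour limit are all assumptions made
for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone


def parse_notification(xml_text):
    """Return (session_id, serial) from an RRDP notification file (RFC 8182)."""
    root = ET.fromstring(xml_text)
    # session_id and serial are plain (un-namespaced) attributes of the root.
    return root.attrib["session_id"], int(root.attrib["serial"])


def stale_signals(last_seen, now, max_age):
    """Return the names of signals whose latest event is older than max_age."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)


# Hypothetical notification file, shortened for the example.
SAMPLE_NOTIFICATION = (
    '<notification xmlns="http://www.ripe.net/rpki/rrdp" version="1" '
    'session_id="9df4b597-af9e-4dca-bdda-719cce2c4e28" serial="3">'
    '<snapshot uri="https://repo.example/snapshot.xml" hash="00"/>'
    "</notification>"
)

if __name__ == "__main__":
    now = datetime(2021, 10, 28, 12, 0, tzinfo=timezone.utc)
    # Hypothetical timestamps; a real check would read them from metrics.
    last_seen = {
        "last_publish": now - timedelta(minutes=5),
        "last_withdraw": now - timedelta(hours=3),
        "notification_written": now - timedelta(minutes=2),
    }
    print(parse_notification(SAMPLE_NOTIFICATION))
    print(stale_signals(last_seen, now, max_age=timedelta(hours=1)))
```

Comparing the serial seen by an RP with the serial read directly from the
publication server would be a simple equality check on the second value
returned by parse_notification.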
Furthermore, we have three types of checks on the content of the repository. For
this, we have two endpoints on the CA system: "hash and filename of all files in
the repository" and "all VRPs".

For the files, using an internal tool, we check that:
* All files in the CA "filename+hash" endpoint are present in each repository
  (rsync instances, publication server instances, rrdp.ripe.net) after they
  have had time to converge.
* Not too many "leftover" files are present in each of the repository
  instances.
* No objects in the repository are due to expire within ~13.5 hours.
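
As a rough illustration only (this is not the internal tool), the three file
checks above could be sketched in Python as follows. The dict-based inputs
(filename to hash, filename to expiry time), the function names and the
leftover limit are assumptions made for this example; only the ~13.5 hour
window comes from the description above.

```python
from datetime import datetime, timedelta, timezone


def compare_repository(expected, actual, max_leftover):
    """Compare the CA's filename->hash view with one repository instance."""
    # Files the CA expects that are absent or have a different hash.
    missing = sorted(n for n, h in expected.items() if actual.get(n) != h)
    # Files the repository still serves that the CA no longer lists.
    leftover = sorted(n for n in actual if n not in expected)
    return {
        "missing_or_mismatched": missing,
        "leftover": leftover,
        "too_many_leftover": len(leftover) > max_leftover,
    }


def expiring_soon(not_after, now, window=timedelta(hours=13.5)):
    """Return objects expiring within the window (or already expired)."""
    return sorted(n for n, t in not_after.items() if t - now <= window)


if __name__ == "__main__":
    # Hypothetical data; a real check would fetch these from the endpoints.
    expected = {"a.roa": "h1", "b.mft": "h2", "c.crl": "h3"}
    actual = {"a.roa": "h1", "b.mft": "stale", "old.roa": "h9"}
    print(compare_repository(expected, actual, max_leftover=0))

    now = datetime(2021, 10, 28, 12, 0, tzinfo=timezone.utc)
    not_after = {
        "b.mft": now + timedelta(hours=6),
        "c.crl": now + timedelta(hours=20),
    }
    print(expiring_soon(not_after, now))
```

Running the comparison once per repository instance, after the convergence
delay has passed, keeps a slow instance from being masked by a healthy one.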
Using RP instances, we check that:
* All VRPs in the CA system show up in the effective VRPs within
  <time_threshold> (using rtrmon).

Because we also monitor that no files mismatch between the CA system and the
repositories, this check implies that the VRPs are visible in all the repository
instances.
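
The idea behind the VRP check can be sketched as a set comparison with a grace
period. The Python below is a minimal illustration under stated assumptions
and does not use or reproduce rtrmon: the VRP tuple, the "first seen"
timestamps on the CA side, the function name and the 30 minute threshold are
all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class VRP(NamedTuple):
    asn: int
    prefix: str
    max_length: int


def overdue_vrps(ca_first_seen, rp_vrps, now, threshold):
    """Return CA VRPs that an RP still lacks after the threshold has passed."""
    return sorted(
        vrp
        for vrp, first_seen in ca_first_seen.items()
        if vrp not in rp_vrps and now - first_seen > threshold
    )


if __name__ == "__main__":
    now = datetime(2021, 10, 28, 12, 0, tzinfo=timezone.utc)
    old = VRP(64496, "192.0.2.0/24", 24)
    late = VRP(64497, "198.51.100.0/24", 24)
    fresh = VRP(64498, "203.0.113.0/24", 24)
    # When each VRP first appeared in the CA system's "all VRPs" view.
    ca_first_seen = {
        old: now - timedelta(hours=5),
        late: now - timedelta(hours=2),
        fresh: now - timedelta(minutes=3),
    }
    rp_vrps = {old}  # effective VRPs reported by one RP instance
    # 'late' is reported; 'fresh' is still inside the grace period.
    print(overdue_vrps(ca_first_seen, rp_vrps, now, timedelta(minutes=30)))
```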
Please let us know if there is interest in the tool we use to compare
repositories; if so, we might add it to our roadmap.

Kind regards,
Ties