On Mon, 3 Aug 2020 at 12:14, Philip Homburg <[email protected]> wrote:
> This is not an ideal place to discuss how atlas works. It would be better to
> use the atlas mailing list
> (https://lists.ripe.net/mailman/listinfo/ripe-atlas).
>
> In any case, we have 2 controllers for software probes. Those controllers are
> kept separate from controllers for hardware probes. We placed both
> controllers at Hetzner and unfortunately, Hetzner seems to have some issues
> recently (or at least more than I remember). Typically a controller handles
> about 500 probes, so it will take some time before we will get controllers in
> other places in the world.
>
> Quite a bit of atlas backend logic depends on probes having just one
> controller at a time, so this is unlikely to change soon.
>
> Sending probes to different controllers happens automatically, but it happens
> on a time scale of around 6 hours. It seems that the issue at Hetzner was
> less than 2 hours. However, with all controllers for software probes at
> Hetzner, a long failure at Hetzner would indeed impact all software probes.

In this case, why not set up the software probes with a fall-back RIPE Atlas
Anchor (controller)? E.g.:

- Set up a list of 2 or more Anchors in hierarchical order,
- If the software probe cannot reach the primary Anchor for more than 10-20
  minutes, fall back to the next reachable Anchor,
- Retest connectivity to all configured Anchors every 6 hours,
- Revert to the first reachable Anchor in the locally configured list
  (in hierarchical order).

* The above idea assumes an Anchor can temporarily handle more than the
  default ~500 probes per Anchor, plus co-location diversity of >= 2 providers
  and >= 3 Anchors.

-- 
Chriztoffer
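
For illustration, the fall-back steps proposed above could be sketched roughly
as follows in Python. This is only a sketch under stated assumptions, not how
Atlas probes or controllers actually behave: the class name, the hostnames,
the `is_reachable` callback and the 15-minute threshold (picked from the
middle of the 10-20 minute range) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

FALLBACK_AFTER_S = 15 * 60     # assumed value inside the proposed 10-20 minutes
RETEST_EVERY_S = 6 * 60 * 60   # retest all configured controllers every 6 hours


@dataclass
class ControllerSelector:
    """Hypothetical probe-side selection logic; not part of RIPE Atlas."""
    controllers: List[str]               # hostnames, highest priority first
    is_reachable: Callable[[str], bool]  # caller-supplied connectivity check
    current: int = 0                     # index of the controller in use
    unreachable_since: Optional[float] = None
    last_retest: float = 0.0

    def first_reachable(self) -> Optional[int]:
        # Walk the list in hierarchical order and return the first index
        # whose controller answers, or None if none of them do.
        for i, host in enumerate(self.controllers):
            if self.is_reachable(host):
                return i
        return None

    def tick(self, now: float) -> str:
        """Update state for time `now` (seconds); return the controller to use."""
        if now - self.last_retest >= RETEST_EVERY_S:
            # Periodic retest: revert to the first reachable list entry.
            self.last_retest = now
            best = self.first_reachable()
            if best is not None:
                self.current = best
                self.unreachable_since = None
            return self.controllers[self.current]

        if self.is_reachable(self.controllers[self.current]):
            self.unreachable_since = None
        elif self.unreachable_since is None:
            self.unreachable_since = now   # start the outage timer
        elif now - self.unreachable_since >= FALLBACK_AFTER_S:
            # Outage exceeded the threshold: fall back to the first
            # controller in the list that is reachable right now.
            best = self.first_reachable()
            if best is not None:
                self.current = best
                self.unreachable_since = None
        return self.controllers[self.current]
```

A probe would call `tick()` periodically (e.g. once a minute with
`time.monotonic()`) and connect to whatever it returns. Note that the probe
still talks to only one controller at any given time, which matters given the
backend constraint mentioned in the quoted message.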
