On Mon, 3 Aug 2020 at 12:14, Philip Homburg <[email protected]> wrote:
> This is not an ideal place to discuss how atlas works. It would be better to 
> use the atlas mailing list 
> (https://lists.ripe.net/mailman/listinfo/ripe-atlas).
>
> In any case, we have 2 controllers for software probes. Those controllers are 
> kept separate from controllers for hardware probes. We placed both 
> controllers at Hetzner and unfortunately, Hetzner seems to have some issues 
> recently (or at least more than I remember). Typically a controller handles 
> about 500 probes, so it will take some time before we will get controllers in 
> other places in the world.
>
> Quite a bit of atlas backend logic depends on probes having just one 
> controller at a time, so this is unlikely to change soon.
>
> Sending probes to different controllers happens automatically, but it happens 
> on a time scale of around 6 hours. It seems that the issue at Hetzner was 
> less than 2 hours. However, with all controllers for software probes at 
> Hetzner, a long failure at Hetzner would indeed impact all software probes.

In this case, why not have the software probes set up with a fall-back
RIPE Atlas Anchor (controller)?

E.g.
- Set up a list of 2 or more Anchors in a hierarchical order,
- If the software probe cannot reach the primary Anchor for more than
10-20 minutes, fall back to the next reachable Anchor,
- Retest connectivity to all configured Anchors every 6 hours,
- Revert to the first reachable Anchor in the locally configured
list (in hierarchical order).
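
As a rough illustration only, the steps above could be sketched like
this in Python. This is not how the probe software works today; the
class name, the is_reachable callback and the 15-minute threshold (the
middle of the 10-20 minute window) are all assumptions of mine, and the
entries are called "controllers" here. Capacity limits per controller
are not modelled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

FAILOVER_AFTER_S = 15 * 60    # assumed: middle of the 10-20 minute window
RETEST_EVERY_S = 6 * 60 * 60  # retest all configured controllers every 6 hours


@dataclass
class ControllerSelector:
    """Hypothetical probe-side failover logic; not part of RIPE Atlas."""
    controllers: List[str]                # ordered by preference, primary first
    is_reachable: Callable[[str], bool]   # hypothetical reachability check
    current: int = 0                      # index of the controller in use
    down_since: Optional[float] = None    # when the current one was first seen down
    last_retest: Optional[float] = None   # time of the last full retest

    def first_reachable(self, start: int = 0) -> Optional[int]:
        # Walk the list in priority order, beginning at index `start`.
        for i in range(start, len(self.controllers)):
            if self.is_reachable(self.controllers[i]):
                return i
        return None

    def tick(self, now: float) -> str:
        """Call periodically with a timestamp in seconds; returns the controller to use."""
        if self.last_retest is None:
            self.last_retest = now

        # Every 6 hours: retest everything and revert to the
        # highest-priority controller that is reachable.
        if now - self.last_retest >= RETEST_EVERY_S:
            self.last_retest = now
            best = self.first_reachable()
            if best is not None:
                self.current, self.down_since = best, None
            return self.controllers[self.current]

        if self.is_reachable(self.controllers[self.current]):
            self.down_since = None
        elif self.down_since is None:
            # First failed check: start the outage timer.
            self.down_since = now
        elif now - self.down_since >= FAILOVER_AFTER_S:
            # Outage exceeded the threshold: move to the next reachable
            # entry further down the list, if there is one.
            nxt = self.first_reachable(self.current + 1)
            if nxt is not None:
                self.current, self.down_since = nxt, None
        return self.controllers[self.current]
```

The probe would call tick() from its main loop, e.g. once a minute with
time.monotonic(), and connect to whatever name it returns.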

* The above idea assumes that an Anchor can temporarily handle more
than the default ~500 probes per Anchor, plus co-location diversity of
>= 2 providers and >= 3 Anchors.


--
Chriztoffer
