On Fri, Aug 8, 2025 at 2:24 AM Saku Ytti via NANOG <nanog@lists.nanog.org>
wrote:

> On Fri, 8 Aug 2025 at 12:19, Måns Nilsson via NANOG
> <nanog@lists.nanog.org> wrote:
>
> > my one advice on anycast is to make _certain_ that the routing reflects
> > service availability on individual nodes -- i.e a node that can't answer
> > queries MUST stop advertising the resolver /128 (or /32 if you have
> > that).
>
> If you do this in a single ASN, where you can guarantee preferences
> are honored, then instead of pulling advertisement, deprefer it.
>
> Eventually you will manage to cause an issue, where all advertisements
> are falsely pulled.
>
> Same strategy works in any domain where you are testing if something
> works, like default route by pinging 8.8.8.8, don't pull, depref.
>


Having been bitten by this in the past...never base your determination of
"healthy" or "working" on a single external data reference.
It can be tempting to just assume 8.8.8.8 will always be "up" and
"pingable" to verify your internet connectivity is good...right up to the
point where Google has a routing snafu.  Then your DNS infrastructure goes
into cascading failure as every one of your sites begins depreferencing its
announcements based on the failed external health check.  The load shifts
onto a smaller and smaller number of serving sites, the ones that were
slower to detect the failure and depreference their route announcements,
often to the point where the final site is so overwhelmed by all the
traffic slamming it that it can't even run its health check or depreference
anymore.

Always have at least 3 external probe destinations or health check sites,
operated by different entities, and only depreference when at least 2 of
the 3 are unreachable.  Do not make decisions about the health of your network based
upon the health of a single external entity (unless they are your only
upstream provider, or you otherwise share fate with them).

If you're pinging someone else to make sure the internet is still alive,
ping several, like 8.8.8.8, 1.1.1.1, and 9.9.9.9, and don't react unless
you see failures to reach multiple of them.  Otherwise, it's likely to be
their failure, not yours, and there's no reason to make things worse by
changing your systems based on their problems.
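
To make the "don't react to a single failure" rule concrete, here is a
rough Python sketch.  The function names, the 2-second timeout, and the
Linux-style ping flags (-c, -W) are my assumptions for illustration, not
anything from a particular product; the probe is injectable so the
decision logic can be exercised without touching the network.

```python
import subprocess
from typing import Callable, Iterable

# Probe targets named above; each is run by a different operator.
PROBE_TARGETS = ("8.8.8.8", "1.1.1.1", "9.9.9.9")


def ping_once(target: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to `target` is answered.

    Assumes a Linux iputils-style `ping` (-c count, -W timeout in seconds).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), target],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing ping binary counts as a failed probe.
        return False
    return result.returncode == 0


def should_depref(
    targets: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
    min_failures: int = 2,
) -> bool:
    """Return True only when at least `min_failures` targets are unreachable.

    One unreachable target is treated as that operator's problem, not ours.
    """
    failures = sum(1 for target in targets if not probe(target))
    return failures >= min_failures
```

Calling should_depref(PROBE_TARGETS) from whatever drives your route
policy gives a yes/no answer; the result should lower the preference of
the announcement rather than withdraw it, per the advice quoted above.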

...so many painful lessons learned the hard way over the years...  ^_^;

Matt
_______________________________________________
NANOG mailing list 
https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/W2SHRX2FZK7KSPSJZMVBIBANJ5EYASIE/
