Hello
I find that the type of outage that affects our network the most is
neither of the two options you describe. As is probably typical for
smaller networks, we do not have redundant uplinks to all of our
transits. If a transit link goes down, for example because we had to
reboot a router, traffic is supposed to reroute to the remaining transit
links.
Internally our network handles this fairly fast for egress traffic.
However, the problem is the ingress traffic: it can take 5 to 15 minutes
before everything has settled down. This is the time it takes for
everyone else on the internet to process that they have to switch to our
alternate transit.
The only solution I know of is to have redundant links to all transits.
Going forward I will make sure we have this because it is a huge
disadvantage not being able to take a router out of service without
causing downtime for all users. Not to mention that a router crash or
link failure, which should take seconds at most to reroute around,
instead causes at least 5 minutes of unstable internet.
Regards,
Baldur
On 09/01/2017 at 23:56, Laurent Vanbever wrote:
Hi NANOG,
We often read that the Internet (i.e. BGP) is "slow to converge". But how slow
is it really? Do you care anyway? And can we (researchers) do anything about it?
Please help us find out by answering our short anonymous survey
(<10 minutes).
Survey URL: https://goo.gl/forms/JZd2CK0EFpCk0c272
<https://goo.gl/forms/WW7KX5kT45m6UUM82>
** Background:
While existing fast-reroute mechanisms enable sub-second convergence upon
local outages (planned or not), they do not apply to remote outages happening
further away from your AS as their detection and protection mechanisms only
work locally.
Remote outages therefore mandate a "BGP-only" convergence which tends to be
slow, as long streams of BGP UPDATEs (up to hundreds of thousands of them) must
be propagated router-by-router. Our initial measurements indicate that it can
take state-of-the-art BGP routers dozens of seconds to process and propagate
these large streams of BGP UPDATEs. During this time, traffic for important
destinations can be lost.
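As a rough illustration of the router-by-router argument above, the following back-of-the-envelope sketch estimates a worst-case delay if every router on the path had to finish processing the whole burst before the next one could. The function name, the processing rate, and the hop count are hypothetical values chosen for illustration, not measurements from this survey or from any particular router.

```python
def estimate_serial_convergence_seconds(num_updates, updates_per_second, hops):
    """Worst-case estimate: each router handles the full burst in turn.

    All inputs are illustrative assumptions, not measured values.
    """
    if updates_per_second <= 0:
        raise ValueError("updates_per_second must be positive")
    if num_updates < 0 or hops < 0:
        raise ValueError("num_updates and hops must be non-negative")
    # Time one router needs to process and propagate the whole burst.
    per_router_seconds = num_updates / updates_per_second
    # Serial (store-and-forward) assumption: delays add up hop by hop.
    return per_router_seconds * hops


# Hypothetical numbers: 500,000 UPDATEs, 20,000 UPDATEs/s per router, 3 hops.
estimate = estimate_serial_convergence_seconds(500_000, 20_000, 3)
print(f"estimated convergence: {estimate:.0f} s")
```

Real routers pipeline UPDATEs rather than waiting for the whole burst, so this should be read as a crude upper bound under the stated assumptions.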
** This survey:
This survey aims at evaluating the impact of slow BGP convergence on
operational practices. We expect the findings to increase the understanding of
the perceived BGP convergence in the Internet, which could then help
researchers to design better fast-reroute mechanisms.
We expect the questionnaire to be filled out by network operators whose job
relates to BGP operations. It has a total of 17 questions and should take less
than 10 minutes to answer. The survey and the collected data are anonymous (so
please do *not* include information that may help to identify you or your
organization).
All questions are optional, so if you don't like a question or don't know the
answer, please skip it.
A summary of the aggregate results will be published as a part of a scientific
article later this year.
Thank you so much in advance, and we look forward to reading your responses!
Laurent Vanbever (ETH Zürich, Switzerland)
PS: It goes without saying that we would also be extremely grateful if you could
forward this email to any operator you might know who may not read NANOG.