On Thu, Jan 3, 2013 at 3:18 PM, Robert Raszuk <[email protected]> wrote:
> How are you going to clean the NLRIs in your network (both transit or
> stub) which were withdrawn in the messages your BGP implementation
> declared "bad" and decided to ignore ?

I can fix them later, maybe even after I've had time to fully analyze the
problem and get a software update from my vendor.  Maybe I'll try a refresh
or a session-reset, but I won't be at the mercy of a repeatedly flapping
session and a phone ringing off the hook with angry customers!

A lot of folks are thinking about this problem in the context of the big
carrier who doesn't want a hard-to-diagnose problem of 1 RIB entry being
wrong.  That's okay, it is one way to think about it.

A second way to think of it is as a small/regional ISP.  If one or more of
his transits are flapping because of a bad path on the DFZ, that is going
to cost him money and customers.  If he has no way to mitigate it, he is at
the mercy of external parties.  He could just use "ignore bad messages" and
at least stop bleeding money.  He does not care if he can't reach 5 /24s
at LANL, they are unimportant to him.  What is important is if he has any
customers left next week.

A third way is the small- or medium-datacenter network.  Imagine you are a
typical small/medium shop and you have some Cisco/Juniper/Brocade stuff for
your ASBRs and your core, but you bought a bunch of RainbowPoop Router Co
switches for your racks, because they are inexpensive and they support
EVPN, L3VPN, VPLS, or some other feature you want but Cisco/Juniper/Brocade
don't put into their inexpensive products.

So your network looks like this:

ISP1    ISP2

  CISCO  JUNIPER
  |    \/
  |    /\                \   |
  |   /  \                \  |
  TOR1    TOR2    ....    TOR99

Now imagine your JUNIPER supports NewVpnThing and that's a feature you
decided to use on the RainbowPoop TOR devices.  But TOR1 sends a bad BGP
update.  JUNIPER knows about NewVpnThing and sees a bad BGP attribute (that
it recognizes) so it does whatever the NewVpnThing spec says, and tears
down the session to TOR1.

CISCO, on the other hand, does not know about NewVpnThing so this router
doesn't even understand the update is bad.  It just passes it along to TOR2
.. TOR99.  Now those boxes all tear down their session to the CISCO.  Then
they re-establish.  Then they go down again.  They keep on doing this and
the network is freaking out.
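
To make the propagation concrete, here is a toy Python model of the scenario
above.  It is only a sketch under my own assumptions, not any vendor's
behavior: handle(), simulate(), and the "reset"/"ignore" policy labels are
made-up names, and "ignore" stands in for the "ignore bad messages" idea
applied by every box that understands the attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    origin: str       # router that first sent the update
    attr: str         # attribute carried in the update
    malformed: bool   # True if the attribute is broken


def handle(understands, update, policy):
    """Decide what one toy router does with one update."""
    if update.attr not in understands:
        # A router that does not know the attribute cannot validate it,
        # so it passes the update along unchanged.
        return "forward"
    if not update.malformed:
        return "accept"
    # The router recognizes the attribute and sees that it is bad.
    return "reset" if policy == "reset" else "ignore"


def simulate(policy):
    """Return the (router, peer) sessions torn down by one bad update."""
    bad = Update(origin="TOR1", attr="NewVpnThing", malformed=True)
    tors = [f"TOR{i}" for i in range(1, 100)]
    resets = []

    # JUNIPER knows NewVpnThing and judges the update itself.
    if handle({"NewVpnThing"}, bad, policy) == "reset":
        resets.append(("JUNIPER", bad.origin))

    # CISCO does not know NewVpnThing, so it forwards to the other TORs,
    # each of which knows the attribute and judges the update.
    if handle(set(), bad, policy) == "forward":
        for tor in tors:
            if tor == bad.origin:
                continue
            if handle({"NewVpnThing"}, bad, policy) == "reset":
                resets.append((tor, "CISCO"))
    return resets


if __name__ == "__main__":
    for policy in ("reset", "ignore"):
        print(policy, len(simulate(policy)), "sessions torn down")
```

In this model a single bad update from TOR1 takes down the JUNIPER session
to TOR1 plus a CISCO session on every other TOR under "reset", and nothing
under "ignore".  Real routers would then re-establish and repeat, which is
the flapping.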

By the time your in-house clue notices, your symptom is that 99 identical
TORs are flapping their BGP to your CISCO.  You probably don't even notice
the 1 TOR that is flapping to JUNIPER.  Maybe JUNIPER even logs something
helpful but you may not investigate it for a while.

So your CISCO, which is following the base spec, is carrying a buggy update
to your 99 other RainbowPoop TORs and they are all failing.  Your JUNIPER,
which knows about the NewVpnThing, is following its spec and protecting the
other TORs from this problem, but that is probably not much help since your
network is in chaos from all the flapping.

What do you do?  Call vendor support.  Probably for CISCO and RainbowPoop.
Well, now you are expecting the TAC of Cisco and the TAC of RainbowPoop to
cooperate, which they'll have trouble doing; and it may take ages before
anyone identifies that the root cause of the problem is really TOR1.

There are going to be a lot of RainbowPoop routers in the future, and many
of them may use BGP.  We should make BGP more robust.

-- 
Jeff S Wheeler <[email protected]>
Sr Network Operator  /  Innovative Network Concepts
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
