On Thu, Jan 3, 2013 at 3:18 PM, Robert Raszuk <[email protected]> wrote:
> How are you going to clean the NLRIs in your network (both transit or
> stub) which were withdrawn in the messages your BGP implementation
> declared "bad" and decided to ignore ?

I can fix them later, maybe even after I've had time to fully analyze
the problem and get a software update from my vendor. Maybe I'll try a
refresh or a session reset, but I won't be at the mercy of a repeatedly
flapping session and a phone ringing off the hook with angry customers!

A lot of folks are thinking about this problem in the context of the big
carrier who doesn't want a hard-to-diagnose problem of 1 RIB entry being
wrong. That's okay, it is one way to think about it.

A second way to think of it is as a small/regional ISP. If one or more
of his transits are flapping because of a bad path on the DFZ, that is
going to cost him money and customers. If he has no way to mitigate it,
he is at the mercy of external parties. He could just use "ignore bad
messages" and at least stop bleeding money. He does not care if he can't
reach 5 /24s at LANL; they are unimportant to him. What is important is
whether he has any customers left next week.

A third way is the small- or medium-datacenter network. Imagine you are
a typical small/medium shop and you have some Cisco/Juniper/Brocade
stuff for your ASBRs and your core, but you bought a bunch of
RainbowPoop Router Co switches for your racks, because they are
inexpensive and they support EVPN, L3VPN, VPLS, or some other feature
you want but Cisco/Juniper/Brocade don't put into their inexpensive
products. So your network looks like this:

    ISP1    ISP2
    CISCO   JUNIPER
      |  \/
      |  /\  \
      | |  /  \  \  |
    TOR1 TOR2 .... TOR99

Now imagine your JUNIPER supports NewVpnThing and that's a feature you
decided to use on the RainbowPoop TOR devices. But TOR1 sends a bad BGP
update. JUNIPER knows about NewVpnThing and sees a bad BGP attribute
(that it recognizes), so it does whatever the NewVpnThing spec says and
tears down the session to TOR1.

CISCO, on the other hand, does not know about NewVpnThing, so this
router doesn't even understand that the update is bad. It just passes it
along to TOR2 .. TOR99. Now those boxes all tear down their sessions to
the CISCO.
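
To make the failure mode concrete, here is a toy Python sketch of the
scenario so far. Everything in it is made up for illustration: the
Speaker and Update classes, the ignore_bad knob, and the 192.0.2.0/24
prefix are assumptions, and it does not model real BGP attribute flags
or any vendor's behavior. It only counts which boxes would reset a
session when a router that cannot parse NewVpnThing passes the bad
update along.

```python
from dataclasses import dataclass


@dataclass
class Update:
    prefix: str
    attrs: dict  # attribute name -> True if well-formed, False if malformed


@dataclass
class Speaker:
    name: str
    understands: set          # attribute names this box is able to parse
    ignore_bad: bool = False  # hypothetical knob: drop a bad update, keep the session
    resets: int = 0
    ignored: int = 0

    def receive(self, update: Update) -> bool:
        """Return True if the update is accepted and may be passed along."""
        # A box can only notice damage in attributes it knows how to parse.
        bad = any(
            not well_formed
            for name, well_formed in update.attrs.items()
            if name in self.understands
        )
        if not bad:
            return True
        if self.ignore_bad:
            self.ignored += 1   # discard the update, session stays up
        else:
            self.resets += 1    # tear the session down
        return False


def simulate(ignore_bad: bool, tor_count: int = 99) -> dict:
    feature = {"NewVpnThing"}
    cisco = Speaker("CISCO", understands=set())  # does not know the feature
    juniper = Speaker("JUNIPER", understands=feature, ignore_bad=ignore_bad)
    # TOR1 is the sender; TOR2..TOR<tor_count> are the receivers.
    tors = [
        Speaker(f"TOR{i}", understands=feature, ignore_bad=ignore_bad)
        for i in range(2, tor_count + 1)
    ]
    bad = Update("192.0.2.0/24", {"NewVpnThing": False})

    # TOR1 sends the bad update to both ASBRs.
    forwarders = [asbr for asbr in (cisco, juniper) if asbr.receive(bad)]
    # Any ASBR that accepted it passes it unchanged to the other TORs.
    for _ in forwarders:
        for tor in tors:
            tor.receive(bad)

    return {
        "juniper_resets": juniper.resets,
        "tor_resets": sum(t.resets for t in tors),
        "tor_ignored": sum(t.ignored for t in tors),
    }


if __name__ == "__main__":
    print("reset on bad update: ", simulate(ignore_bad=False))
    print("ignore bad update:   ", simulate(ignore_bad=True))
```

With ignore_bad off, one bad update from TOR1 costs a session on JUNIPER
plus one on every other TOR that heard it via CISCO. With it on, the
same update is dropped everywhere and no session goes down, at the price
of possibly stale state that has to be cleaned up later.
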

Then they re-establish. Then they go down again. They keep on doing this
and the network is freaking out. By the time your in-house clue notices,
your symptom is that 99 identical TORs are flapping their BGP to your
CISCO. You probably don't even notice the 1 TOR that is flapping to
JUNIPER. Maybe JUNIPER even logs something helpful, but you may not
investigate it for a while.

So your CISCO, which is following the base spec, is carrying a buggy
update to your 99 other RainbowPoop TORs and they are all failing. Your
JUNIPER, which knows about NewVpnThing, is following its spec and
protecting the other TORs from this problem, but that is probably not
much help since your network is in chaos from all the flapping.

What do you do? Call vendor support. Probably for CISCO and RainbowPoop.
Well, now you are expecting the TAC of Cisco and the TAC of RainbowPoop
to cooperate, which they'll have trouble doing; and it may take ages
before anyone identifies that the root cause of the problem is really
TOR1.

There are going to be a lot of RainbowPoop routers in the future, and
many of them may use BGP. We should make BGP more robust.

-- 
Jeff S Wheeler <[email protected]>
Sr Network Operator / Innovative Network Concepts
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
