On Thu, Jan 3, 2013 at 5:31 PM, Tony Li <[email protected]> wrote:
> A syntactic error would be any inconsistency of length fields,
> incompatibility between a type and a length, or any other error that would
> introduce any doubt about the parsing of the message. Once this type of
> error has occurred, then the remainder of the data stream is wholly in doubt.
> A session reset seems like the only (conservative) way of handling this, as
> it's unclear that further data would be accurate.
>
> I'm very open to a discussion of alternatives to session flap for these
> cases. Should we require manual intervention before session restart? This
> would seem reasonable.

I've already posted an alternative. The BGP MARKER makes it possible to
recover Message framing. I'm not sure I think this is a good idea. I am
sure I would wish for it at 2am on a Saturday morning, if it fixed my
problem. So if my vendor supplied it as an option, I might be glad for
that someday.

> Semantic errors would be consistency violations within the contents of the
> message. In these cases, treat-as-withdraw seems reasonable.

The current draft for "treat-as-withdraw" has a great deal of complexity.
You seem to both favor this solution, and hate the complexity. I also
hate the complexity, and "ignore-bad-message" is my alternative.

Would you rather the draft simply eliminate a lot of the complexity? I
would, because the complexity is what creates more opportunities for
session-reset. It also may catalyze bugs.

None of these are "nerd knobs" when they fix your operational problem and
allow you to stop bleeding money. TCP Path MTU Detection for BGP is a
"nerd knob." Yet, there it is, along with plenty of other things that
might tweak your router a bit; but they're not there to save your butt in
a crisis. That's what error-handling should do -- save your butt.

On Thu, Jan 3, 2013 at 5:54 PM, Smith, Donald <[email protected]> wrote:
> Other than getting it noticed I doubt a reset will ever make a misbehaving
> router stop misbehaving.

Maybe not, but I've certainly seen it solve problems. For example, Cisco
6500/7600s used to send larger-than-MTU Ethernet frames to eBGP neighbors
if TCP MD5 for BGP was in use. I had an eBGP session to a transit network
which would fail every few days because the Cisco box would repeatedly try
to send an illegally-large frame, my router would discard the frames, and
the Cisco would keep trying to retransmit it. Eventually my HoldTime would
expire. As you might imagine, this took quite some time for us to
diagnose; but at least the auto-reset caused the network to return to a
good state pretty quickly.

> This is what happened recently.
> ALU announced attributes with 1 bit set that
> others didn't understand. If the routers that didn't understand that bit
> could have ignored just that element everything would have been fine.

Actually the problematic routers that reset their sessions "MUST have
ignored [this bit] when received," according to RFC4271, but they didn't
do that. So the reason why those routers broke is that THEY were buggy
AND the originating router was buggy.

It is also arguable that routers in the DFZ should not have propagated
this 1 bit. Some routers didn't while others did.

So if the receiving routers weren't buggy in the first place, they would
not have experienced any problem. However, ignore-bad-message or
treat-as-withdraw would be a good catch-all that an operator could use,
even if his router was buggy, to keep his network functioning.

On Thu, Jan 3, 2013 at 6:14 PM, Tony Li <[email protected]> wrote:
> The entire problem with the ignore option is that it leaves bogus information
> in the network. Suppose that the update contains an AS path change for a
> prefix. If you ignore the update, then you ignore that path change, AND, you
> fail to propagate that change to your upstream neighbors. Now, BGP's loop
> prevention mechanism is out the window and, in the worst case, you've created
> an inter-domain forwarding loop.
>
> If, on the other hand, you withdraw that prefix, then any alternate
> connectivity for that prefix can come into play.
>
> In short, treat-as-withdraw is a fail safe approach. Ignoring updates is not.

Treat-as-withdraw is NOT fail-safe. It can create loops or blackholes in
your network. Claiming over and over that it is safe will not make it
true.

Ignore-bad-message has bad qualities, which are just as bad as
treat-as-withdraw, except it is less complicated to implement and there is
a greater chance it can avoid a potentially crippling session-reset loop.

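To make the MARKER-based framing recovery mentioned earlier in this message
concrete, here is a minimal Python sketch. It relies only on the RFC 4271
message header layout: a 16-octet marker of all ones, a 2-octet length
between 19 and 4096, and a 1-octet type between 1 and 4. The function name
`resync` and the constant names are hypothetical and do not come from any
draft or implementation. A payload that happens to contain sixteen 0xFF
octets followed by a plausible length and type would fool this scan, so it
is an illustration of the idea, not evidence that the idea is safe.

```python
# Sketch only: names below are illustrative assumptions, not a real API.
MARKER = b"\xff" * 16        # RFC 4271: marker field is all ones
HEADER_LEN = 19              # 16 (marker) + 2 (length) + 1 (type)
MAX_LEN = 4096               # RFC 4271 maximum message length
VALID_TYPES = {1, 2, 3, 4}   # OPEN, UPDATE, NOTIFICATION, KEEPALIVE


def resync(buf: bytes, start: int = 0) -> int:
    """Return the offset of the next plausible BGP header at or after
    `start`, or -1 if none is found in the bytes available so far."""
    pos = buf.find(MARKER, start)
    while pos != -1:
        if pos + HEADER_LEN > len(buf):
            return -1  # marker seen, but the header is not complete yet
        length = int.from_bytes(buf[pos + 16:pos + 18], "big")
        msg_type = buf[pos + 18]
        if HEADER_LEN <= length <= MAX_LEN and msg_type in VALID_TYPES:
            return pos
        # Not a believable header; try the next candidate position.
        pos = buf.find(MARKER, pos + 1)
    return -1


# Usage: three junk octets followed by a well-formed KEEPALIVE.
keepalive = MARKER + (19).to_bytes(2, "big") + b"\x04"
stream = b"\x00\x01\x02" + keepalive
print(resync(stream))  # prints 3
```

Checking the length and type after the marker is a cheap way to reject
candidate positions inside a longer run of 0xFF octets; it does not remove
the ambiguity described above.
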
-- 
Jeff S Wheeler <[email protected]>
Sr Network Operator / Innovative Network Concepts
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
