Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Jeff Wheeler Thu, 03 Jan 2013 18:36:22 -0800

On Thu, Jan 3, 2013 at 12:40 PM, Jared Mauch <[email protected]> wrote:
> Not sure who you're asking, but I think this whole draft is an idealist 
> attempt to workaround software defects that may be uncorrectable as a whole.  
> (I say this as a $large_operator as measured here: 
> http://as-rank.caida.org/?mode0=as-ranking&n=10&ranksort=1 )
>
> The missing prefix because it was ignored, or routing loop because a withdraw 
> was ignored will quickly change these folks minds.

Yes, the existing draft contains a great deal of complexity to work
around specific problems that have been imagined.

It would be very simple to just ignore bad updates.  That is not
complicated.  It's not "good" but it's less bad than your network
being down because you received one bad update from your DFZ neighbor,
which in the case of a small network, might be one or all of their
transit providers.

Having recently had to support networks that were down because of this
condition, I understand their pain: they were down with no hope of
being back up until external parties helped them.

> Methinks you underestimate the complexity that would be added to the error 
> handling code.  While finding a marker/0xff may be easier, understanding the 
> large block of updates in the flood of activity and low latency of large tcp 
> windows make this much harder and more prone to error.

I don't think I underestimate it at all.  I think the complexity
introduced by the error-handling draft is a bad idea.  I know that
"ignore bad updates" is extremely simple and a very broad catch-all
tool which can be utilized if there are no better options.  Having
your network be down because you can't keep BGP sessions to your
transit providers established is not an option.

> For some operators, the only chance to workaround defects is to have 
> something catastrophic happen to provide the justification to management to 
> actually pick up the $new_software that corrects the problems you've been 
> applying band-aids to.

Your position is that you don't want a feature that can mitigate
problems, because it would provide you with an opportunity to mitigate
problems instead of fix them?  Do you have to sabotage your network to
get problems fixed or capacity upgraded, too?

> I'm generally in the more-rope camp, but this is effectively throwing out 
> years of well worn code path and introducing new code (and likely defects) in 
> the handling of error cases which can only make things worse.  There wasn't a 
> broad-reaching bgp attribute problem in 2012 that i'm aware of, putting us in 
> the "there is stability in the core" camp.  I'm seeing a well intentioned but 
> unneeded element of meddling here.

LANL's announcement of invalid paths caused about 3000 prefixes to
instantly disappear from the DFZ.  This lasted for several hours.
Those missing prefixes represent only the networks that had no working
transit at all.  Some small networks that I worked with were affected
on only some transit sessions and not others.  The effect of this is
difficult to measure, but almost 1% of the DFZ disappearing because of
one bad update demonstrates just why BGP is a very vulnerable single
point-of-failure.

> People on the edge also need to learn to maintain their devices.  This isn't 
> a standards body issue, and IMHO off-topic for here, but an important 
> datapoint.

While true, if their vendor hasn't identified and corrected a bug yet,
they don't have any software upgrade option.  I believe this was
recently the case for Alcatel during the same event caused by the LANL
announcements.

On Thu, Jan 3, 2013 at 1:00 PM, Tony Li <[email protected]> wrote:
> I'm sure if you asked folks if they wanted anything that helped them and 
> completely ignored the costs, they would say yes.  However, if you want to 
> make a reasonable, rational, and justified decision, you must consider the 
> implications of the request.  That's what I'm asking.  Consider implementor 
> input as well, because there are some practical considerations here.

Do you think the error-handling draft is too complex to be worth
implementing?  I do.  That is exactly why I suggest "ignore bad
messages" as an alternative.  It is not complicated.

> Respectfully, I think you're misunderstanding my position completely.  My 
> point is that a reasonable implementation cannot possibly live up to the 
> expectations that you're setting up here.  To be specific, once an 
> implementation loses the syntactic parsing of the data stream, realistically, 
> the session is corrupt and an eventual reset is inevitable.  Or, in other 
> words, BGP cannot possible ignore bad messages.  That's not the way it works.

Of course BGP can ignore bad messages.  To say otherwise is simply
telling a lie because you haven't made a good argument.

There are three types of situations worth considering.

1) message is bad and we don't yet know whether Message Length is bad
too, because the next MARKER hasn't arrived
This is very easy to deal with.  Simply ignore the message.  If a
MARKER doesn't arrive next, reset the session.  Avoids complexity.
Your code should already be able to deal with a message that contains
nothing it understands -- for example, a message with no NLRI and
nothing but an optional, non-transitive attribute that it doesn't
recognize.

2) message is bad but Message Length is fine, and the next MARKER
appears as expected
This is also very easy to deal with.  Just ignore the message.  It is
exactly the same as the above case except perhaps you waited for the
next message to start arriving before you decided what to do about the
previous, corrupt message.

3) message is bad and Message Length is bad, so the next MARKER is not
where it should be
This is a little harder to deal with.  It may be a lot harder for a
BGP implementation that pathologically integrates TCP and BGP Message
parsing.  Perhaps some implementations would choose to try to recover
without session-reset by hoping a new MARKER will arrive soon to
"re-sync" the session.  Perhaps not.  I think the value of this is
quite questionable and highly dependent on how much work it is for the
implementor -- and what he feels are the chances of introducing new
bugs.
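
As a rough illustration of the three cases above, here is a minimal
Python sketch of "ignore the bad message, reset only if framing is
lost".  It assumes the RFC 4271 header layout (16-byte all-ones MARKER,
2-byte length, 1-byte type).  The names process(), make_message() and
body_looks_valid() are hypothetical and exist only for this example,
and the validity check is a stand-in, not real UPDATE parsing.  It
takes the simple option for case 3 and does not try to re-sync.

```python
MARKER = b"\xff" * 16
HEADER_LEN = 19   # 16-byte marker + 2-byte length + 1-byte type
MAX_LEN = 4096    # maximum message size in RFC 4271


def make_message(body: bytes, msg_type: int = 2) -> bytes:
    """Hypothetical helper: frame a body as a BGP message (type 2 = UPDATE)."""
    length = HEADER_LEN + len(body)
    return MARKER + length.to_bytes(2, "big") + bytes([msg_type]) + body


def body_looks_valid(message: bytes) -> bool:
    """Stand-in for real parsing: a body starting with b"bad" is 'corrupt'."""
    return not message[HEADER_LEN:].startswith(b"bad")


def process(stream: bytes, is_valid=body_looks_valid) -> list:
    """Walk buffered bytes and return one action per message seen.

    "accept" - message parsed fine
    "ignore" - message was bad; drop it and keep the session (cases 1 and 2)
    "reset"  - the MARKER or length is not sane, so framing is lost (case 3)
    """
    actions = []
    pos = 0
    while pos + HEADER_LEN <= len(stream):
        if stream[pos:pos + 16] != MARKER:
            actions.append("reset")   # next MARKER is not where it should be
            break
        length = int.from_bytes(stream[pos + 16:pos + 18], "big")
        if not HEADER_LEN <= length <= MAX_LEN:
            actions.append("reset")   # length field cannot be trusted
            break
        if pos + length > len(stream):
            break                     # rest of the message has not arrived yet
        message = stream[pos:pos + length]
        actions.append("accept" if is_valid(message) else "ignore")
        pos += length
    return actions
```

A bad message is ignored as soon as it is complete (case 1); whether
its length was honest only becomes known when the following bytes are
checked against MARKER on the next pass through the loop (case 2 if
they match, case 3 if they do not).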

> All of the marketing that you're doing here is positioning this as a 
> 'solution'.  It's not.  Yes, it will stop the flap, but it does NOTHING to 
> fix or deal with the underlying bug.  All it does is gloss it over, and as 
> such, it will have implications in the field whereby this papers over real 
> bugs and we have now promoted BGP errors into RIB errors.  That's NOT making 
> things easier to debug, that's just applying a band-aid.

It does do something to keep the network functioning.  Is it
functioning well?  No, of course not.  But there may be only 1 or a
few routes that are bad.

In the case of the LANL incident, 5 /24s were bad.  I can live with
not being able to reach 5 /24s that were announced with bad attribute
flags.  I can even live with a loop in my network for those /24s.  I
can't live with my network down because BGP is flapping.
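
To make "ignore the bad routes, keep the session" concrete, here is a
minimal Python sketch.  It assumes the attribute flag bits and the
well-known attribute type codes from RFC 4271.  The function names and
the dictionary used as a RIB are hypothetical, and it does not claim to
reproduce the exact malformed flags seen in the LANL incident.

```python
# Attribute flag bits from RFC 4271 (high-order bits of the flags octet).
OPTIONAL, TRANSITIVE = 0x80, 0x40

# Expected Optional/Transitive bits for attributes defined in RFC 4271.
EXPECTED_FLAGS = {
    1: TRANSITIVE,             # ORIGIN (well-known)
    2: TRANSITIVE,             # AS_PATH (well-known)
    3: TRANSITIVE,             # NEXT_HOP (well-known)
    4: OPTIONAL,               # MULTI_EXIT_DISC (optional non-transitive)
    5: TRANSITIVE,             # LOCAL_PREF (well-known)
    6: TRANSITIVE,             # ATOMIC_AGGREGATE (well-known)
    7: OPTIONAL | TRANSITIVE,  # AGGREGATOR (optional transitive)
}


def flags_ok(type_code: int, flags: int) -> bool:
    """Compare only the Optional and Transitive bits against the table."""
    expected = EXPECTED_FLAGS.get(type_code)
    if expected is None:
        return True  # unrecognized attribute: nothing to compare against
    return flags & (OPTIONAL | TRANSITIVE) == expected


def handle_update(prefixes, attributes, rib) -> str:
    """attributes is a list of (type_code, flags) pairs.

    A bad update is ignored: the RIB is left untouched and the caller
    keeps the session up.  Only the prefixes in that update are affected.
    """
    if all(flags_ok(t, f) for t, f in attributes):
        for prefix in prefixes:
            rib[prefix] = attributes
        return "installed"
    return "ignored"
```

The cost is exactly the one described above: the prefixes carried in
the ignored update may be unreachable or stale, while every other route
on the session stays in place.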

You are right, this is a band-aid.  It is a very good one when you are
bleeding money.

> A more constructive way to address the real problem here would be to talk 
> about whether we should even re-establish the session after an error.  Long 
> ago, we made an implementation decision to simply retry.  That would seem to 
> be the real issue at hand.

If you have received bad messages from all your transit providers, it
will not matter if you re-establish the sessions or not, if the bad
messages keep coming.  All your transit will be down and you'll be
bleeding money.

> Sorry, but the point of the standards body is to standardize PROTOCOL 
> changes.  Everything that has been discussed here are IMPLEMENTATION ISSUES.  
> We don't standardize those, for very good reasons.  And the vendors need zero 
> help from the IETF in making implementation issues.  If real customers want a 
> particular behavior, they can always just ask for it, as always.

Would you like a list of some standards-track documents that do not
modify protocols?

>> Related to this, what is your plan for dealing with BGP Attribute
>> re-ordering in the rewrite of RFC4760?
>
> My plan?  My personal plan is to ban the use of all MP extensions, as all of 
> that is simply evil and should be scrubbed off the face of the earth.
>
> I'll be putting this in place as soon as I'm elected Emperor of the Universe. 
>  ;-)

I agree that MP-BGP is less than ideal.  That's true of many
extensions to BGP.  Sadly, there isn't BGP-5, and BGP-5 would be
needed to clean all of this up.

The reason I ask is that the work on MP-BGP makes a specific
recommendation to re-order the attributes in a way that is the
opposite of a specific recommendation in the base BGP spec.  If other
work, such as error-handling, wants to depend on the MP-BGP
recommendation being followed, then MP-BGP should actually provide a
way for the neighbor to signal its intent to follow that
recommendation.

That can and perhaps should be done with a new Capability Code.

If that is not done, then a lot of the rules in the error-handling
draft are not useful unless error-handling itself allocates such a
Capability Code.

I don't understand the reason for adding this recommendation to MP-BGP
if you don't also think there should be a mechanism by which you and
your neighbor can agree to depend on it.
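
For illustration only, here is a minimal Python sketch of how such a
signal could ride in the OPEN message, assuming the Capabilities
Optional Parameter encoding from RFC 5492 (parameter type 2, then
code/length/value triples).  The code point 250 and all function names
are hypothetical; no such capability has been assigned.

```python
CAPABILITIES_PARAM = 2   # Optional Parameter type for Capabilities (RFC 5492)
ATTR_ORDER_CAP = 250     # hypothetical, unassigned code point for this sketch


def encode_capability(code: int, value: bytes = b"") -> bytes:
    """Wrap one capability in its own Capabilities Optional Parameter."""
    capability = bytes([code, len(value)]) + value
    return bytes([CAPABILITIES_PARAM, len(capability)]) + capability


def parse_capabilities(optional_params: bytes) -> set:
    """Return the set of capability codes found in OPEN optional parameters."""
    codes = set()
    pos = 0
    while pos + 2 <= len(optional_params):
        param_type, param_len = optional_params[pos], optional_params[pos + 1]
        body = optional_params[pos + 2:pos + 2 + param_len]
        pos += 2 + param_len
        if param_type != CAPABILITIES_PARAM:
            continue  # some other optional parameter; skip it
        i = 0
        while i + 2 <= len(body):
            code, cap_len = body[i], body[i + 1]
            codes.add(code)
            i += 2 + cap_len
    return codes


def ordering_agreed(local_codes: set, remote_codes: set) -> bool:
    """Depend on the attribute ordering only if both sides advertised it."""
    return ATTR_ORDER_CAP in local_codes and ATTR_ORDER_CAP in remote_codes
```

With something like this, a speaker would apply the ordering-dependent
rules only on sessions where ordering_agreed() is true, and fall back
to its existing behavior everywhere else.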

Also, I'd like to repeat that a lot of the rules in error-handling are
not useful unless the neighbor supports them too.  This means they
might be useful within your datacenter network but are probably not
useful toward most transit providers or peers for a long time.

> RFC 4271 doesn't require a specific ordering because it would be bad protocol 
> design.  As soon as you require ordering, some implementation is going to 
> check that ordering.  There will be more bugs that occur because the mandated 
> ordering was not followed, and more sessions will be dropped.   As always, 
> the right thing is to follow Postel's law: be liberal in what you accept.
>
> Requiring a specific ordering in order to improve error handling is a bad 
> tradeoff: you're creating a host of additional bugs so that you can try to 
> simplify error handling on another set of bugs.

I gather that your general opinion is that error-handling is bad.

In addition, you think that specifically promising to order attributes
in the way error-handling needs is also bad.  Do you think the
recommendation for ordering attributes should be removed from
RFC4760bis?

IMO this is important.

-- 
Jeff S Wheeler <[email protected]>
Sr Network Operator  /  Innovative Network Concepts
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
