On Jan 3, 2013, at 2:35 PM, Michael Long wrote:

> 
> On Jan 3, 2013, at 10:00 AM, Tony Li <[email protected]> wrote:
>> 
>> 
>> All of the marketing that you're doing here is positioning this as a 
>> 'solution'.  It's not.  Yes, it will stop the flap, but it does NOTHING to 
>> fix or deal with the underlying bug.  All it does is gloss it over, and as 
>> such, it will have implications in the field whereby this papers over real 
>> bugs and we have now promoted BGP errors into RIB errors.  That's NOT making 
>> things easier to debug, that's just applying a band-aid.
> 
> I understand what you are saying and I agree 100%; however, from my 
> operations perspective the "fix" is the same. Either upgrade to fixed code or 
> policy out the offending announcement. I would rather deal with a customer 
> routing issue than a frantic call from our NOC saying 15+ att peers globally 
> are bouncing. The latter has a much bigger impact on our network. 

I'm very concerned with the case of ignoring a route update and having a 
month-long discussion about why some route is missing from the $carrier_a 
network when it's being sent from $carrier_b and they show it going out just 
fine.

You don't know there's an issue until someone reports it, and the long tail to 
problem resolution takes forever.

> I can live with a couple of /24's not working for a few customers. I can't 
> have 15+ peers bouncing because of bad updates and even more peers bouncing 
> because of missed keepalives due to the CPU being pegged trying to deal with 
> 15 peers bouncing globally. 

While related, this is an implementation defect on the part of vendors and 
their poorly optimized TCP and BGP implementations being unable to get their 
basic job done.  I recall vendors blaming our "slow" system CPU then finally 
fixing their logic defect that always returned 1 or 0 when it thought it was 
idle.  (sometimes those if statements look really complex).

>> A more constructive way to address the real problem here would be to talk 
>> about whether we should even re-establish the session after an error.  Long 
>> ago, we made an implementation decision to simply retry.  That would seem to 
>> be the real issue at hand.
> 
> I would back this provided there is adequate logging as to why the session is 
> down. It would be much like tripping max-prefixes, where we could hard clear a 
> single session for debug. I could live with this. 

I certainly agree there needs to be better logging from the vendors.

I remain convinced that attempts to address this problem will create more 
complex situations rather than provide the desired result of a stable BGP core.

- Jared
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
