[quagga-dev 12324] Re: [PATCH 2/5] bgpd: strip incorrect Graceful Restart R-bit code

David Lamparter Sun, 17 May 2015 06:21:46 -0700

On Fri, May 15, 2015 at 02:02:10PM +0100, Paul Jakma wrote:
> On Fri, 15 May 2015, David Lamparter wrote:
> > Ooooh.  I think I found the source of our different approaches.
> 
> Yes, you're designing to minimise CPU churn. I had an eye on the network 
> churn, which is what GR was intended to do. From the introduction:


Your notion that I'm primarily concerned about CPU churn is a rather
erisian stratagem.  I did note on IRC that startup CPU churn is a bad
issue, particularly on routeservers.  Extending that into a dichotomy of
CPU vs. network churn is rather underhanded.

I would appreciate if we could refrain from argumentative tactics like
this and your prior event.

I'm concerned about both CPU and network churn, and I believe full GR as
specified in the RFC and implemented in the patchset I mailed does a
good job at both.  Particularly, I think it provides a better network
churn reduction result than the approach your patch is suggesting.

>    "Usually, when BGP on a router restarts, all the BGP peers detect that
>     the session went down and then came up.  This "down/up" transition
>     results in a "routing flap" and causes BGP route re-computation,
>     generation of BGP routing updates, and unnecessary churn to the
>     forwarding tables.  It could spread across multiple routing domains.
>     Such routing flaps may create transient forwarding blackholes and/or
>     transient forwarding loops.  They also consume resources on the
>     control plane of the routers affected by the flap.  As such, they are
>     detrimental to the overall network performance."

And there's Non-Stop Forwarding, i.e. keeping forwarding state, to get
rid of that flap entirely.  That only works if the router is doing a
reasonably controlled reboot (either by user command, or by a
well-designed crash/panic handler).  Both of these are single-system
short-time events.

> > It has the side effect of doing something useful for startup, but it
> > isn't trying to combat generic churn from router outages.
> 
> The RFC does seem to be concerned with minimising network churn, see the 
> intro. The ordering of table dumps, and making the restarted router 
> converge first is intended to help with that.

Of course it's concerned with minimising network churn.  Just not
generically in any random situation.  It's the specific situation of a
single router restart that it tries to make nicer.

> It also mentions unplanned outages.

... once, in a subordinate clause about forwarding state, and it's
saying one might not want to set "forwarding state preserved" flag after
an unplanned outage.

(really?)

> > RFC 4724, Section 4.1:
> >
> >                                  [...]  However, it MUST defer route
> >   selection for an address family until it either (a) receives the
> >   End-of-RIB marker from all its peers (excluding the ones with the
> >   "Restart State" bit set in the received capability and excluding the
> >   ones that do not advertise the graceful restart capability) or (b)
> >   the Selection_Deferral_Timer referred to below has expired.  It is
> >   noted that prior to route selection, the speaker has no routes to
> >   advertise to its peers and no routes to update the forwarding state.

(I see no opposition/counterargument to the RFC prescribing global
update corking?)

> > You're designing a different approach.  I would suggest you take it to
> > the idr WG @ IETF.
> 
> Except that's what the code has now.
> 
> The update-delay removes a nice convergence feature.

It's a bug, and here's why.

1.  it generates network churn by exiting cork mode on a per-peer base,
which is simply too early:

Assume you have a router starting up with 4 peers.  They all differ in
their speed & processing power:

"<" = session establishment, numbers = routes/updates, ">" = EoR
A     <123456789>
B    <1 2 3 4 5 6 7 8 9>
C          <1  2  3  4  5  6  7  8  9>
D                 <123456789>

A will start receiving updates quite early, in particular at a point
where it is the only source for everything numbered 6 and above.  It
can, worst-case, receive 3 updates about prefixes 6789;  it can receive
2 updates about prefixes 345.

This only gets worse when there are more peers, and worse yet of someone
doesn't send updates in ascending table order.


2.  it doesn't work with non-stop forwarding.

keeping forwarding state would translate on Quagga to a zebra & bgpd
restart while keeping routes in the kernel untouched.  The implication
here is that other routers will keep the previous state in their table
and replace it only when they receive incoming updates, finally removing
stale nonrefreshed entries on seeing EoR.

This isn't possible to do on a per-peer level.  We can't send our
current table to one peer - sending that table means we must have the
state installed, or we're lying/actively breaking BGP! - while still
claiming to another peer that we're using our pre-restart state.


> > +bgp_update_delay_active (struct bgp *bgp)
> > +{
> > +  if (bgp->t_update_delay)
> > +    return 1;
> > +  return 0;
> > +}
> 
> The timer is a max-delay (potentially quite long). Until then it acts as a 
> flag that may be disabled by changes to the state (sensible).

The spec indirectly caps it through the restart time field which is 12
bit and in seconds.  That comes down 68 minutes.  What it actually
suggests "is a value less than or equal to the HOLDTIME carried in the
OPEN".  That'd be quite short.

While there is no direct connection between that field and the timer, I
would argue there is a logical one since one'd be waiting for a longer
time than any other system would possibly be keeping NSF state for one.

(This is what I mean when I say "The R bit is intended to be based on
wallclock time.")

> I don't have a problem with that. Behaving sensibly based on state is 
> precisely what I was arguing for. I don't think there's a good way to pick 
> the max-delay though, nor do I think it necessary to stall all UPDATEs.
> 
> > then I really don't know what glasses you're reading my mails with.
> 
> The original timer wasn't sensible. Making it state-based is sensible to 
> me. I thought we were having a discussion about going in that direction.

Apparently I failed in communicating my intent to put together full GR.

Also, I'm made the assumption that a timer with an early expiry option
is still a timer.  That's, unfortunately, the way this associates in my
brain, particularly when I'm thinking about a short-lived timer.

I thought my mail Date: Tue, 5 May 2015 17:32:25 +0200 was reasonably
understandable, particular with these two paragraphs:

# (I'm not saying the current code is correct -- as you may have noticed
# I spotted a mistake in my understanding of the mechanism, so I need to
# re-check whether the current behaviour is valid, broken, incomplete,
# or something else.)

and

# After all, the entire point of the entire thing is cut short the wait
# for the expiry of the deferral timer, by merit of having all sessions
# up and EoR'd.  (or R=1, or !GR.)  Worst case it will self-fix by
# expiry of the deferral timer [in a correct implementation].

> > I'm still arguing we should implement RFC 4724.  You seem to be arguing
> > we should try and design a new churn-reduction scheme.
> 
> It's not new - it's in the existing code. To remove it would be a 
> convergence regression.

(See earlier comments about this being a bug, not a feature.)

> What matters here is that on the wire the "Restarting Speaker" is seen to 
> wait for the EoR of the "Receiving Speaker", and that no speaker gets into 
> deferal dead-locks because of this feature.

To reiterate:  what matters here is waiting per-AF to get a reasonably
complete set of information, before sending out updates.  This is
per-AF, not per-peer.

> > That's just abusing our users' networks for testbeds.
> 
> "Send UPDATEs without artificial delays when it's perfectly acceptable" is 
> abusing people's networks?

Please stop twisting my words to your liking.  I complained that it
isn't acceptable to push experimental, undocumented, unreviewed new
protocol inventions into Quagga.  That would be using our users'
networks as testbeds.

Unfortunately, as pointed out in the "bug, not feature" section, it's
not "perfectly acceptable" to send these updates;  in fact it will
generate extra churn.

Had this different, new churn-reduction approach been passed through 1)
writing a spec, and 2) posting that spec to the IETF idr WG, I believe
the issue would very likely have been caught.

I wonder if Zebra changed their behaviour in the time since we picked up
that code from them.


-David

_______________________________________________
Quagga-dev mailing list
[email protected]
https://lists.quagga.net/mailman/listinfo/quagga-dev

[quagga-dev 12324] Re: [PATCH 2/5] bgpd: strip incorrect Graceful Restart R-bit code

Reply via email to