[quagga-dev 12326] Re: [PATCH 2/5] bgpd: strip incorrect Graceful Restart R-bit code

Paul Jakma Sun, 17 May 2015 10:36:07 -0700

On Sun, 17 May 2015, David Lamparter wrote:

good job at both. Particularly, I think it provides a better network churn reduction result than the approach your patch is suggesting.

I personally think that assertion might not be right. We can work through examples to figure it out. With a bit of time, I might be able to run simulations to determine this exactly.

Note that the fine-grained approach allows more grouped, even global, approaches to be implemented on top. If you can defer to one peer, you can defer to many together too.

The converse isn't true. If the currently implemented per-peer deferment capability is completely removed, it's gone - there's no way to get it back.

And there's Non-Stop Forwarding, i.e. keeping forwarding state, to get rid of that flap entirely. That only works if the router is doing a reasonably controlled reboot (either by user command, or by a well-designed crash/panic handler). Both of these are single-system short-time events.


We don't do NSF.

Further, the global delay of route-calculation, for /all/ peers, is likely ungood for the routing of the local speaker even with NSF, IMO.

The update-delay removes a nice convergence feature.


It's a bug, and here's why.

1.  it generates network churn by exiting cork mode on a per-peer base,
which is simply too early:

Assume you have a router starting up with 4 peers.  They all differ in
their speed & processing power:

"<" = session establishment, numbers = routes/updates, ">" = EoR
A     <123456789>
B    <1 2 3 4 5 6 7 8 9>
C          <1  2  3  4  5  6  7  8  9>
D                 <123456789>

A will start receiving updates quite early, in particular at a point
where it is the only source for everything numbered 6 and above.  It
can, worst-case, receive 3 updates about prefixes 6789;  it can receive
2 updates about prefixes 345.

This only gets worse when there are more peers, and worse yet if someone
doesn't send updates in ascending table order.
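
One way to read the counts in the example above, as a rough Python sketch rather than anything taken from bgpd: for each prefix, count how many other peers have not yet sent it when A reaches EoR, on the assumption that each such late arrival could trigger one more update towards A in the worst case. The names `TIMELINES`, `arrival_columns` and `late_arrivals` are made up for illustration, and each diagram column is treated as one time step.

```python
# Timelines copied from the diagram above, with the peer label column removed.
# "<" = session establishment, digit = route for that prefix, ">" = EoR.
TIMELINES = {
    "A": "     <123456789>",
    "B": "    <1 2 3 4 5 6 7 8 9>",
    "C": "          <1  2  3  4  5  6  7  8  9>",
    "D": "                 <123456789>",
}


def arrival_columns(line):
    """Map prefix number -> column (time step) at which the peer sends it."""
    return {int(ch): col for col, ch in enumerate(line) if ch.isdigit()}


def late_arrivals(timelines, peer):
    """For each prefix, count the other peers that send it at or after
    the moment `peer` reaches EoR (assumption: one possible extra update
    towards `peer` per late arrival, worst case)."""
    eor = timelines[peer].index(">")
    others = {p: arrival_columns(l) for p, l in timelines.items() if p != peer}
    return {
        prefix: sum(1 for cols in others.values() if cols[prefix] >= eor)
        for prefix in arrival_columns(timelines[peer])
    }


print(late_arrivals(TIMELINES, "A"))
```

Under these assumptions the sketch gives 3 for prefixes 6 to 9 and 2 for prefixes 3 to 5, which matches the figures quoted above.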


Why is this worse?

Remember, in general, you have:

  0...n
  S S S
   \|/
    R
   /|\
  R R R
  0...m

Sending updates from R to S, before the R sub-set has converged, is indeed bad. It would allow those 3 updates to escape back to the S-set. However, no one is proposing to allow that to happen. ;)

Sending updates ASAP from R to R is *good* - you /want/ the R-subset to converge as quickly as possible.

Ideally for convergence, to my knowledge anyway, you want the R-subset to fully converge on the state from within the R-subset, and the state from S. However, there is no good way to detect when that has happened with current BGP (and even if you could change BGP as you wished, it's still nearly always a subjective question to ask from within a distributed, non-global, network protocol - unless you manage to get your implementation libomniscience fully debugged ;) ).

In lieu of that, you want to propagate state within the R-set as /fast/ as possible. So that the R-set at least has a chance of converging internally, before any R-router has all its S-router neighbours finish sending - at which point routing information will flow from that R to that S and on.

E.g. with the update delay patch, the following will happen. The R-routers send EoR to all peers immediately. On receipt of which, those R-routers which have only other R-router peers go /out/ of update-delay mode. E.g. in the example above, say that all m of the R-neighbours in the lower row have no other S-peers. They become S-peers of the central router:


  0...n
  S S S
   \|/
    Ra
   /|\
  S S S
  0...m

I.e. there are now m...(n+m) S neighbours, the 0...n original S-set of the top-row, and the 0...m 'false' S-set of the lower row.

However, the central router still does not run its route-selection and still does not pass information on to any of the previously-R-and-still-incomplete peers of the lower-row.

This means that there could be routing information within the original R-subset that is sitting, bottle-necked, at Ra, causing blackholes or loops.

Eventually the original set of S-peers (the top-row 0...n ones) finish sending, and finally Ra can go out of update-delay. Only now can any routing information bottle-necked there propagate on to the lower-row S-set (which aren't really 'stable' by the other criteria you gave, of a full table, as they still could have a partial, even empty RIB).

If instead you do it at a more fine-grained level, i.e. if you *don't* block routing selection-and-propagation between the R-set, then this happens instead. You still get the lower row going "S" - as far as Ra is concerned. However, a distinction is still clear between the original, "true" S-set peers and the "false" S-peers:


  0...n
  S S S
   \|/
  --R--
  |   |
  --S--
   /|\
  S S S
  0...m

In my patches it was even more fine-grained at a peer-specific level, but I never had an objection on grouping it a bit more to the above (some of my earlier mails mentioned "stable" and "restarted" domains, cause I was intending to work toward that, before the thread seemingly went off the rails). There's a good discussion to be had on what kind of criteria would be sensible, if we could get there. ;) E.g. it's quite feasible to group them as above.

Now, route-selection still occurs, and UPDATEs flow happily between the lower peers - aiding convergence in these.

However, so long as the upper "true" S-set peers are still not finished sending, UPDATEs do *not* flow towards them. The router defers updates to S-set peers (per-peer or as a set), until all routing information has come in from them. Which is exactly the wire behaviour of what the GR RFC was intended to achieve.
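
The two granularities described above can be sketched roughly as follows, in Python for brevity. This is an illustration only, not bgpd code: the `Peer` class, its `restarted` flag and both function names are hypothetical, and the sketch assumes a peer belongs to the "true" S-set exactly when it is not itself restarting.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    restarted: bool             # hypothetical: peer is in the R-set
    eor_received: bool = False  # End-of-RIB seen from this peer


def may_send_per_peer(peer):
    """Peer-granular deferral: R-set peers are never delayed; an S-set
    peer gets UPDATEs once its own table dump has finished (EoR)."""
    return peer.restarted or peer.eor_received


def may_send_as_set(peer, peers):
    """Set-granular deferral: R-set peers are never delayed; S-set peers
    are all held back until every S-set peer has sent EoR."""
    if peer.restarted:
        return True
    return all(p.eor_received for p in peers if not p.restarted)


peers = [
    Peer("S0", restarted=False, eor_received=True),
    Peer("S1", restarted=False, eor_received=False),
    Peer("R0", restarted=True),
]
for p in peers:
    print(p.name, may_send_per_peer(p), may_send_as_set(p, peers))
```

In both variants route-selection keeps running and the R-set is updated straight away; only the moment at which UPDATEs start flowing towards S-set peers differs.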

(The GR RFC does not mention the restart-restart case specifically at all. Indeed, you have to thread together text from different parts of the RFC to figure out that case, with a dose of implication required. Note that nowhere does the GR RFC explicitly say that a restarted peer must defer sending updates to another restarted peer.)

2.  it doesn't work with non-stop forwarding.
keeping forwarding state would translate on Quagga to a zebra & bgpd restart while keeping routes in the kernel untouched. The implication here is that other routers will keep the previous state in their table and replace it only when they receive incoming updates, finally removing stale nonrefreshed entries on seeing EoR.

First we'd have to get bgpd restart, with zebra still running, working nicely.

This isn't possible to do on a per-peer level.  We can't send our
current table to one peer - sending that table means we must have the
state installed, or we're lying/actively breaking BGP! - while still
claiming to another peer that we're using our pre-restart state.

Being able to defer UPDATEs at a more fine-grained level than a global condition still permits deferring at a global level, if local NSF required a global condition.

OTOH, if you remove the ability to defer at a fine-grained level, so UPDATEs can only be deferred globally, then obviously you've lost the ability to do the former.

The finer-grained level doesn't stop global-condition deferment, while removing the finer-grained capability and stopping the whole route-selection process clearly is less flexible.

(This is what I mean when I say "The R bit is intended to be based on
wallclock time.")

So, I really hate timers. Sometimes they're unavoidable, but I really prefer to have something more sensible, based on the facts of what has been happening.

For a global delay, you cannot avoid a bounding timer. Otherwise, if 1 peer in the condition-set comes up slowly, /all/ peers are delayed because of that 1 peer. So then you must have a bounding timer.

With this global delay it then becomes very difficult to enable GR by default. Because there'll never be a timer that suits everyone. Too high, and it'll mostly work without causing issues for anyone - except those odd times when some peer comes up slowly, and your routing stays broken for longer than it should do. That kind of behaviour is very annoying to deal with as an admin.

Too low, and the timer will fire before most sessions are up, and bgpd will stop deferring and start sending UPDATEs to S-peers well before most have finished sending routes.

Realistically, we'd have to set the bounding timer really low, if we were to enable this by default. Because that degrades to "normal BGP" in the worst case, which would be less of a regression than "your routing takes much longer to come up than normal, every now and then".
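
To make the timer trade-off concrete, here is a small Python sketch under stated assumptions: a global delay ends either when every peer has sent EoR or when the bounding timer fires, whichever comes first. The function names and the EoR times are invented for illustration and all times are in arbitrary units.

```python
def global_release_time(eor_times, bound):
    """Moment the global update-delay ends: all EoRs in, or timer fired."""
    return min(max(eor_times.values()), bound)


def unfinished_at_release(eor_times, bound):
    """Peers that were still sending their table when the delay ended."""
    release = global_release_time(eor_times, bound)
    return sorted(p for p, t in eor_times.items() if t > release)


# Hypothetical EoR times: two quick peers and one slow one.
eor_times = {"p1": 20, "p2": 30, "p3": 300}

# Timer too high: everybody waits for the slow peer.
print(global_release_time(eor_times, bound=600))    # 300
# Timer too low: deferral ends before any peer has finished.
print(unfinished_at_release(eor_times, bound=10))   # ['p1', 'p2', 'p3']
```

With the high bound the two quick peers are held for 300 units because of p3; with the low bound the delay degrades to plain BGP behaviour.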

With peer-specific GR, we don't need to pick a timer. Sessions come up like this, with S as the start, F as the finish of the table-dump/EoR, '-' as an UPDATE.


1  S-----------------F -   -    -- --  -   -  --- -  - - - --- - --
2           S--------------F - --- -- - -- -- --- -- -- --- -  - -

3 S---------------F- - --- --- -- --- -- -- - -
4 S-----------------F - --- -
etc.


Peer 1 can receive updates from t=1F.

Peer 2 can only receive updates starting from t=2F, consisting of the UPDATEs on prefixes received from 1, 2, 3.


Peer 3 can only receive updates from t=3F, on the set from 1,2,3

In the best case there's a good bit of overlap on most peers for the initial S-F dumps, and much network-level UPDATE churn is avoided.

In the worst case there's no overlap. The worst-case delay for any peer at all can be no worse than the time it takes that peer to send its RIB.
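
A rough Python sketch of that bound, with invented numbers: each session has a start time S and a finish time F, and the wait before the local speaker may send UPDATEs to a peer is compared for peer-granular deferral and for a global delay with no bounding timer. `SESSIONS`, `per_peer_wait` and `global_wait` are illustrative names, not anything from bgpd.

```python
# Hypothetical (S, F) times per peer, in arbitrary units.
SESSIONS = {"1": (2, 20), "2": (11, 26), "3": (1, 17), "4": (1, 19)}


def per_peer_wait(sessions):
    """Peer-granular deferral: each peer waits only for its own dump (F - S)."""
    return {p: f - s for p, (s, f) in sessions.items()}


def global_wait(sessions):
    """Global delay without a timer: every peer waits for the last F of all."""
    last = max(f for _, f in sessions.values())
    return {p: last - s for p, (s, _) in sessions.items()}


print(per_peer_wait(SESSIONS))  # {'1': 18, '2': 15, '3': 16, '4': 18}
print(global_wait(SESSIONS))    # {'1': 24, '2': 15, '3': 25, '4': 25}
```

In this sketch no peer ever waits longer under peer-granular deferral than under the global delay, and the peer-granular wait depends only on that peer's own table dump.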

However, this still *always* avoids the churn where the local speaker sends an UPDATE for a prefix where it would prefer the remote peer's path - because the remote peer will still *always* send its path first. So the local speaker /always/ will have that remote speaker's path (if any) in its RIB when deciding, for any prefix it could send to that remote speaker.

It is *guaranteed*, even with this peer-granular deferral, that as routes come in, once the local speaker sends an UPDATE to the remote speaker, the local speaker will not send a withdraw for that prefix to that remote peer - even as updates come in from other peers later (assuming no more disconnects anywhere, and ignoring various oscillation pathologies in BGP that GR does not and cannot fix).

With a more clumped grouping, e.g. the original-S-peers and original-R-peers above, you could guarantee no routing information can flow from the R-set to the S-set, until all the S-set peers have finished sending. However, you would then still need a bounding timer - but it could at least still allow the R-set to converge freely, without delay.

Note, the finer-grained approaches assume you can still run route-selection, so you can send stuff to the set you're not applying delays to. That aspect could cause more CPU churn. That's trading off network convergence goodness for CPU churn. That can be fixed by setting the granularity to global though.

Apparently I failed in communicating my intent to put together full GR.


Communicating complex things is difficult.

There's been a lot of communication failure, from both sides, in this thread. Though, I'd assert some of it started in the very first reply to my patches, which didn't help in setting the tone. I'm quite bad at responding like-with-like, particularly.

There are techniques we could adopt to try to avoid this and make things better for everyone.

"Send UPDATEs without artificial delays when it's perfectly acceptable" is
abusing people's networks?
Please stop twisting my words to your liking. I complained that it isn't acceptable to push experimental, undocumented, unreviewed new protocol inventions into Quagga. That would be using our users' networks as testbeds.


It's not. It's normal BGP! :)

If speaking normal BGP and sending UPDATEs to a peer without delay (as per normal BGP) is using users as test-beds we'd better stop shipping bgpd. :)

Unfortunately, as pointed out in the "bug, not feature" section, it's not "perfectly acceptable" to send these updates; in fact it will generate extra churn.


What extra churn though? There is no extra churn relative to BGP-4.

Further, even with the peer-specific granularity, even in the worst case, it filters out the worst transients (sending a route that the remote-peer has a better path for before you've got it, leading to UPDATE-then-WITHDRAW to that remote peer), and the maximum-delay for any remote peer is no longer than the time it takes that remote-peer to send its RIB to you.

The peer-granularity way could be set as default, because it can do pretty much no harm, relative to normal BGP, while still filtering out some churn.

It can also be made to accommodate a more clumped approach (either into two sets so the R-set-peers can be updated immediately, or global-deferment), though that will need a timer.

If the global update-delay way is implemented as a queue-plug on all processing, then we can never have the fine-grained way. While the fine-grained way still allows for the global-delay way.

Note, we can have other criteria. E.g. timers are difficult to pick good values for. Maybe there are other criteria (could even extend capabilities to exchange some preliminary data to help with this, if that would actually be useful).

Had this different, new churn-reduction approach been passed through 1) writing a spec, and 2) posting that spec to the IETF idr WG, I believe the issue would very likely have been caught.

It's compatible with the RFC, as far as any single remote speaker is concerned. At worst, it degrades to "plain old BGP" - exactly as update-delay does if the timer happens to be too low.

RFCs do not specify internal implementation. They specify what is required for interoperability. They may specify a model for state, but that is done to define the semantic meaning of network messages. That does *not* mean every internal implementation must exactly follow the given state model though!

Otherwise, any BGP implementation that doesn't keep a separate AdjIn, AdjOut and LocRIB can never be BGP. Which surely isn't true. E.g. Quagga does not have a separate AdjIn. I'm sure other major implementations don't follow the BGP RFC state model to the letter either.

Also, nothing stops us from implementing and describing this relaxation of GR, which is "You don't need to apply the GR update-deferral to peers where it isn't needed, if you prefer fast convergence - just send straight away as per normal BGP".

BGP can and does evolve, at least sometimes because vendors implement compatible optimisations and then describe them. Or in this case, a "don't apply an optimisation to cases where it doesn't help".

I wonder if Zebra changed their behaviour in the time since we picked up that code from them.

GNU Zebra? Don't think there was any change there before it disappeared. IPInfusion ZebOS: no idea.


regards,
--
Paul Jakma      [email protected]  @pjakma Key ID: 64A2FF6A
Fortune:
Your heart is pure, and your mind clear, and your soul devout.

