On Sun, 17 May 2015, David Lamparter wrote:

good job at both. Particularly, I think it provides a better network churn reduction result than the approach your patch is suggesting.

I personally think that assertion might not be right. We can work through examples to figure it out. With a bit of time, I might be able to run simulations to determine this exactly.

Note that the fine-grained approach allows more grouped, even global, approaches to be implemented on top. If you can defer to one peer, you can defer to many together too.

The converse isn't true. If the currently implemented per-peer deferment capability is completely removed, it's gone - there's no way to get it back.

And there's Non-Stop Forwarding, i.e. keeping forwarding state, to get rid of that flap entirely. That only works if the router is doing a reasonably controlled reboot (either by user command, or by a well-designed crash/panic handler). Both of these are single-system short-time events.

We don't do NSF.

Further, the global delay of route-calculation, for /all/ peers, is likely ungood for the routing of the local speaker even with NSF, IMO.

The update-delay removes a nice convergence feature.

It's a bug, and here's why.

1.  it generates network churn by exiting cork mode on a per-peer basis,
which is simply too early:

Assume you have a router starting up with 4 peers.  They all differ in
their speed & processing power:

"<" = session establishment, numbers = routes/updates, ">" = EoR
A     <123456789>
B    <1 2 3 4 5 6 7 8 9>
C          <1  2  3  4  5  6  7  8  9>
D                 <123456789>

A will start receiving updates quite early, in particular at a point
where it is the only source for everything numbered 6 and above.  It
can, worst-case, receive 3 updates about prefixes 6789;  it can receive
2 updates about prefixes 345.

This only gets worse when there are more peers, and worse yet if someone
doesn't send updates in ascending table order.

Why is this worse?

Remember, in general, you have:

  0...n
  S S S
   \|/
    R
   /|\
  R R R
  0...m

Sending updates from R to S, before the R sub-set has converged, is indeed bad. It would allow those 3 updates to escape back to the S-set. However, no one is proposing to allow that to happen. ;)

Sending updates ASAP from the R to R is *good* - you /want/ the R-subset to converge as quickly as possible.

Ideally for convergence, to my knowledge anyway, you want the R-subset to fully converge on the state from within the R-subset, and the state from S. However, there is no good way to detect when that has happened with current BGP (and even if you could change BGP as you wished, it's still nearly always a subjective question to ask from within a distributed, non-global, network protocol - unless you manage to get your implementation libomniscience fully debugged ;) ).

In lieu of that, you want to propagate state within the R-set as /fast/ as possible. So that the R-set at least has a chance of converging internally, before any R-router has all its S-router neighbours finish sending - at which point routing information will flow from that R to that S and on.

E.g. with the update delay patch, the following will happen. The R-routers send EoR to all peers immediately. On receipt of which, those R-routers which have only other R-router peers go /out/ of update-delay mode. E.g. in the example above, say that all m of the R-neighbours in the lower row have no other S-peers. They become S-peers of the central router:

  0...n
  S S S
   \|/
    Ra
   /|\
  S S S
  0...m

I.e. there are now m...(n+m) S neighbours, the 0...n original S-set of the top-row, and the 0...m 'false' S-set of the lower row.

However, the central router still does not run its route-selection and still does not pass information on to any of the previously-R-and-still-incomplete peers of the lower-row.

This means that there could be routing information within the original R-subset that is sitting, bottle-necked, at Ra, causing blackholes or loops.

Eventually the original set of S-peers (the top-row 0...n ones) finish sending, and finally Ra can go out of update-delay. Only now can any routing information bottle-necked there propagate on to the lower-row S-set (which aren't really 'stable' by the other criteria you gave, of a full table, as they still could have a partial, even empty RIB).

If instead you do it at a more fine-grained level, i.e. if you *don't* block route selection-and-propagation between the R-set, then this happens instead. You still get the lower row going "S" - as far as Ra is concerned. However, a distinction is still clear between the original, "true" S-set peers and the "false" S-peers:

  0...n
  S S S
   \|/
  --R--
  |   |
  --S--
   /|\
  S S S
  0...m

In my patches it was even more fine-grained at a peer-specific level, but I never had an objection on grouping it a bit more to the above (some of my earlier mails mentioned "stable" and "restarted" domains, cause I was intending to work toward that, before the thread seemingly went off the rails). There's a good discussion to be had on what kind of criteria would be sensible, if we could get there. ;) E.g. it's quite feasible to group them as above.

Now, route-selection still occurs, and UPDATEs flow happily between the lower peers - aiding convergence in these.

However, so long as the upper "true" S-set peers are still not finished sending, UPDATEs do *not* flow towards them. The router defers updates to S-set peers (per-peer or as a set), until all routing information has come in from them. Which is exactly the wire behaviour of what the GR RFC was intended to achieve.

(The GR RFC does not mention the restart-restart case specifically at all. Indeed, you have to thread together text from different parts of the RFC to figure out that case, with a dose of implication required. Note that nowhere does the GR RFC explicitly say that a restarted peer must defer sending updates to another restarted peer.)

2.  it doesn't work with non-stop forwarding.

keeping forwarding state would translate on Quagga to a zebra & bgpd restart while keeping routes in the kernel untouched. The implication here is that other routers will keep the previous state in their table and replace it only when they receive incoming updates, finally removing stale nonrefreshed entries on seeing EoR.

First we'd have to get bgpd restart, with zebra still running, working nicely.

This isn't possible to do on a per-peer level.  We can't send our
current table to one peer - sending that table means we must have the
state installed, or we're lying/actively breaking BGP! - while still
claiming to another peer that we're using our pre-restart state.

Being able to defer UPDATEs at a more fine-grained level than a global condition still permits deferring at a global level, if local NSF required global deferral.

OTOH, if you remove the ability to defer at a fine-grained level, so UPDATEs can only be deferred globally, then obviously you've lost the ability to do the former.

The finer-grained level doesn't stop global-condition deferment, while removing the finer-grained capability and stopping the whole route-selection process clearly is less flexible.
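The claim that a fine-grained mechanism subsumes the global one can be illustrated with a minimal sketch. It is not bgpd code: the function names, the `granularity` values and the `s_set` argument are hypothetical, and the only assumption is that deferral is expressed as "the set of peers whose EoR must arrive before UPDATEs go to this peer".

```python
def wait_set(peer, peers, granularity, s_set=frozenset()):
    """Peers whose EoR must be seen before UPDATEs are sent to `peer` (hypothetical model)."""
    if granularity == "peer":
        # Per-peer deferral: only wait for the peer's own table dump.
        return {peer}
    if granularity == "group":
        # Grouped deferral: S-set peers wait for the whole S-set, R-set peers are not delayed.
        return set(s_set) if peer in s_set else set()
    if granularity == "global":
        # Global deferral: everyone waits for everyone.
        return set(peers)
    raise ValueError(f"unknown granularity: {granularity}")


def may_send(peer, peers, eor_seen, granularity, s_set=frozenset()):
    """True once every peer in the wait set has sent its EoR."""
    return wait_set(peer, peers, granularity, s_set) <= set(eor_seen)


peers = {"S0", "S1", "R0"}
s_set = {"S0", "S1"}
eor_seen = {"S0"}  # only S0 has finished sending so far
for mode in ("peer", "group", "global"):
    print(mode, sorted(p for p in peers if may_send(p, peers, eor_seen, mode, s_set)))
```

In this model the global behaviour is just one choice of wait set. A design that only has a single global plug on route selection has no equivalent way to express the other two.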

(This is what I mean when I say "The R bit is intended to be based on
wallclock time.")

So, I really hate timers. Sometimes they're unavoidable, but I really prefer to have something more sensible, based on the facts of what has been happening.

For a global delay, you can not avoid a bounding timer. Otherwise, if 1 peer in the condition-set comes up slowly, /all/ peers are delayed because of that 1 peer. So then you must have a bounding timer.

With this global delay it then becomes very difficult to enable GR by default. Because there'll be never a timer that suits everyone. Too high, and it'll mostly work without causing issues for anyone - except those odd times when some peer comes up slowly, and your routing stays broken for longer than it should do. That kind of behaviour is very annoying to deal with as an admin.

Too low, and the timer will fire before most sessions are up, and bgpd will stop deferring and start sending UPDATEs to S-peers well before most have finished sending routes.

Realistically, we'd have to set the bounding timer really low, if we were to enable this by default. Because that degrades to "normal BGP" in the worst case, which would be less of a regression than "your routing takes much longer to come up than normal, every now and then".
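The timer trade-off can be made concrete with a small sketch. The finish times and the helper names are made up for illustration. Times are in arbitrary units since startup, and the model assumes the global delay ends when either every peer has sent EoR or the bounding timer fires.

```python
def global_release(finish_times, bound):
    """Time at which a global update-delay ends: all EoRs are in, or the bounding timer fires."""
    return min(max(finish_times.values()), bound)


def unfinished_at_release(finish_times, bound):
    """Peers that are still sending their table when the global delay ends."""
    release = global_release(finish_times, bound)
    return sorted(p for p, finish in finish_times.items() if finish > release)


# Hypothetical EoR times: three ordinary peers and one that comes up slowly.
finish_times = {"p1": 20, "p2": 30, "p3": 18, "p4": 120}

for bound in (300, 25, 10):
    print(bound, global_release(finish_times, bound), unfinished_at_release(finish_times, bound))
```

With a high bound, every peer is held until the slow one finishes at t=120. With a low bound, the delay ends while peers are still sending, which is the "normal BGP" degradation described above.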

With peer-specific GR, we don't need to pick a timer. Sessions come up like this, with S as the start, F as the finish of the table-dump/EoR, and '-' as an UPDATE:

1  S-----------------F -   -    -- --  -   -  --- -  - - - --- - --
2           S--------------F - --- -- - -- -- --- -- -- --- -  - -
3 S---------------F- - --- --- -- --- -- -- - -
4 S-----------------F - --- -
etc.

Peer 1 can receive updates from t=1F.

Peer 2 can only receive updates starting from t=2F, consisting of the UPDATEs on prefixes received from 1, 2, 3.

Peer 3 can only receive updates from t=3F, on the set from 1, 2, 3.

In the best case there's a good bit of overlap on most peers for the initial S-F dumps, and much network-level UPDATE churn is avoided.

In the worst case there's no overlap. The worst-case delay for any peer at all can be no worse than the time it takes that peer to send its RIB.
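As a rough illustration of that bound, the sketch below compares how long each peer waits for its first UPDATE under per-peer deferral and under a global deferral with no bounding timer. The (start, finish) times are invented, only loosely following the timeline above, and the function names are hypothetical.

```python
# Hypothetical (session start, EoR) times per peer, in arbitrary units.
sessions = {"1": (2, 20), "2": (12, 27), "3": (0, 16), "4": (30, 48)}


def per_peer_wait(sessions):
    """Per-peer deferral: a peer is held only until its own EoR."""
    return {p: finish - start for p, (start, finish) in sessions.items()}


def global_wait(sessions):
    """Global deferral without a bounding timer: every peer waits for the last EoR."""
    last = max(finish for _, finish in sessions.values())
    return {p: last - start for p, (start, _) in sessions.items()}


print(per_peer_wait(sessions))
print(global_wait(sessions))
```

In the per-peer case each wait equals the time that peer took to send its own RIB. In the global case the early peers inherit the delay of the slowest one.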

However, this still *always* avoids the churn where the local speaker sends an UPDATE for a prefix on which it would prefer the remote peer's path - because the remote peer will still *always* send its path first. So the local speaker will /always/ have that remote speaker's path (if any) in its RIB when deciding, for any prefix it could send to that remote speaker.

It is *guaranteed*, even with this peer-granular deferral, that, as routes come in, once the local speaker sends an UPDATE to the remote speaker, it will not send a withdraw for that prefix to that remote peer - even as updates come in from other peers later (assuming no more disconnects anywhere, and ignoring various oscillation pathologies in BGP that GR does not and can not fix).
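A toy simulation can show the UPDATE-then-WITHDRAW transient and how per-peer deferral filters it. This is a sketch under strong simplifying assumptions, not a model of bgpd: there is a single prefix, a numeric preference stands in for the BGP decision process (higher wins), each peer sends at most one path and then EoR, and the local speaker never advertises a path back to the peer it was learned from. All names are hypothetical.

```python
def simulate(events, defer_per_peer):
    """Replay ordered events and return the (peer, message) pairs the local speaker sends.

    events: list of ("route", peer, preference) or ("eor", peer).
    """
    peers = sorted({event[1] for event in events})
    rib = {}                                  # peer -> preference of its path
    finished = set()                          # peers whose EoR has been seen
    advertised = {p: None for p in peers}     # source of the path each peer last got from us
    sent = []
    for event in events:
        if event[0] == "route":
            rib[event[1]] = event[2]
        else:
            finished.add(event[1])
        best = max(rib, key=rib.get) if rib else None
        for p in peers:
            if defer_per_peer and p not in finished:
                continue                      # still waiting for this peer's own EoR
            want = None if best == p else best
            if want == advertised[p]:
                continue                      # nothing new to tell this peer
            sent.append((p, "UPDATE" if want is not None else "WITHDRAW"))
            advertised[p] = want
    return sent


events = [
    ("route", "A", 100), ("eor", "A"),
    ("route", "B", 200), ("eor", "B"),
    ("route", "C", 150), ("eor", "C"),
]
print(simulate(events, defer_per_peer=False))
print(simulate(events, defer_per_peer=True))
```

Without deferral, B is first sent A's path and then a WITHDRAW once B's own better path arrives. With per-peer deferral no WITHDRAW is sent in this scenario, because nothing goes to a peer until its own path is already in the RIB.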

With a more clumped grouping, e.g. the original-S-peers and original-R-peers above, you could guarantee no routing information can flow from the R-set to the S-set, until all the S-set peers have finished sending. However, you would then still need a bounding timer - but it could at least still allow the R-set to converge freely, without delay.

Note, the finer-grained approaches assume you can still run route-selection, so you can send stuff to the set you're not applying delays to. That aspect could cause more CPU churn. That's trading off network convergence goodness for CPU churn. That can be fixed by setting the granularity to global though.

Apparently I failed in communicating my intent to put together full GR.

Communicating complex things is difficult.

There's been a lot of communication failure, from both sides, in this thread. Though, I'd assert some of it started in the very first reply to my patches, which didn't help in setting the tone. I'm quite bad at responding like-with-like, particularly.

There are techniques we could adopt to try to avoid this and make things better for everyone.

"Send UPDATEs without artificial delays when it's perfectly acceptable" is
abusing people's networks?

Please stop twisting my words to your liking. I complained that it isn't acceptable to push experimental, undocumented, unreviewed new protocol inventions into Quagga. That would be using our users' networks as testbeds.

It's not. It's normal BGP! :)

If speaking normal BGP and sending UPDATEs to a peer without delay (as per normal BGP) is using users as test-beds we'd better stop shipping bgpd. :)

Unfortunately, as pointed out in the "bug, not feature" section, it's not "perfectly acceptable" to send these updates; in fact it will generate extra churn.

What extra churn though? There is no extra churn relative to BGP-4.

Further, even with the peer-specific granularity, even in the worst case, it filters out the worst transients (sending a route that the remote-peer has a better path for before you've got it, leading to UPDATE-then-WITHDRAW to that remote peer), and the maximum-delay for any remote peer is no longer than the time it takes that remote-peer to send its RIB to you.

The peer-granularity way could be set as default, because it can do pretty much no harm, relative to normal BGP, while still filtering out some churn.

It can also be made to accommodate a more clumped approach (either into two sets so the R-set-peers can be updated immediately, or global-deferment), though that will need a timer.

If the global update-delay way is implemented as a queue-plug on all processing, then we can never have the fine-grained way, while the fine-grained way still allows for the global-delay way.

Note, we can have other criteria. E.g. timers are difficult to pick good values for. Maybe there are other criteria (could even extend capabilities to exchange some preliminary data to help with this, if that would actually be useful).

Had this different, new churn-reduction approach been passed through 1) writing a spec, and 2) posting that spec to the IETF idr WG, I believe the issue would very likely have been caught.

It's compatible with the RFC, as far as any single remote speaker is concerned. At worst, it degrades to "plain old BGP" - exactly as update-delay does if the timer happens to be too low.

RFCs do not specify internal implementation. They specify what is required for interoperability. They may specify a model for state, but that is done to define the semantic meaning of network messages. That does *not* mean every internal implementation must exactly follow the given state model though!

Otherwise, any BGP implementation that doesn't keep a separate AdjIn, AdjOut and LocRIB can never be BGP. Which surely isn't true. E.g. Quagga does not have a separate AdjIn. I'm sure other major implementations don't follow the BGP RFC state model to the letter either.

Also, nothing stops us from implementing and describing this relaxation of GR, which is "You don't need to apply the GR update-deferral to peers where it isn't needed, if you prefer fast convergence - just send straight away as per normal BGP".

BGP can and does evolve, at least sometimes because vendors implement compatible optimisations and then describe them. Or in this case, a "don't apply an optimisation to cases where it doesn't help".

I wonder if Zebra changed their behaviour in the time since we picked up that code from them.

GNU Zebra? Don't think there was any change there before it disappeared. IPInfusion ZebOS: no idea.

regards,
--
Paul Jakma      [email protected]  @pjakma Key ID: 64A2FF6A
Fortune:
Your heart is pure, and your mind clear, and your soul devout.
