On Fri, May 15, 2015 at 08:51:34AM +0100, Paul Jakma wrote:
> My issue with it is that this is optimising for one specific case. One
> router restarts (R in the centre),
Ooooh. I think I found the source of our different approaches.
RFC 4724 Graceful Restart is designed to support software updates and
maybe hardware maintenance on routers. That's why it talks about
"preserve forwarding state during BGP restart", if the hardware &
software supports keeping the FIB intact while the software reboots.
This is administrative action, not "there's a power outage and 20
routers fail."
It has the side effect of doing something useful for startup, but it
isn't trying to combat generic churn from router outages.
It's also why it's bound by a timer. The behaviour is intended for that
very specific situation: once the software update is installed &
running, either you get back into normal operation, or something's
broken and the router goes into "normal" broken.
> Stopping all RIB processing is fine in the first case (you could argue
> this single-router R-domain is more common, and it's fine for
> route-servers, but it's less good in general for routing convergence.
RFC 4724, Section 4.1:
    [...] However, it MUST defer route
    selection for an address family until it either (a) receives the
    End-of-RIB marker from all its peers (excluding the ones with the
    "Restart State" bit set in the received capability and excluding the
    ones that do not advertise the graceful restart capability) or (b)
    the Selection_Deferral_Timer referred to below has expired. It is
    noted that prior to route selection, the speaker has no routes to
    advertise to its peers and no routes to update the forwarding state.
Update-delay seems to apply to all AFs at once, so we don't exactly do
this to the letter, but I'd argue it's "close enough", with the
possibility of a future improvement.
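As a rough illustration of the rule quoted above (this is not the
patchset's code; the struct, its fields and the function name are all
made up for this sketch), the deferral check for one address family
could look roughly like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-peer state for one address family.
 * These are NOT Quagga's structures, just an illustration. */
struct peer_state
{
  bool gr_capable;    /* peer advertised the Graceful Restart capability */
  bool restart_bit;   /* "Restart State" (R) bit set in received capability */
  bool eor_received;  /* End-of-RIB marker received from this peer */
};

/* Returns true if route selection for the address family may proceed,
 * following conditions (a) and (b) of the RFC 4724 text quoted above. */
bool
selection_may_proceed (const struct peer_state *peers, size_t n,
                       bool deferral_timer_expired)
{
  size_t i;

  /* (b) the Selection_Deferral_Timer has expired */
  if (deferral_timer_expired)
    return true;

  /* (a) End-of-RIB received from every peer we have to wait for */
  for (i = 0; i < n; i++)
    {
      /* Not on the waiting-on list: peers without the GR capability
       * and peers that are themselves restarting (R bit set). */
      if (!peers[i].gr_capable || peers[i].restart_bit)
        continue;
      if (!peers[i].eor_received)
        return false;
    }
  return true;
}
```

Note that this sketch is per address family, which is what the RFC
text asks for; update-delay covering all AFs at once is coarser.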
> What I would like is to defer UPDATEs only from the R-domain to S-domain
> peers. This is what the current code does.
You're designing a different approach. I would suggest you take it to
the idr WG @ IETF.
[removed more paragraphs of the same]
> > This patchset is pretty exactly RFC 4724, aside from the Non-GR
> > Keepalive thing. Peers in M will never wait, thus "go first". Peers in
> > L will wait until they think they have a reasonable view of the network.
> > That includes not waiting for other peers in L.
>
> If delaying, it sends EoR immediately though, doesn't it? That's not RFC
> GR, and that needn't play nice with other implementations.
Refer to my mail, Date: Tue, 5 May 2015 16:38:51 +0200.
R=1 is the same as "EoR on all families"; this is fully intended by
the spec to remove restarting routers from the waiting-on list. It is
also mentioned in the paragraph I cited above.
> > > Also, I have expertise elsewhere in network protocol analysis,
> > > convergence optimisation and churn reduction, and formal
> > > verification of same.
> > >
> > > Maybe we should try working with each other.
> >
> > Yeah, maybe we should do that... instead of arguing higher authority...
>
> I'm not arguing to higher authority. I'm asking that you stop treating me
> as if I'm an idiot, and at least _listen_ to me.
I'm asking the same thing; you don't seem to be listening to me. You,
however, switched to a meta-argument that contributes nothing to the
matter at hand and is only useful for giving one opinion a higher
weight than the other. That's not a valid move in a discussion between
equals.
> After a (good) off-list discussion, and me posting a follow-up to
> address a concern you raised, which you reply to again arguing the need
> for a timer, you then post your own patch which, gosh, goes and removes
> the startup timer and makes the R-bit timer be, gosh, dependent on
> state, just as I had argued.
>
> WTF dude? Is that working together?
WTF dude, the timer is still there, it's called t_update_delay.
+bgp_update_delay_active (struct bgp *bgp)
+{
+  if (bgp->t_update_delay)
+    return 1;
+  return 0;
+}
If you can't see that this is what I've argued in
# Date: Tue, 5 May 2015 17:32:25 +0200
#
# Yes, we need to pick up the patch linked above. For some reason, that
# patch uses a separate timer; the timers should probably be one and the
# same.
#
then I really don't know what glasses you're reading my mails with.
> - We had a (productive I thought) discussion on IRC. We didn't resolve
> everything, but I thought we were going in the right direction, and I
> thought it became clear that we had different use-cases in mind. That
> you were concerned with minimising CPU churn on the restarted router,
> while I was concerned with minimising transient churn on the
> non-restarted router.
>
> The conclusion of that seemed to be for me to post a patch to fix the
> concrete issue you raised (restricting to the startup peers) and we
> could discuss further. I did that, you reply again with NACK on all the
> patches. WTF, how is that working together?
*sigh*
I was pointing to an issue that was obviously wrong with your patch.
You weren't listening to other arguments back then, just as you aren't now. I
was happy to have found a problem that you actually acknowledged.
Unfortunately, the underlying disagreement on what we're aiming for to
begin with persists.
I'm still arguing we should implement RFC 4724. You seem to be arguing
we should try and design a new churn-reduction scheme.
To cite your position from the src-dst discussion (citing it here
doesn't make it my own opinion):
> So, it's cool stuff, but I would join Michael in thinking this needs
> more time to develop outside of master to see the demand before
> committing to it.
The difference is, the src-dst stuff has been looked at by a 2-digit
number of people @ IETF, and has been implemented by others. Your
churn-reduction protocol for BGP is completely new with no written
specification and no review at all.
If you want to implement it in Quagga, sure - but implementing it in
Quagga doesn't get you any interop testing, and it doesn't get you
adequate feedback from other qualified people who either operate
networks or have written the other BGP implementations you need to
interact with. While a Quagga implementation can (maybe should) be part
of the creation process of a new protocol, it shouldn't be the very
first thing. That's just abusing our users' networks as testbeds.
-David