Hi Pierre,
Thank you for starting the conversation and for the quick intro on the
differences.
When I look at this draft and PLSN, what I see is that the PLR is, by
definition, a type B router (since it has an alternate that is safe for
forwarding traffic or, for link up, its old primary) and that the PLR is
then the only router to apply the basic procedure. However, the PLR may not
have an alternate available unless MRT is used.
As draft-ietf-rtgwg-microloop-analysis-01 says in Sec 3.3:
" Another distinct situation is when the router does not support IPFRR or
could not repair the failure, the new primary next-hops do not satisfy the
safety condition, and there's no other neighbor that does, i.e. a type-C
situation. Unlike other routers in the network, the router directly
connected to the network does not have the old next-hop any more, and
cannot continue using it. Immediately switching to the new next-hops, on
the other hand, may result in a micro-loop. In this situation, the router
MUST discard traffic forwarded along the affected route for the duration of
DELAY_TYPEC, and then update the routes. Implementations MAY have a
configuration option to allow switching immediately to the new next-hops
for situations where this type of a micro-loop is not a concern. If
implemented, this option MUST be disabled by default."
Granted, this discarding becomes the default behavior for
draft-litkowski-rtgwg-uloop-delay-00, but the reasoning and trade-offs are
not discussed.
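To make the type-C behavior quoted above concrete, here is a minimal Python
sketch of the decision the PLR faces. It is only a reading of the quoted
paragraph, not code from either draft: the function name, the step labels and
the parameter names are hypothetical, and no value for DELAY_TYPEC is assumed
(the caller supplies it).

```python
def plr_type_c_plan(delay_typec, allow_immediate_switch=False):
    """Return the ordered (step, duration_in_seconds) list for a type-C PLR.

    Hypothetical helper: models only the quoted Sec 3.3 text. The option to
    switch immediately is off by default, as the quoted text requires.
    """
    if allow_immediate_switch:
        # Operator has accepted the risk of a micro-loop.
        return [("update_routes", 0.0)]
    # Default: drop affected traffic for DELAY_TYPEC, then install new routes.
    return [("discard_traffic", delay_typec), ("update_routes", 0.0)]
```

The point of the sketch is that the default path contains an explicit
discard step, which is the trade-off that needs to be discussed.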
In the analysis given in draft-litkowski-rtgwg-uloop-delay-00, the benefit
is discussed only in terms of local microloops; non-local microloops are
completely ignored. I know that this particular technique is not solving
the remote microloop problem, but remote microloops are a real problem, and
without even attempting to characterize them there's little way of telling
whether the local microloops are 1% of the problem or 99%.
That the technique works when only the PLR implements it is not as
interesting as having a more general technique that works for traffic from
all routers that implement it and does not cause problems.
Obviously, the WG debated this issue quite some time ago and was willing to
go for a simpler partial solution (PLSN) over OFIB that gave similar
coverage to RLFA.
Is your current argument that this even simpler, more partial solution
might gain some traction? Or is it that this is simpler to implement and
provides some mitigation?
In addition to lacking any guidance on the scale of the total problem that
it solves, the draft also lacks details on how to handle the cases where
the network hasn't been stable. Granted, the latter is not deeply complex,
but the solution isn't safely usable without it.
I think that we as a WG need to do 4 things:
a) Understand the scope of the total microloop problem and what
fraction of it draft-litkowski-rtgwg-uloop-delay-00 can actually
solve. Does it handle asymmetric link-costs and multi-hop micro-loops?
Better examples of what types of local microloops are handled and why
other types aren't protected would be useful. How would an operator be
certain as to what protection would be provided or how to engineer a
network to obtain it?
b) Have a draft that fully describes the problem, the trade-offs, and
the solution in detail rather than just a brief conceptual overview.
c) Understand the computation and complexity trade-offs between the
different solutions - given that LFA is already assumed for it to be useful.
d) Discuss how partial a solution it is desirable to standardize and the
pros/cons of having a worse solution standardized. Implementations aren't
free, and standardizing a more partial solution can delay
implementations of a better solution.
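As a rough illustration of what the characterization in (a) could look
like, the Python sketch below counts potential two-router microloops for one
link failure and one destination, and splits them into loops that involve a
router adjacent to the failed link (local) and loops that do not (remote).
It is not taken from either draft. It assumes symmetric link costs, a single
next hop per router (ties broken by name) and only loops between two
adjacent routers, so it says nothing about asymmetric costs or multi-hop
loops. The topology and all function names are made up for the example.

```python
import heapq

INF = float("inf")

def distances(graph, dest):
    """Dijkstra from dest; usable for next hops only with symmetric costs."""
    dist = {dest: 0}
    heap = [(0, dest)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            if d + cost < dist.get(nbr, INF):
                dist[nbr] = d + cost
                heapq.heappush(heap, (d + cost, nbr))
    return dist

def next_hops(graph, dest):
    """One next hop per reachable router toward dest (ties broken by name)."""
    dist = distances(graph, dest)
    return {
        node: min(nbrs, key=lambda n: (nbrs[n] + dist.get(n, INF), n))
        for node, nbrs in graph.items()
        if node != dest and node in dist
    }

def without_link(graph, a, b):
    """Copy of graph with the a-b link removed in both directions."""
    return {n: {m: c for m, c in nbrs.items() if {n, m} != {a, b}}
            for n, nbrs in graph.items()}

def classify_microloops(graph, failed_link, dest):
    """Pairs (x, y) loop if x has converged (new next hop y) while y still
    uses its old next hop x. 'Local' means the pair includes an endpoint
    of the failed link."""
    endpoints = set(failed_link)
    old = next_hops(graph, dest)
    new = next_hops(without_link(graph, *failed_link), dest)
    local, remote = [], []
    for x, y in sorted(new.items()):
        if old.get(y) == x:
            (local if endpoints & {x, y} else remote).append((x, y))
    return local, remote

# Hypothetical 4-router topology; the link A-D fails, destination is D.
graph = {
    "A": {"D": 1, "B": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "D": 4},
    "D": {"A": 1, "C": 4},
}
local, remote = classify_microloops(graph, ("A", "D"), "D")
print("local:", local, "remote:", remote)
```

In this toy topology the A-B loop is local and the B-C loop is remote, so
delaying only at the PLR would address one of the two. Running something
like this over real topologies, failures and destinations is the kind of
data that would show what fraction of the problem is being solved.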
I understand the desire to standardize something and to take something that
seems straightforward and is likely useful to at least one network, but
given the WG track record, at a minimum, I think we must have a more
complete draft that fully documents the solution in detail and compares it
fairly.
Regards,
Alia
On Mon, May 20, 2013 at 7:57 AM, Pierre Francois
<[email protected]> wrote:
>
>
> Dear rtgwg list members,
>
> I would like to know your opinion about what we should do with
> http://tools.ietf.org/html/draft-litkowski-rtgwg-uloop-delay-00 , that we
> presented in Orlando.
>
> The idea was to avoid microloops occurring in the direct neighbourhood of
> a node shutting down or bringing up a link in an IGP topology, by
> introducing some
> fixed delay in the update of the FIB in the down case, and introducing a
> fixed delay in the propagation of the LSP describing the link as up in the
> up case.
>
> The solution is simple, will be released by some in the upcoming months,
> and the Orlando audience was seeming to find it interesting to work on.
>
> Alia mentioned the interest of comparing this solution with the state of
> the art before going further with the doc, so here it comes.
>
> Generally, compared to other solutions, local-delay does not provide full
> coverage, as it only avoids all (but only) microloops occurring locally to
> the affected node. However,
> in many networks, as shown by Stephane's analysis, it is already highly
> beneficial to have loop avoidance there. Considering the simplicity of the
> approach,
> this looks like a low hanging fruit.
>
> Alia was considering a comparison with PLSN. (described in
> http://tools.ietf.org/html/draft-ietf-rtgwg-microloop-analysis-01,
> expired 7 years ago ;) )
>
> The differences with the PLSN approach are the following:
>
> PLSN lets all routers having to converge for some destinations, try to
> understand the safety of their new next hops, for each destination.
> Based on this assessment, they either
>
> 1. Transiently use a safe, non post-convergence, set of next hops, to
> finally converge to the post-convergence one, or
> 2. Transiently use old next-hops, to finally converge to the
> post-convergence ones.
>
> Local delay can be defined as a subset of this approach:
> Only the node local to the event applies the procedure.
> Step 1 in PLSN is not applied, we only suggest the node to wait for a
> fixed time, no transient FIB state.
>
> I was considering a comparison with oFIB, draft-ietf-rtgwg-ordered-fib ,
> submitted to IESG as informational.
> local-delay can be defined as a subset of this approach:
>
> While oFIB defines an ordering among all the nodes of the network, telling
> which node should wait for which neighbours to be done with their update,
> before performing their own, local-delay tells the local node to wait
> before fast convergence has happened in the rest of the network.
>
> I think that despite the close relationships between these approaches,
> local-delay is worth being documented on its own because:
>
> It's simple, on its way to be supported, and provides loop avoidance where
> they happen to be the most annoying.
>
> Cheers,
>
> Pierre.
>
>
> _______________________________________________
> rtgwg mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/rtgwg
>