Les, a bit of a delay since I had to think a bit about your comment to do it
justice, and it's a bit long'ish.

1. So, to start with a cut-and-dried summary and the reasoning for it: I am
firmly against adding signaling to the whole thing by any means (or rather,
any procedures to act upon distribution of info about the algorithm used by
any of the nodes involved; i.e. I'm OK with having the algorithm advertised
*solely* for informational purposes, though I don't see what function it
serves except detecting nodes that do not reduce yet during the transition
of a network, or maybe, as you say, detecting an algorithm mismatch). More
detailed reasoning follows:

a. The first reason is the fact that the additional flexibility of maybe
having some better hash algorithm one day would add a *very* serious amount
of complexity in implementation/behavior if we are talking about adding it
to the centralized variant of the dynamic flooding draft and having a
leader advertise the algorithm.
    i. Backup machinery needs to be added/spec'ed properly. What does the
network do if the backup has a different algorithm than the current leader?
First we would have a transition phase: some nodes have the old algorithm,
some the new; the network may stop converging for a bit that way; worst
case we partition the PGL algorithm advertisement from new nodes, so we
have to wait CSNP * diameter, etc. A big network bleep is the result. I
know there is lots of verbiage in the dynamic flooding draft, but I know
the reality of implementing such things: the costs are extraordinarily high
for the bit of flexibility I see you suggesting the whole thing would buy
us.
   ii. What happens if the PGL doesn't say anything? Default algorithm?
Full flooding again? In the case of full-flooding regression, all of a
sudden one fat finger on the PGL (or the PGL moving unexpectedly due to a
fat finger/some other node's config changes) can basically crash your
network and, worst case, stop convergence if reduction allowed it to
converge before, since full flooding seriously slows everything down. I
know, this would be a network teetering on the edge already, but why have
additional demons hiding in a single point of failure on top?
  iii. Lots of remaining subtle things. E.g., to make sure the whole thing
works, each node has to compute reachability to the leader (not sure that's
in the dynamic flooding draft now); otherwise they may use stale LSPs from
a leader that is gone/partitioned. This reachability computation will have
adverse effects: the timing is unpredictable in the network and may lead to
the problems mentioned in i). If nodes don't do the reachability check we
may end up in Paxos unintentionally, BTW.

Generally, I can claim that I lived the PGL in ATM, so I've seen the
"central leader in IGP" game. Not excited about it from experience, and it
was much easier in ATM already due to the hard state of SVCs. To sum it up
again: I see here a suggestion to add a massive amount of
complexity/fragility for an assumed, unspecified benefit in the future. As
a footnote: centralization in an IGP is a cardinal sin in my eyes, moving
away from the first premise that made distributed routing so successful. I
spoke against it and still hold the same opinion, and if that's heresy I'm
more than happy to be bumped off the authors' list of the dynamic-flooding
draft ;-).

So maybe as iv) here: WHAT additional variables in the hash do you imagine
would constitute a _better_ algorithm? AFAIS there are none I can imagine,
and the current algorithm provides pretty much the best entropy with
clearly capped state per node needed to balance per LSP
originator/fragment. So instead of pleading for flexibility for
flexibility's sake, I'd rather see you suggest something that would
change/improve the behavior now/in the future in concrete terms, and then
let's talk about specifics.
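
For concreteness, the kind of computation the "common hash across a set of
shared variables" implies can be sketched like this (a toy Python
illustration of the general idea, NOT the draft's actual algorithm; all
names and the choice of SHA-256 are mine):

```python
import hashlib

def reflood_winners(group_members, lsp_originator, fragment, k=2):
    """Deterministically pick k members of a flooding group to re-flood
    a given LSP originator/fragment.  Every member runs the same hash
    over the same shared variables, so all reach the same conclusion
    without any signaling."""
    def score(member):
        key = f"{member}|{lsp_originator}|{fragment}".encode()
        return hashlib.sha256(key).hexdigest()
    # sort by hash value and take the top k; per-node state stays capped
    # at the group membership plus k winners per originator/fragment
    return sorted(group_members, key=score)[:k]

# every node computes the same winner set independently
group = ["node-a", "node-b", "node-c", "node-d"]
winners = reflood_winners(group, lsp_originator="0000.0000.0001", fragment=3)
i_must_reflood = "node-b" in winners
```

The point being: the only "variables" in play are the group membership and
the LSP originator/fragment, which is why I struggle to see what a
pluggable-algorithm registry would buy here.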

b. Then, the second reason, when talking about a distributed solution,
i.e. each node flooding the algorithm it uses: we still do NOT know what
to do in case nodes each advertise different algorithms, no matter whether
it's advertised or not. Shut down the network? Fall back to full flooding
if one node disagrees (which makes every node a potential attack vector)?
We had that kind of discussion before, last on multi-TLV, where you were
insisting on killing the cap indication, so it would be funny to add it
here. Complexity without any concrete benefit whatsoever AFAIS, and lots
of ratholes again.

2. To go to your reliable PSNP/CSNP objection now. First, they were never
reliable. Neither were LSPs. We can make a very fine argument that if
PSNPs/CSNPs are not reliable then ISIS will not converge at all. We can
then start to argue how many we lose, and when, and how one variation of
flooding is "more robust" than another, and we can actually discover that
if the redundancy factor in the graph is higher than the largest fanout
then we are in normal ISIS. Hence the reduced flooding redundancy factor
(in the extreme case it's basically infinity for the existing flooding
algorithm in ISIS) plus PSNP unreliability are two variables (plus network
radius, plus origination rates, etc.) which in the extreme case can be
shown to not converge the network anymore no matter the flooding (e.g. if
the re-origination rate plus radius is higher than the propagation time
under CSNP/PSNP losses). In short, the objection brings nothing new to the
table, Les; this has been around forever, and we're talking here about the
fact that any flooding reduction makes flooding somewhat "less" reliable.
That's trivia.
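
To put that extreme case into back-of-envelope form (my own toy model with
made-up numbers, not anything from either draft): assume each hop's update
is lost with probability p and recovered only when a retransmit/SNP timer
fires, so the expected transmissions per hop follow a geometric
distribution.

```python
def expected_propagation_time(radius, per_hop_delay, loss_prob,
                              retransmit_timer):
    """Expected time for an LSP to cross the network when each hop's
    update is lost with probability loss_prob and recovered only after
    a retransmit timer fires (toy model, geometric retries)."""
    expected_tx = 1.0 / (1.0 - loss_prob)
    per_hop = per_hop_delay + (expected_tx - 1.0) * retransmit_timer
    return radius * per_hop

# assumed numbers: 10-hop radius, 10 ms per hop, 50% loss,
# 5 s retransmit timer, LSP re-originated every 30 s
prop = expected_propagation_time(radius=10, per_hop_delay=0.01,
                                 loss_prob=0.5, retransmit_timer=5.0)
converges = prop < 30.0  # ~50 s > 30 s: re-origination outruns propagation
```

With numbers like these the LSP never stabilizes network-wide before the
next re-origination, reduced flooding or not, which is exactly why the
"SNPs aren't reliable" argument cuts against full flooding just the same.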

b. To more productive arguments: the solution does NOT reduce the full
CSNP advertisement, and this will fix any bug in an algorithm. We agree
that far, I think.

3. Then, let's have the up-to-date PSNP in the glossary with a better
name, e.g. "consistency-assuring PSNP" or CA-PSNP, which describes better
what it is. It cannot hurt.

It goes like this (which I thought was already decently clear in the
draft, but there's nothing wrong with spelling it out):

a) The algorithm figures out during computation that LSP-ID X/fragment Y
is NOT flooded on, since other RNL members took over. Now, the according
LSP-ID X/fragment Y is put on the PSNP queue of all the members in the TN
that are your neighbors (an optimization here) or, as the draft says, "all
your neighbors", which is a bit too conservative. Flood out those PSNPs on
a second timer unless they were killed by normal ISIS processing rules or
already went out. Observe that NO changes are made to normal ISIS
CSNP/LSP/PSNP processing here except dropping those PSNPs into the
according queues to go out. If the neighbor gets the PSNP and interprets
it as something newer, normal procedures kick in. If it already has it,
nothing will really happen, per normal procedures. If your implementation
is very conservative you can choose super-conservative constants yourself,
e.g. unless you see triple coverage by other RNLs you flood nevertheless.
Or if it turns out you send PSNPs to your neighbors in the expectation
that the TNLs covered them and you get requests back, then either the
other TNLs are dead slow or something is off, and an alarm can be raised,
as in "flooding reduction here struggles". Nothing to do with this
solution; this will happen with any type of flooding reduction:
chokepoints may get created (and observe that this draft load-balances
flooding and does not only reduce it, one of the lessons I learned
implementing those things in my earlier lives ;-)
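
The queue mechanics can be sketched roughly like this (a toy Python model
of the steps above; class/method names and the timer value are mine, not
the draft's):

```python
class CAPsnpQueue:
    """Toy model of the CA-PSNP machinery: fragments we decided NOT to
    re-flood get queued per neighbor, and the queue is drained on a
    second(ary) timer unless normal ISIS processing already covered the
    entry.  Normal CSNP/LSP/PSNP processing is left untouched."""

    def __init__(self, timer=5.0):
        self.timer = timer          # seconds between drains (assumed value)
        self.pending = {}           # (lsp_id, fragment) -> set of neighbors

    def on_not_flooded(self, lsp_id, fragment, tn_neighbors):
        # algorithm decided other RNL members re-flood this fragment;
        # remember to confirm it to our TN neighbors via PSNP
        self.pending[(lsp_id, fragment)] = set(tn_neighbors)

    def on_covered(self, lsp_id, fragment):
        # normal ISIS processing already sent/acked this entry: kill it
        self.pending.pop((lsp_id, fragment), None)

    def drain(self):
        # timer fired: emit the queued PSNPs and clear the queue
        out = list(self.pending.items())
        self.pending.clear()
        return out
```

A drain that keeps coming back non-empty for the same fragments, or
neighbors answering the PSNPs with requests, is precisely the "flooding
reduction here struggles" alarm condition described above.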

So, to sum up the argument chain: I err on the side of simplicity here
since, from experience, simplicity allows us to deploy and stand
straight-faced in front of customers with very large, dense networks. This
draft consists of 12 pages including examples, of which about 4-5 pages
are boilerplate. And on top, it is based on old, clean work, and pretty
much everything in it is proven by implementation and prior art IME. This
vs. an adopted design-by-committee draft of 46 pages that, at this point
in time, I think does not standardize any interoperability but
standardizes how to find out why things don't interoperate, due to all
possible combinations of centralized vs. distributed plus
bring-your-own-algorithm on top by every vendor (based on my last read of
it) ...

-- tony






On Wed, Nov 23, 2022 at 1:14 AM Les Ginsberg (ginsberg) <ginsberg=
[email protected]> wrote:

> Draft authors -
>
> The WG adoption call reminded me that I had some questions following the
> presentation of this draft at IETF 114 which we decided to "take to the
> list" - but we/I never did.
> Looking at the minutes, there was this exchange:
>
> <snip>
> Les:           I'm not convinced that you don't need to advertise
>                whether a node needs support this. If not, why not define
>                this as an algorithm and use the dynamic flooding?
> Tony P:        First bring me a case why we need to signal this.
> Les:           If I'm not going to flood and I'm expecting someone else
>                to flood, and I don't know whether we're in sync.
> Tony:          Think it through, the mix with old nodes just fine. The
>                old guy still do the full flooding and that's fine.
> Les:           You use the term up-to-date PSNP, I have no idea how you
>                determine whether the PSNP is "up-to-date"? unlike CSNP,
>                PSNP doesn't have the info.
> Tony:          You have to list all those things.
> Les:           Let's take it to the list.
> <end snip>
>
> Question #1: Why not define this as an algorithm and use
> draft-ietf-lsr-dynamic-flooding (in distributed mode)?
> This question is of significance both from a correctness standpoint and
> what track (Informational or Standard) the draft should target.
>
> Tony P's reply above suggests this isn't needed - but I don't think this
> is true. The draft itself says in Section 2.1:
>
> <snip>
> Once this flooding group is determined, the members of the flooding
>    group will each (independently) choose which of the members should
>    re-flood the received information.  Each member of the flooding group
>    calculates this independently of all the other members, but a common
>    hash MUST be used across a set of shared variables so each member of
>    the group comes to the same conclusion.
> <end snip>
>
> If a "common hash MUST be used across a set of shared variables" (and I
> agree that it MUST) then all nodes which support the optimization MUST
> agree to use the same algorithm. Given that there are likely many hash
> algorithms which could be used, some way to signal the algorithm in use
> seems to be required.
> By publishing a given algorithm(including the hash) and having it assigned
> an identifier in the registry defined in
> https://www.ietf.org/archive/id/draft-ietf-lsr-dynamic-flooding-11.html#section-7.3
> - and using the Area Leader logic defined in the same draft, consistency is
> achieved.
> Without that, I don't think this is guaranteed to work.
>
> Note the issue here has nothing to do with legacy nodes - I agree with
> Tony P's comment above that legacy nodes do not present a problem - they
> just limit the benefits.
>
> Question #2: Please define and demonstrate how "up-to-date PSNPs" work to
> recover from flooding failures.
>
> We know that periodic CSNPs robustly address this issue - and their use
> has been recommended for flooding reduction solutions over the years.
> Please more completely define "up-to-date PSNPs" and spend some time
> demonstrating how they are guaranteed to work - and consider in that
> discussion that transmission of SNPs of either type is not 100% reliable.
>
> Thanx.
>
>     Les
>
> _______________________________________________
> Lsr mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/lsr
>