[Lsr] Re: Another counter-example

Tony Przygienda Thu, 05 Dec 2024 07:56:22 -0800

Blast radius is a concept that really comes from security (originally from
grenades, in fact ;-) and is commonly used to describe the "size of impact
of a failure/misconfiguration/change on an entity, or of the entity being
compromised". In this context we can loosely talk about
"re-computation and change of CDS links on a topological change, i.e.
link/node failure or join/removal from an algo". It is also now used in
deployment discussions for the "impact of a change/(mis-)configuration on a
single node on the forwarding in the network", due to the significant number
of outages caused by such "changes" in very large networks within the last
couple of years. Colloquially, one could argue that e.g. redistribution of
prefixes into a node has a significant blast radius (we know _those_
failures), or even a link failure (due to the SPF involved), but IGP being IGP,
certain things are distributed computation and core functionality and
cannot be easily mitigated (though e.g. knobs to prevent excessive
redistribution via unintended configuration are common now).

By my count, significantly more than two large operators of large ISIS
networks have already chimed in and clearly indicated they want work on a
leaderless solution to happen in the WG (and, to my knowledge, multiple
others holding the same opinion chose to stay silent). The consensus call
was extended specifically for "Leaderless Flooding Algorithm for Distributed
Flood Reduction to allow reduced configuration, minimal blast radius, and
ease of incremental deployment" and not some "additional mechanism ...".

If the consensus call passes, AFAIS the implications for the points below are:

* blast radius is as defined above unless another definition is put forward.
Basically, the desire is that "reconfiguration/failure of a single node
influences the minimal possible number of other nodes in the network".
While that is not an inherent property of a leaderless algorithm, the
suggested disttopo comes as close to the goal as we could make it, unless
something else is suggested or improvements to the algorithm are shown.
* unless the risks are outlined via clear technical or operational examples,
or a counter-example to the -prz- framework draft is provided, I would
consider further claims of apocalyptic outcomes the moment two nodes are
configured to use different prunners to be lacking rationale. Yes, the
framework is restricted to algorithms being prunners (with the addition of
a strict MUST for CDS-only-in-own-component that crystallized in this
thread). My previous email delivered the logical chain that makes the
prunner property "necessary and sufficient" to achieve interoperable
co-existence of distributed flood reduction algorithms (and even
centrally computed ones).
* having multiple algorithms or versions during a transition phase addresses
the "ease of incremental deployment with minimal configuration and blast
radius" of the consensus call, AFAIS.
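
To make the blast radius definition in the first point concrete, here is a
minimal Python sketch of how one could measure it. Everything in it is an
assumption for illustration only: flooding_tree is a toy stand-in (a BFS
tree rooted at the lowest node id), not disttopo and not any algorithm from
the drafts, and blast_radius simply collects the surviving nodes whose own
set of flooding links differs before and after a single node is removed. It
also assumes the graph stays connected after the removal.

```python
from collections import deque


def flooding_tree(adj):
    """Toy, hypothetical flood-reduction algorithm: a BFS spanning tree
    rooted at the lowest node id, neighbours visited in sorted order.
    Returns the chosen flooding links as a set of frozenset({u, v})."""
    if not adj:
        return set()
    root = min(adj)
    seen = {root}
    links = set()
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in seen:
                seen.add(v)
                links.add(frozenset((u, v)))
                queue.append(v)
    return links


def remove_node(adj, dead):
    """Return a copy of the topology with one node and its links removed."""
    return {u: {v for v in nbrs if v != dead}
            for u, nbrs in adj.items() if u != dead}


def links_of(node, links):
    """Flooding links that terminate on the given node."""
    return {link for link in links if node in link}


def blast_radius(adj, dead):
    """Surviving nodes whose flooding links change when `dead` fails.
    Links to the failed node itself vanish unavoidably, so they are
    ignored; only re-computed choices among surviving links count."""
    before = {l for l in flooding_tree(adj) if dead not in l}
    after = flooding_tree(remove_node(adj, dead))
    return {n for n in adj
            if n != dead and links_of(n, before) != links_of(n, after)}


# Six-node ring 1-2-3-4-5-6-1 as an example topology.
ring = {1: {2, 6}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5, 1}}
print(blast_radius(ring, 2))  # {4, 5}
print(blast_radius(ring, 4))  # set()
```

With this toy algorithm on the ring, losing node 2 makes nodes 4 and 5 pick
up a new flooding link, while losing node 4 changes nothing for the
survivors. A real comparison between algorithms would plug the actual
computation in place of flooding_tree.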

Does "leaderless" force an operator into running multiple algorithms at the
same time? It does not. After the leaderless work is done, disttopo could
e.g. be added to RFC97xx as well if an operator prefers that mode of
operation for some reason, or migration can happen by simply disabling and
enabling one algorithm after another, which is a minor variation on the
flag day where a
leader is replaced by e.g. a network provisioning automation.

From here on the discussion would benefit from specific technical and
operational examples of risks or unnecessary complexity in meeting the
goals of the consensus call rather than held beliefs AFAIS

thanks

-- tony

On Thu, Dec 5, 2024 at 4:09 PM Les Ginsberg (ginsberg) <[email protected]>
wrote:

> Tony –
>
>
>
> There are multiple assumptions implicit in your response.
>
>
>
> You assume that the understanding of the realities of  “blast radius” by
> all parties is accurate and correct. I believe this still requires
> examination i.e., that the actual “blast radius” associated with
> leader-based when implemented  correctly is not inevitably global.
>
>
>
> You assume that the risks associated with having multiple algorithms
> enabled in the network (either as a transient or a permanent state) have
> been fully vetted.  I think this deserves further scrutiny.
>
>
>
> You assume that there are real deployment needs to have multiple
> algorithms deployed simultaneously in a network. I believe this deserves
> further scrutiny.
>
>
>
> I believe we are closer to the beginning of this discussion than the end.
>
>
>
> The consensus call started by Acee was “whether or not we want to work on
> an additional mechanism…”.
>
> I agree that the clear consensus on that is “yes” – but what we have
> agreed to is to discuss/work – we haven’t actually done the work yet.
>
>
>
>    Les
>
>
>
>
>
> *From:* Tony Li <[email protected]> *On Behalf Of *Tony Li
> *Sent:* Thursday, December 5, 2024 6:53 AM
> *To:* Peter Psenak (ppsenak) <[email protected]>
> *Cc:* Les Ginsberg (ginsberg) <[email protected]>; Tony Przygienda <
> [email protected]>; Shraddha Hegde <[email protected]>; Robert
> Raszuk <[email protected]>; lsr <[email protected]>
> *Subject:* [Lsr] Re: Another counter-example
>
>
>
> Hi Peter,
>
>
>
> One can migrate from one algo to the other without reverting to the full
> flooding using the leader announced algo.
>
>
>
>
>
> Perhaps you missed the numerous operators who have requested a leaderless
> approach that limited the blast radius.
>
>
>
> Acee started a consensus check here:
> https://mailarchive.ietf.org/arch/msg/lsr/4HZD9pxaHMCDhfUQMtb4mepBW4Q/
>
>
>
> I have yet to see a closure of the consensus check, but IMHO, the trend is
> quite clear.
>
>
>
> Tony
>
>
>
>
>

_______________________________________________
Lsr mailing list -- [email protected]
To unsubscribe send an email to [email protected]
