[Lsr] Re: Another counter-example

Tony Li Wed, 04 Dec 2024 09:00:31 -0800

Les,

Upgrades are the motivation for deploying multiple algorithms: doing so 
allows for incremental rollout of a new algorithm. Yes, there are significant 
operational considerations.


Reverting to full flooding is neither practical nor necessary.  Migration has 
the strong advantage of having a minimal blast radius, as has been requested.

Interoperability is not a serious problem as there is a boundary of legacy 
flooding between dissimilar algorithms.

Once you grasp that you have only a single algorithm within a subgraph, 
debugging gets a whole lot easier.
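
To make the subgraph rule concrete, here is a minimal sketch in Python. It is not taken from any draft: the node names, the node_algo mapping, and the classify_links helper are all hypothetical. It only shows how links would be partitioned, assuming each node advertises at most one (algorithm, version) pair. A link is eligible for optimization only when both ends run the same algorithm and version; every other link falls back to legacy full flooding.

```python
from collections import defaultdict

def classify_links(node_algo, links):
    """Partition links into per-algorithm subgraphs and legacy links.

    node_algo maps a node name to an (algorithm, version) tuple, or to
    None for a node that only does legacy full flooding (hypothetical
    representation, for illustration only).
    """
    subgraphs = defaultdict(list)  # (algorithm, version) -> links inside it
    legacy = []                    # links that must use full flooding
    for a, b in links:
        algo_a, algo_b = node_algo.get(a), node_algo.get(b)
        if algo_a is not None and algo_a == algo_b:
            # Same algorithm and version on both ends: inside one subgraph.
            subgraphs[algo_a].append((a, b))
        else:
            # Dissimilar algorithms, versions, or a legacy node: boundary.
            legacy.append((a, b))
    return dict(subgraphs), legacy

# Hypothetical five-node chain: two nodes on X, two on Y, one legacy node.
node_algo = {
    "R1": ("X", 1), "R2": ("X", 1),
    "R3": ("Y", 1), "R4": ("Y", 1),
    "R5": None,
}
links = [("R1", "R2"), ("R2", "R3"), ("R3", "R4"), ("R4", "R5")]
subgraphs, legacy = classify_links(node_algo, links)
```

In this example the R2-R3 and R4-R5 links form the legacy-flooding boundary, and each algorithm only ever sees its own subgraph. Which links inside a subgraph actually carry flooding is up to that algorithm and is not modeled here.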

T

p.s. Tony and I have discussed things offline and I am hoping that he will 
revise his drafts so that they are easier to absorb.


> On Dec 4, 2024, at 8:36 AM, Les Ginsberg (ginsberg) - ginsberg at cisco.com 
> <[email protected]> wrote:
> 
> Tony –
>  
> Upgrades are orthogonal to my comments.
> I am speaking about the need to deploy multiple flooding algorithms in a 
> network (one of which may be “static”).
> We have never considered that in scope before – and there are obvious 
> challenges to doing so – not least of which is the ability to test.
>  
> I think when you say “upgrade” you are talking about needing to migrate from 
> algorithm X to algorithm Y – or from Algo X-V1 to Algo X-V2 where V2 has some 
> fix that isn’t fully interoperable with V1.
> We already have a way of handling this case:
>  
> Revert to base flooding everywhere – do the upgrade – and then enable the 
> upgraded algo.
> Conceptually, this is consistent with how we have deployed major infra 
> upgrades (e.g., narrow to wide metrics).
>  
> This is far safer than trying to deal with co-existence – not least because 
> once you allow co-existence you have to allow that a customer might use this 
> as a permanent state – not just an upgrade state.
> Given the challenges we already face with interoperability even when all 
> routers are trying to “do the same thing” (and I am not limiting this comment 
> to just flooding)   the idea that we should now embrace a persistent state 
> where routers are intentionally doing inconsistent things seems at best naïve.
>  
> Imagine that you and I are called to root cause problems in a customer 
> network.
> Your implementation supports algorithm X and doesn’t understand algorithm Y.
> My implementation supports algorithm Y and doesn’t understand algorithm X.
> Flooding issues are notoriously difficult to diagnose – even when all nodes 
> are supposed to be doing the same thing.
> All the while our mutual customer is (rightfully) pressuring to get this 
> fixed ASAP.
> We might well ask “how did we get into this mess”.
>  
>    Les
>  
>  
> From: Tony Li <[email protected]> On Behalf Of Tony Li
> Sent: Wednesday, December 4, 2024 7:54 AM
> To: Les Ginsberg (ginsberg) <[email protected]>
> Cc: Tony Przygienda <[email protected]>; Peter Psenak (ppsenak) 
> <[email protected]>; Shraddha Hegde <[email protected]>; Robert Raszuk 
> <[email protected]>; lsr <[email protected]>
> Subject: Re: [Lsr] Another counter-example
>  
>  
> Les,
>  
> The step that you’re missing is that upgrades are inevitable and thus an 
> operational necessity.  
>  
> We are very, very, very unlikely to get things right on the first go. 
> Therefore, we will need to fix our bugs. How do you deploy that bug fix? Add 
> to the mix that we’re not willing to do a flag day cutover to the fix.
>  
> A better way of thinking of mesh groups is that they are the ’static routes’ 
> of legacy flooding.  They are installed by network operators and are presumed 
> to be perfect. No signaling necessary.
>  
> Tony
>  
>  
>  
> 
> 
> On Dec 4, 2024, at 7:28 AM, Les Ginsberg (ginsberg) - ginsberg at cisco.com 
> <[email protected]> wrote:
>  
> I am very much in agreement with Peter – though I think his commentary is 
> “too kind”. 😊
>  
> The issue w mesh groups is that they are opaque to other nodes i.e., you may 
> come up with a way of signaling that a node has configured mesh groups (which 
> BTW the distoptflood draft does NOT currently have – and I hope it never 
> does…) but unless you are going to also propose that a node signal what links 
> are/are not being used for flooding the best you can do from the POV of other 
> nodes is treat the node as if it is running a flooding algorithm which is 
> totally opaque – and which is also “brittle” i.e., it doesn’t do well in the 
> event of topology changes.
>  
> To Tony P – one of the things that disturbs me about the way this discussion 
> is taking place is how we seem to have “skipped steps”.
>  
> The interest in optimized flooding dates back decades.
> Early attempts include:
>  
> https://datatracker.ietf.org/doc/rfc2973/ (Mesh Groups) (circa 2000)
> https://datatracker.ietf.org/doc/html/draft-ietf-ospf-isis-flood-opt-01 
> (circa 2001)
> MANET work (circa 2014)
>  
> All of these attempts were very conservative in nature. The notion of 
> deploying multiple solutions simultaneously and thinking about how they might 
> “interoperate” was quite deliberately not looked at. The general view has 
> been “be very very careful when you mess with flooding”.
>  
> Suddenly, we now seem to have “leaped off the cliff” and are talking about 
> deploying multiple algorithms and trying to get them to “interoperate”.
>  
> At what point did the WG conclude that this is a real requirement and that it 
> actually can be deployed safely?
>  
> If people want to discuss this – the WG is a fine place to do it. But I would 
> appreciate discussion that does not skip over the very real concerns that 
> have kept us from even considering this for the last three decades.
>  
>    Les
>  
>  
>  
> From: Tony Przygienda <[email protected]>
> Sent: Wednesday, December 4, 2024 12:35 AM
> To: Peter Psenak (ppsenak) <[email protected]>
> Cc: Shraddha Hegde <[email protected]>; Robert Raszuk <[email protected]>; 
> Tony Li <[email protected]>; lsr <[email protected]>
> Subject: [Lsr] Re: Another counter-example
>  
> Valid point of view, but there are other solutions possible to the whole 
> thing as well that don't precondition mesh-group node lift-up. If consensus 
> passes and we start to work on details of the necessary leaderless 
> signalling, then putting this in some framework that's part of operational 
> considerations would be my take ...
>  
> thanks
>  
> -- tony
>  
> On Wed, Dec 4, 2024 at 9:25 AM Peter Psenak <[email protected]> wrote:
> Hi Shraddha,
> 
> so you define mesh-groups to be a separate flooding algorithm itself, 
> requiring all routers using them to be upgraded.  By the time you do that, 
> you can also replace mesh-groups with distopt on all routers and be done 
> with it, instead of trying to solve the coexistence of the two.
> 
> thanks,
> Peter
> 
> On 04/12/2024 07:48, Shraddha Hegde wrote:
> 
> Hi Robert,
>  
> With dist-opt flood reduction running in leaderless mode, it is possible for 
> the operator to run mesh-groups in some part of the network and introduce 
> distopt flooding in other parts where needed. The nodes configured with 
> mesh-groups have to be upgraded to advertise that they are running a 
> different flood reduction algorithm. The distopt algorithm will then ensure 
> that the neighbors of the nodes running mesh-groups always become reflooders, 
> and hence correct flooding behaviour is ensured in the CDS where distopt runs.
>  
> Some networks have mesh-groups deployed in a well-defined part of the 
> topology, where the mesh-group configuration reduces back-flooding by 50%. 
> This has been deployed for many years and is serving well. If an operator 
> wants to keep that config and introduce distopt in other parts of the 
> topology (during migration or otherwise), it’s a very valid use case and can 
> be supported with the distopt algorithm.
>  
> Rgds
> Shraddha
>  
>  
> From: Robert Raszuk <[email protected]>
> Sent: 27 November 2024 15:58
> To: Peter Psenak <[email protected]>
> Cc: Tony Li <[email protected]>; Tony Przygienda <[email protected]>; 
> lsr <[email protected]>
> Subject: [Lsr] Re: Another counter-example
>  
>  
> > you are talking about mixing the manual mesh group with optimized flooding.
>  
> I am talking about an accidental mix (legacy configuration at some nodes) not 
> a planned one. 
>  
> And you either auto detect it and disable the ability to optimally flood or 
> you push full responsibility to the operator.
>  
> Thx,
> R.
>  
> On Wed, Nov 27, 2024 at 11:16 AM Peter Psenak <[email protected]> wrote:
> Robert,
>  
> On 27/11/2024 10:32, Robert Raszuk wrote:
> Peter, 
>  
> My point was that this should at least be mentioned in the operational 
> considerations section if dynamic flooding is expected to work in mixed 
> networks where some nodes support the new algorithm and some do not (your 
> "regular flooding case"). 
>  
> you are talking about mixing the manual mesh group with optimized flooding. I 
> don't think we want to go that path.
> 
> thanks,
> 
> Peter
> 
>  
> 
>  
>  
> On Wed, Nov 27, 2024 at 10:28 AM Peter Psenak <[email protected]> wrote:
> Robert,
>  
> On 27/11/2024 10:22, Robert Raszuk wrote:
> Peter,
>  
> I am not sure if what Tony said is a requirement or an observation. 
>  
> > Note that combining routers that run the elected optimized algorithm
> > with routers that do run the regular flooding is not a problem.
>  
> Note that static mesh groups can be present today too and you can't assume 
> that it is either an optimized algorithm or full flooding.
> please do not compare apples with oranges.
> 
> Static mesh groups are manually configured and if not done correctly can 
> result in broken flooding. What we are discussing here is a dynamic flooding 
> algorithm, not manual flooding blocking.
> 
> thanks,
> Peter
> 
>  
> Thx,
> R.
>  
>  
> On Wed, Nov 27, 2024 at 9:58 AM Peter Psenak <[email protected]> wrote:
> On 27/11/2024 00:18, Tony Li wrote:
> > A distributed algorithm computing a flooding topology must only 
> > operate upon nodes running the same algorithm (and version). If 
> > multiple algorithms (and/or versions) are running in the same network, 
> > then any given algorithm and version defines a subgraph and the 
> > algorithm can only optimize flooding within its own subgraph. Legacy 
> > full flooding must be used between subgraphs of different algorithms 
> > or versions.
> 
> This is a new requirement for the flooding algorithm itself. This does 
> not exist with the existing leader based election, as that guarantees 
> that only one optimized flooding algorithm is ever present in the area. 
> Note that combining routers that run the elected optimized algorithm 
> with routers that do run the regular flooding is not a problem.
> 
> thanks,
> Peter
> 
> _______________________________________________
> Lsr mailing list -- [email protected] <mailto:[email protected]>
> To unsubscribe send an email to [email protected] <mailto:[email protected]>
