[rtgwg] draft-dong-fantel-problem-statement

Reshad Rahman Fri, 02 Jan 2026 10:51:54 -0800

Hi,
I took a look at the doc and here are some comments/questions. FYI I haven't 
caught up to all discussions/threads...
Some high-level comments first:
- I support this work!
- I assume congestion avoidance is not part of this effort? I.e., is upstream 
  handling of the fast notifications (e.g. PFC, DCQCN, etc.) outside the scope 
  of this document?
Section 3
---------
       What is needed is a lightweight signaling method
      that can provide real-time alerts (e.g., at the level of sub-
      milliseconds or milliseconds) on failures, congestion, or
      threshold breaches, enabling immediate actions (e.g., in ms to 10s
      ms ranges) in the network layer.
I think "10s" could be misinterpreted as ten seconds (instead of tens of ms), 
so spell it out, e.g. "in milliseconds to tens of milliseconds".
Section 4
---------
   Therefore, this draft focuses on mechanisms capable of operating
   within these millisecond/sub-millisecond ranges, rather than
   mechanisms whose latency spans tens or hundreds of milliseconds,
   which are insufficient for preventing transient overload under rapid
   traffic transitions.
Here it says that tens of milliseconds is not good enough. Doesn't that 
contradict Section 3, which allows actions in the "ms to 10s ms" range?
Section 4.1
-----------
I believe Fig 1 is not the best representation of the problem space, since the 
local failure should be detected quickly? Consider adding a node upstream of 
the failure location; that new node would need fast notification of the failure.


Section 4.1.1
-------------
   *  BFD [RFC5880]: Provides fast forwarding path failure detection.
      It can be used for both link and path failure detection, while it
      cannot be used to detect link or path congestion, nor can it
      notify the failure or congestion to other nodes in the network.
For "other nodes", clarify that it's nodes other than the BFD endpoints?
      BFD is preconfigured with periodic message exchange, while fast
      notifications needs to be event-driven.  When the transmit
      interval is set to a small value (e.g., at the level of ms),
      frequent BFD message exchange may become a burden to some systems.
Some platforms can't do BFD at "low ms" interval but I'm assuming that the 
platforms of interest here would have no issue supporting that?
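To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch in 
Python. It is my own illustration, not something from the draft. It assumes a 
symmetric negotiated interval and ignores jitter; per RFC 5880 the 
asynchronous-mode detection time is then roughly the detect multiplier times 
the negotiated interval. The interval, multiplier and session count below are 
made-up values.

```python
def bfd_detection_time_ms(interval_ms: float, detect_mult: int) -> float:
    """Approximate async-mode detection time: multiplier x negotiated interval."""
    return interval_ms * detect_mult


def bfd_tx_packets_per_second(interval_ms: float, sessions: int) -> float:
    """Control packets a node transmits per second across all its sessions."""
    return sessions * 1000.0 / interval_ms


# Hypothetical node with 1000 sessions, detect multiplier 3.
for interval_ms in (10, 1):
    detect = bfd_detection_time_ms(interval_ms, 3)
    pps = bfd_tx_packets_per_second(interval_ms, 1000)
    print(f"interval={interval_ms} ms  detection~{detect} ms  tx={pps:.0f} pps")
```

Going from 10 ms to 1 ms cuts the detection time tenfold but also multiplies 
the periodic packet load by ten, which is the burden the draft text refers to.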
   *  FRR [RFC4090][RFC5714]/Route convergence: Without fast
      notification, the failure detection can take tens of milliseconds,
      followed by either local repair (FRR) or route convergence.  The
      former lacks of global network situation thus may cause congestion
      on the backup paths, while the latter may breach strict
      synchronization deadlines.
Local repair (FRR) is local in that the fast reroute occurs locally (at the 
point of failure detection). But planning for the backup paths is not 
necessarily just a local matter, i.e. the local node has the global view?
Section 4.1.2
-------------
   *  Action-Oriented Response: Upon receiving the notification, routing
      and load balancing mechanisms could instantly shift traffic to
      backup paths or alternative DC interconnects.

That could also cause more congestion elsewhere in the network? If multiple 
nodes get the fast notification and they all decide (at around the same time) 
to use e.g. some alternate paths with highest cost, those paths may also be 
congested? Does that mean that rerouting needs to be triggered from a 
centralized entity (but that means extra delay in reacting to the event)? Is 
that also outside the scope of this document?
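
To make the concern concrete, here is a toy sketch in Python. It is only an 
illustration under my own assumptions: the path names, loads, demands and the 
"pick the least-loaded path" policy are all hypothetical and not taken from the 
draft. It compares nodes that all react to the same stale snapshot with nodes 
that react one at a time and see updated loads.

```python
def reroute_uncoordinated(load, failed, demands):
    """All nodes pick the least-loaded surviving path from the same snapshot."""
    new_load = {p: l for p, l in load.items() if p != failed}
    target = min(new_load, key=new_load.get)  # every node makes the same choice
    for demand in demands:
        new_load[target] += demand
    return new_load


def reroute_sequential(load, failed, demands):
    """Nodes move one at a time; each sees the loads left by the previous one."""
    new_load = {p: l for p, l in load.items() if p != failed}
    for demand in demands:
        target = min(new_load, key=new_load.get)
        new_load[target] += demand
    return new_load


# Hypothetical loads in Gbps on four paths of 100 Gbps each; path D fails and
# three upstream nodes each need to move 30 Gbps off it.
load = {"A": 40, "B": 50, "C": 60, "D": 90}
demands = [30, 30, 30]

print(reroute_uncoordinated(load, "D", demands))  # A ends up at 130 Gbps
print(reroute_sequential(load, "D", demands))     # A=70, B=80, C=90
```

In the uncoordinated case path A is pushed past its capacity even though the 
surviving paths could absorb the traffic if it were spread out.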
Regards,
Reshad.
