[rtgwg] Re: draft-dong-fantel-problem-statement

Dongjie (Jimmy) Sun, 04 Jan 2026 23:23:00 -0800

Hi Reshad,

Happy new year!

Thanks a lot for your review comments. Please see some replies inline, marked 
with [Jie]:

From: Reshad Rahman <[email protected]>
Sent: Saturday, January 3, 2026 2:51 AM
To: RTGWG <[email protected]>
Subject: [rtgwg] draft-dong-fantel-problem-statement

Hi,

I took a look at the doc and here are some comments/questions. FYI I haven't 
caught up to all discussions/threads...

Some high level comments first:
- I support this work!

[Jie] Thanks for your support!

- I assume congestion avoidance is not part of this effort? i.e. is upstream 
handling of the fast notifications (e.g. PFC, DCQCN etc) outside of the scope 
of this document?

[Jie] Yes. Although there is one sub-section describing the possible actions 
in response to fast notifications, the detailed action mechanisms are 
considered out of scope.

Section 3
---------

      What is needed is a lightweight signaling method
      that can provide real-time alerts (e.g., at the level of sub-
      milliseconds or milliseconds) on failures, congestion, or
      threshold breaches, enabling immediate actions (e.g., in ms to 10s
      ms ranges) in the network layer.

I think "10s" could be misinterpreted as ten seconds (instead of tens of ms), 
so spell it out, e.g. "in milliseconds to tens of milliseconds".

[Jie] Thanks for catching this, we will rephrase the text to avoid confusion.

Section 4
---------

   Therefore, this draft focuses on mechanisms capable of operating
   within these millisecond/sub-millisecond ranges, rather than
   mechanisms whose latency spans tens or hundreds of milliseconds,
   which are insufficient for preventing transient overload under rapid
   traffic transitions.

Here it says that tens of milliseconds is not good enough, so that contradicts 
what is in section 3?

[Jie] Basically, what we want to say is that the notification needs to be on 
the order of sub-milliseconds or milliseconds, so that the action can be 
finished in milliseconds to tens of milliseconds. We will update the text in 
Sections 3 and 4 to make this clearer.
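
As a rough illustration of the distinction above, the notification latency and 
the action latency can be treated as two separate budgets. This is only a 
sketch: the two upper bounds and the function name meets_budgets are 
assumptions picked for illustration, since the draft gives ranges rather than 
exact limits.

```python
# Assumed illustrative upper bounds; the draft gives ranges, not exact limits.
NOTIFICATION_MAX_MS = 5.0   # "sub-milliseconds or milliseconds"
ACTION_MAX_MS = 50.0        # "milliseconds to tens of milliseconds"


def meets_budgets(notification_ms: float, action_ms: float) -> bool:
    """Return True if the notification and the action each fit their own budget."""
    return notification_ms <= NOTIFICATION_MAX_MS and action_ms <= ACTION_MAX_MS


if __name__ == "__main__":
    # Notification in the sub-millisecond range, action in tens of milliseconds.
    print(meets_budgets(0.5, 20.0))
    # A notification that itself takes tens of milliseconds misses its budget,
    # even though the same value would be acceptable for the action.
    print(meets_budgets(30.0, 20.0))
```

The point is that "tens of milliseconds" is acceptable for the action but not 
for the notification that triggers it.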

Section 4.1
-----------

I believe Fig 1 is not the best representation of the problem space since the 
local failure should be detected quickly? Consider adding a node upstream of 
the failure location, that new node would need fast notification of the failure.

[Jie] Thanks for your suggestion, I agree this figure can be modified to better 
reflect the problem space.

Section 4.1.1
-------------

   *  BFD [RFC5880]: Provides fast forwarding path failure detection.
      It can be used for both link and path failure detection, while it
      cannot be used to detect link or path congestion, nor can it
      notify the failure or congestion to other nodes in the network.

For "other nodes", clarify that it's nodes other than the BFD endpoints?

[Jie] Yes, we will clarify this in the next revision.

      BFD is preconfigured with periodic message exchange, while fast
      notifications needs to be event-driven.  When the transmit
      interval is set to a small value (e.g., at the level of ms),
      frequent BFD message exchange may become a burden to some systems.

Some platforms can't do BFD at "low ms" interval but I'm assuming that the 
platforms of interest here would have no issue supporting that?

[Jie] We plan to remove the text about the possible burden for running BFD at 
ms interval, as this only impacts the time of failure detection, which is not 
the focus of this document.
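
For reference on how the BFD interval affects the time of failure detection: 
in RFC 5880 asynchronous mode the detection time is roughly the detect 
multiplier times the agreed transmit interval. The sketch below is a 
simplification (it ignores interval negotiation and jitter), and the function 
name and the sample intervals are assumptions for illustration only.

```python
def bfd_detection_time_ms(tx_interval_ms: float, detect_mult: int) -> float:
    """Approximate BFD detection time: detect multiplier x agreed interval.

    Simplified from RFC 5880; interval negotiation and jitter are ignored.
    """
    if tx_interval_ms <= 0 or detect_mult < 1:
        raise ValueError("interval must be positive and multiplier at least 1")
    return tx_interval_ms * detect_mult


if __name__ == "__main__":
    # Sample intervals are illustrative, not taken from the draft.
    for interval_ms in (1, 10, 50):
        total = bfd_detection_time_ms(interval_ms, detect_mult=3)
        print(f"interval {interval_ms} ms x 3 -> about {total} ms to detect")
```

With a multiplier of 3, a 10 ms interval already puts detection in the tens of 
milliseconds, before any notification or action takes place.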

   *  FRR [RFC4090][RFC5714]/Route convergence: Without fast
      notification, the failure detection can take tens of milliseconds,
      followed by either local repair (FRR) or route convergence.  The
      former lacks of global network situation thus may cause congestion
      on the backup paths, while the latter may breach strict
      synchronization deadlines.

Local repair (FRR) is local in that the fast reroute occurs locally (at the 
point of failure detection). But planning for the backup paths is not 
necessarily just a local matter, i.e. the local node has the global view?

[Jie] The point of local repair may have a view of the topology, but it may 
not have the congestion/failure information for other paths.

Section 4.1.2
-------------

   *  Action-Oriented Response: Upon receiving the notification, routing
      and load balancing mechanisms could instantly shift traffic to
      backup paths or alternative DC interconnects.

That could also cause more congestion elsewhere in the network? If multiple 
nodes get the fast notification and they all decide (at around the same time) 
to use e.g. some alternate paths with highest cost, those paths may also be 
congested? Does that mean that rerouting needs to be triggered from a 
centralized entity (but that means extra delay in reacting to the event)? Is 
that also outside the scope of this document?

[Jie] What you described is a problem to be considered when a notification is 
sent to multiple recipients. We plan to add some text to capture this in an 
upcoming revision. The specific solution to this problem will be specified in 
a solution document. Coordination by a centralized entity is one possible 
approach, although, as you mentioned, it would cause extra delay.
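
The multiple-recipient problem can be illustrated with a toy model. This is 
only a sketch under stated assumptions: the path names, capacities, loads and 
flow sizes are invented, and the "coordinated" case simply lets each node see 
the load left by earlier decisions, which stands in for any form of 
coordination rather than for a specific mechanism.

```python
# All values below are invented for illustration.
CAPACITY = {"A": 100, "B": 100, "C": 100}      # alternate path capacities
INITIAL_LOAD = {"A": 40, "B": 50, "C": 60}     # load before the notification
SHIFTED_FLOWS = [30, 30, 30]                   # traffic each notified node moves


def best_path(load):
    """Pick the alternate path with the most spare capacity in the given view."""
    return max(CAPACITY, key=lambda path: CAPACITY[path] - load[path])


def reroute(flows, coordinated):
    """Return the resulting per-path load after every node has rerouted.

    Uncoordinated nodes all decide from the same pre-notification snapshot;
    coordinated nodes see the load left by the decisions made before theirs.
    """
    load = dict(INITIAL_LOAD)
    snapshot = dict(INITIAL_LOAD)
    for demand in flows:
        view = load if coordinated else snapshot
        load[best_path(view)] += demand
    return load


def overloaded(load):
    """List the paths whose load exceeds their capacity."""
    return sorted(path for path in load if load[path] > CAPACITY[path])


if __name__ == "__main__":
    for mode in (False, True):
        result = reroute(SHIFTED_FLOWS, coordinated=mode)
        print("coordinated" if mode else "uncoordinated", result, overloaded(result))
```

In this toy setup the uncoordinated nodes all choose path A and overload it, 
while the coordinated nodes spread across A, B and C. The model does not 
account for the extra delay that coordination would add.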

Best regards,

Jie

Regards,

Reshad.

_______________________________________________
rtgwg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
