Hi,
I took a look at the doc and here are some comments/questions. FYI, I haven't
caught up on all the discussions/threads yet.
Some high-level comments first:
- I support this work!
- I assume congestion avoidance is not part of this effort? I.e., is upstream
  handling of the fast notifications (e.g. PFC, DCQCN etc.) outside the scope
  of this document?
Section 3
---------
   What is needed is a lightweight signaling method
   that can provide real-time alerts (e.g., at the level of sub-
   milliseconds or milliseconds) on failures, congestion, or
   threshold breaches, enabling immediate actions (e.g., in ms to 10s
   ms ranges) in the network layer.
I think "10s" could be misinterpreted as ten seconds (instead of tens of ms),
so spell it out, e.g. "in milliseconds to tens of milliseconds".
Section 4
---------
   Therefore, this draft focuses on mechanisms capable of operating
   within these millisecond/sub-millisecond ranges, rather than
   mechanisms whose latency spans tens or hundreds of milliseconds,
   which are insufficient for preventing transient overload under rapid
   traffic transitions.
Here it says that tens of milliseconds is not good enough, which seems to
contradict Section 3 (where actions in the range of milliseconds to tens of
milliseconds are considered immediate)?
Section 4.1
-----------
I believe Fig 1 is not the best representation of the problem space, since the
local failure should be detected quickly? Consider adding a node upstream of
the failure location; that new node would need fast notification of the failure.
Section 4.1.1
-------------
   * BFD [RFC5880]: Provides fast forwarding path failure detection.
     It can be used for both link and path failure detection, while it
     cannot be used to detect link or path congestion, nor can it
     notify the failure or congestion to other nodes in the network.
For "other nodes", clarify that it's nodes other than the BFD endpoints?
     BFD is preconfigured with periodic message exchange, while fast
     notifications needs to be event-driven. When the transmit
     interval is set to a small value (e.g., at the level of ms),
     frequent BFD message exchange may become a burden to some systems.
Some platforms can't do BFD at "low ms" intervals, but I'm assuming that the
platforms of interest here would have no issue supporting that?
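To put rough numbers on that trade-off, here is a back-of-the-envelope sketch
in Python. It assumes the simplified RFC 5880 asynchronous-mode relation
(detection time = detect multiplier x negotiated interval) and ignores transmit
jitter. The function names and the example values (10 ms interval, multiplier
3, 1000 sessions) are my own illustrative assumptions, not taken from the draft.

```python
def bfd_detection_time_us(interval_us: int, detect_mult: int) -> int:
    """Approximate time to declare a session down, in microseconds.

    Simplification: uses a single negotiated interval and ignores jitter.
    """
    return interval_us * detect_mult


def bfd_tx_packets_per_second(interval_us: int, sessions: int) -> float:
    """Control packets one system transmits per second across all sessions."""
    return sessions * 1_000_000 / interval_us


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    interval_us, detect_mult, sessions = 10_000, 3, 1000
    print("detection time (us):", bfd_detection_time_us(interval_us, detect_mult))
    print("tx packets/s:", bfd_tx_packets_per_second(interval_us, sessions))
```

Under these assumptions, a 10 ms interval with multiplier 3 already puts
detection at roughly 30 ms, i.e. in the "tens of milliseconds" range, while the
periodic packet load grows linearly with the number of sessions.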
   * FRR [RFC4090][RFC5714]/Route convergence: Without fast
     notification, the failure detection can take tens of milliseconds,
     followed by either local repair (FRR) or route convergence. The
     former lacks of global network situation thus may cause congestion
     on the backup paths, while the latter may breach strict
     synchronization deadlines.
Local repair (FRR) is local in that the fast reroute occurs locally (at the
point of failure detection). But planning for the backup paths is not
necessarily just a local matter, i.e. the local node has the global view?
Section 4.1.2
-------------
   * Action-Oriented Response: Upon receiving the notification, routing
     and load balancing mechanisms could instantly shift traffic to
     backup paths or alternative DC interconnects.
That could also cause more congestion elsewhere in the network? If multiple
nodes get the fast notification and they all decide (at around the same time)
to use e.g. some alternate paths with highest cost, those paths may also be
congested? Does that mean that rerouting needs to be triggered from a
centralized entity (but that means extra delay in reacting to the event)? Is
that also outside the scope of this document?
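A toy illustration of the concern, as a minimal Python sketch. Everything in it
is hypothetical (the path names, loads, capacities and the "most headroom wins"
rule are my assumptions, not anything from the draft). It compares nodes that
all react to the same stale snapshot with nodes whose choices are serialized,
as a stand-in for some form of coordination.

```python
def reroute(path_load, path_capacity, demands, coordinated):
    """Place each rerouted demand on the path with the most headroom.

    coordinated=False: every node decides from the same pre-event snapshot.
    coordinated=True:  each decision sees the load added by earlier ones.
    """
    load = dict(path_load)
    snapshot = dict(path_load)
    for demand in demands:
        view = load if coordinated else snapshot
        best = max(view, key=lambda p: path_capacity[p] - view[p])
        load[best] += demand
    return load


def overloaded(load, path_capacity):
    """Return the sorted names of paths whose load exceeds capacity."""
    return sorted(p for p in load if load[p] > path_capacity[p])


if __name__ == "__main__":
    # Hypothetical numbers: two alternate paths, four nodes shifting 25 each.
    capacity = {"A": 100, "B": 100}
    current = {"A": 40, "B": 50}
    demands = [25, 25, 25, 25]
    for mode in (False, True):
        result = reroute(current, capacity, demands, coordinated=mode)
        print("coordinated" if mode else "independent", result,
              "overloaded:", overloaded(result, capacity))
```

With these made-up numbers, the independent case piles all the shifted traffic
onto path A and overloads it, while the serialized case fits within capacity.
It says nothing about how coordination would be achieved or what delay it adds.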
Regards,
Reshad.