[rtgwg] Re: Looking at draft-dong-fantel-problem-statement

Adrian Farrel Sat, 20 Dec 2025 09:38:23 -0800

Hi Jie,

I realise I didn't respond to this.


Thanks for your time considering my comments. All of your responses seem
good, and I look forward to the next revision.

Best,
Adrian

-----Original Message-----
From: Dongjie (Jimmy) <[email protected]> 
Sent: 10 December 2025 09:44
To: [email protected]; [email protected]
Cc: 'rtgwg' <[email protected]>; FANTEL <[email protected]>
Subject: RE: Looking at draft-dong-fantel-problem-statement

Hi Adrian, 

Thanks a lot for your interest and detailed review. Yes, we would like to
consolidate the problem statement of FANTEL and reach some consensus in RTG,
and your comments are very helpful. 

Please find some replies inline: 


> -----Original Message-----
> From: Adrian Farrel <[email protected]>
> Sent: Monday, December 8, 2025 9:07 PM
> To: [email protected]
> Cc: 'rtgwg' <[email protected]>
> Subject: Looking at draft-dong-fantel-problem-statement
> 
> Hi,
> 
> I've been reading draft-dong-fantel-problem-statement. I think there is a
> lot of interesting stuff in Fantel, and I'd like to see this problem
> statement consolidated and pick up consensus.
> 
> My review threw up a few significant points, and a raft of editorials. I'd
> be happy to discuss them more or see a new revision.
> 
> Best,
> Adrian
> ===
> 
> Section 4.1 is presented as an example, so the text that follows in 4.1.1
> should be limited to the tools that apply to the example, and not a general
> description of the tools (that material is found elsewhere in the document).
> So...
> 
> BFD
> This paragraph is all true, but it should be better focussed on the
> example in Figure 1. Congestion notification is not really part of that
> example (although it is true that BFD doesn't help with it). So stick to:
> - Speed of BFD propagation
> - Requirement to be running BFD with a very short cycle
> - Load issues this may create
> I am a little sceptical of the load concern because if the link is
> ultra-high bandwidth (as you'd want in this example) and the number of such
> links is small (as is likely in this example), running BFD on a very short
> cycle is unlikely to cause a load issue.
> 
> ECN
> This paragraph doesn't apply to the example at all.

Thanks for the suggestion, we will update the descriptions of the mechanisms
used in the example.
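
As a rough illustration of the load point raised above, the link-bandwidth
cost of running BFD on a short cycle can be estimated with simple arithmetic.
The sketch below is an estimate under stated assumptions, not something taken
from the draft: the 3.3 ms interval, the detect multiplier of 3, the 400 Gb/s
link, and the IPv4-over-Ethernet encapsulation are all illustrative values,
and the function names are hypothetical. It treats detection time as
approximately the transmit interval times the detect multiplier, and it
covers only link bandwidth; per-packet processing load on the nodes is a
separate question that this calculation does not address.

```python
# Rough estimate only: every input here is an illustrative assumption.
BFD_CONTROL_BYTES = 24        # mandatory section of a BFD control packet, no authentication
UDP_BYTES = 8                 # UDP header
IPV4_BYTES = 20               # IPv4 header without options (assumed encapsulation)
ETHERNET_OVERHEAD_BYTES = 38  # header 14 + FCS 4 + preamble/SFD 8 + inter-frame gap 12


def bfd_detection_time_ms(tx_interval_ms: float, detect_mult: int) -> float:
    """Approximate detection time: transmit interval times detect multiplier."""
    return tx_interval_ms * detect_mult


def bfd_link_share(tx_interval_ms: float, link_bps: float) -> float:
    """Fraction of link capacity used by one direction of one BFD session."""
    wire_bytes = (BFD_CONTROL_BYTES + UDP_BYTES + IPV4_BYTES
                  + ETHERNET_OVERHEAD_BYTES)
    packets_per_second = 1000.0 / tx_interval_ms
    return packets_per_second * wire_bytes * 8 / link_bps


if __name__ == "__main__":
    interval_ms = 3.3    # assumed short BFD cycle
    multiplier = 3       # assumed detect multiplier
    link_bps = 400e9     # assumed 400 Gb/s link
    print("detection time (ms):", bfd_detection_time_ms(interval_ms, multiplier))
    print("share of link capacity:", bfd_link_share(interval_ms, link_bps))
```

With these assumed values, detection takes roughly 10 ms and the BFD packets
occupy less than one millionth of the link capacity.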

> There is a feeling of repetition. When I got to 4.1.1 and 4.1.2 it felt
> that I had already seen this message in section 3. I think you need to
> separate things out and re-order the document.
> 
> Addressing the previous point will help with this.
> 
> "Why Fast Network Notification is Needed" should come first, but it should
> talk in technology-independent terms about the requirements. Then "The
> Problem with Existing Notification Mechanisms" can explain why new
> mechanisms are needed by showing the challenges with existing tools, and
> include the example. And finally "Fast Network Notifications Detailed
> Problem Statement" can present the details.

Thanks for letting us know your feeling of repetition between section 3 and
section 4. We will reorganize the sections and the content, and make the
text following the example only about the mechanisms used in that example.

> I think section 5.4 (with some interaction with 5.2) is the place to
> discuss multiple recipients of the same notification. This is a classic
> problem even in old alarm-based systems, and mechanisms are either designed
> into the error reporting (such as APS) or are coordinated by the error
> processing system (such as through an alarm management system).
> 
> However, I think you are targeting a different environment. That is, you
> are not limited to the specific and simple topologies that are consistent
> with APS, you are not looking only to achieve end-to-end protection
> switching, and you envisage propagating the notifications to quite a number
> of recipients.
> 
> The challenge becomes, what happens if multiple nodes all react to the
> notification, making changes to traffic flows, and interacting with the
> network in different ways?
> 
> I think this either needs coordination ("central control" at the level of
> SDN) or careful pre-planning.
> 
> Probably, this section does not need to fully resolve this question, but
> it should be raised as an issue so that the solutions work will take it
> into account.

[Jie] This is a good point. Some notifications may only be for one
particular recipient, while some could be sent to multiple recipients. Also
one recipient may receive multiple notifications from different senders (so
that it has more information about the current network status).

[Jie] Section 5.2 mentions that the mechanism to determine the range of
recipients needs to be considered. And I agree that section 5.4 could also
mention that depending on the range and number of recipients, an action may
be taken by one or multiple recipients. The sender of a notification needs to
take this into consideration.

> Section 8 is good as far as it goes. I think you can add to it by thinking
> in a paranoid way!
> 
> - Could the notifications reveal information about the network that is
>   intended to be private but now made visible to external snooping?
>   - Possibly by inspecting notifications
>   - Possibly by registering as a consumer of the notifications
> - Could an attacker (or a misconfiguration) cause the reporting system
>   to become overwhelmed (perhaps by making it look like notifications
>   should be sent everywhere, or by flapping a resource) to the extent
>   that important notifications are lost, or the ability to control the
>   system is broken?

[Jie] Agreed with these security considerations. This is especially
important for inter-DC/WAN scenarios IMO.


> == Editorial (obviously, non-blocking) ==
> 
> Abstract
> 
> Nothing wrong here, but I'd swap some things around to make it clear why
> AI training and real-time services need these features.
> 
>    Modern networks require adaptive traffic manipulation including
>    Traffic Engineering (TE), load balancing, flow control, and
>    protection, to support high-throughput, low-latency, and lossless
>    applications such as AI training and real-time services.
> 
>    A good and timely understanding of network operational status, such
>    as congestion and failures, can help to improve network utilization,
>    enable the selection of paths with reduced latency, and enable faster
>    response to critical events.  This document describes the existing
>    problems and why the IETF may need a new set of fast network
>    notification solutions.

This looks better, many thanks for help rephrasing it. 

> 1.
> 
> OLD
>    This document summarizes the limitations of existing mechanisms that
>    prevent rapid notification and action to critical network events,
>    including link or node failures and congestion.
> NEW
>    This document summarizes the limitations of existing mechanisms that
>    prevent them being used for rapid notification of critical network
>    events, including link or node failures and congestion.
> END

Looks good, thanks for the suggestion. 

> 1.
> 
> s/In the context of this draft/In the context of this document/

OK

> 1.
> 
>    This document describes why the IETF may need a new set of fast
>    network notification related solutions to support these use cases.
> 
> I don't think the IETF has "needs" like that. Possibly vendors need
> protocol tools to deliver function for the operators?
> 
> I'd also say that "may need" is too weak. s/may need/needs/

OK, we will rephrase the sentence. 

> 1.1 Needs to be updated to the correct boilerplate.
> 
>    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
>    "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
>    "OPTIONAL" in this document are to be interpreted as described in
>    BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
>    capitals, as shown here.
> 
> You'll need to add a reference for RFC 8174.

We will update the boilerplate, thanks. 

> I'd add the references into Section 2. I'd also put the terms in
> alphabetical order.
> 
>    BFD: Bidirectional Forwarding Detection [RFC5880]
> 
>    ECN: Explicit Congestion Notification [RFC3168]
> 
>    FRR: Fast Re-Route [RFC4090][RFC5714]
> 
>    IOAM: In-situ Operations, Administration, and Maintenance [RFC9197]

OK, will follow this order in the next revision.

> 3.
> 
> s/has deficiencies/have deficiencies/
> 
> s/is proposed as a/is a/

OK. 

> 3.
> 
>    There is
>    a demonstrable need for a standardized framework in IETF to define
>    these fast network notification mechanisms, requirements and
>    integration strategies.
> 
> Again, don't tell the IETF what it needs. Tell us about what implementers
> and operators need. You might use this text in a BoF proposal, but not in a
> document that will be published as an RFC.

Makes sense to me, we will rephrase the text. 

> 3.
> 
> s/The following describes a summary/There follows a summary/

OK. 

> 3.
> 
>    *  Slow Reaction:
> 
> I wonder about this term. I think it is "Slow Dissemination:" because the
> problem is not that the node detecting the fault is slow to react.

Indeed, slow dissemination is more accurate.

> OLD
>       What is needed is a lightweight signaling method
>       that can provide real-time alerts (e.g., at the level of sub-
>       milliseconds or milliseconds) on failures, congestion, or
>       threshold breaches, enabling immediate actions (e.g., in ms to 10s
>       ms ranges) in the network layer.
> NEW
>       What is needed is a lightweight signaling method
>       that can provide real-time alerts (e.g., at the sub-millisecond
>       level or in the order of a few milliseconds) on failures,
>       congestion, or threshold breaches, enabling prompt actions (e.g.,
>       in the range of a millisecond to 10s of milliseconds) in the
>       network layer.
> END

The new text looks good, thanks. 

> 3.
> 
> s/capacity , or/capacity, or/
> s/reports but/reports, but/
> s/load-balancing, flow-control/load-balancing, flow-control,/
> s/and can lead/and leading/

Acked.

> 3.
> 
>       The local view of network status prevents precise and globally
>       optimized decisions and adjustments.  It would be helpful to send
>       fast network notifications to upstream nodes which can perform
>       action based on the view of regional or global network conditions.
> 
> The second sentence helps to explain, but I think that the term "global
> optimization" has a wider concept and can only be achieved using
> centralized computation and a complete view of all network conditions and
> traffic flows.
> I think you are trying to do something in between.
> That is, you are trying to have a node make decisions about how to steer
> traffic that it is responsible for, with awareness of a set of network
> conditions that are relevant to the paths it might choose. So I would
> suggest
> 
> NEW
>       This local view of network status prevents precise and optimized
>       decisions and adjustments.  It would be helpful to send fast
>       network notifications to upstream nodes so that they can perform
>       action based on a wider view of network conditions.
> END

Agree that "global view" and "global optimization" may not be the case in
this document, wider view is more accurate. 

> 3.
> 
> s/(e.g. routing/(e.g., routing/
> s/(e.g. AI workloads/(e.g., AI workloads/

Acked. 

> 4.
> 
> s/In particular, failure-detection/Failure-detection/
> s/(fine-grained vs.  coarse-grained)/(fine-grained vs. coarse-grained)/
> s/Therefore, this draft/Therefore, this document/

Acked.

> 4.1
> 
> I think the terms AI, ML, and GPU need expansion on first use.

OK, will expand these terms on first use.

> 4.1.1 BFD
> 
> OLD
>       The
>       former lacks of global network situation thus may cause congestion
>       on the backup paths, while the latter may breach strict
>       synchronization deadlines.
> NEW
>       The
>       former lacks visibility of the global network situation and thus
>       may cause congestion on the backup paths, while the latter may
>       breach strict synchronization requirements of the AI/ML
>       application.
> END

Sounds better, thanks. 

> 4.1.2
> 
> s/can be affected/might be affected/
> 
> But I wonder how the nodes adjacent to the failure know which node to
> notify. Obviously, all adjacent nodes (except any connected only by the
> failed link). But the whole point seems to be to propagate the notification
> further.

IMO this is related to the determination of the recipients as described in
section 5.2. The sender of the notification may use some information to
determine the possible recipients which should take action on the event. The
mechanism is something that needs to be worked on.

> 5.1
> 
> s/timely actioned/actioned in a timely manner/

OK, thanks. 

> 5.2
> 
> s/recipients:/recipient:/
> s/functional consumers:/functional consumers./
> s/in the figure above/in Figure 2/

OK, thanks.

> 5.2
> 
> Tables 1 and 2 have a column "Example Benefit." It is unclear "benefit of
> what, and to whom." I think you can handle this with a little more
> introductory text, like...
> 
>    The tables have three columns.  The first column lists the type of
>    node or type of application/function.  The second shows the role that
>    the node or application/function is responsible for within the
>    network that could benefit from fast network notifications.  The
>    third column indicates examples of how fast notification could
>    benefit the node/application/function in filling its role.

The introduction text looks good, thanks. 

> Table 2 has...
> 
>    | Traffic Engineering   | Centralized   | Pre-compute new paths     |
>    | Element (PCE)         | optimization  | before congestion         |
>    |                       |               | propagates                |
> 
> It is true that this is one role of the PCE, and also true that a PCE is a
> component of a "traffic engineering element." But I think that it is not
> the primary role. Perhaps, in the paragraph I suggested above, it should
> say that the second column shows an example of the role.

OK, will make it clearer that the second column shows an example of the role.

> It looks like there is an implication in Figure 2 that notifications flow
> from data plane to control plane to management plane to application plane.
> I hope that isn't your intention, because I don't think that is how things
> work.
> 
> Maybe the figure is just a list of four categories of notification
> recipient without the arrows?

Agreed, the arrows can be removed to avoid possible confusion. 

> 5.2
> 
> "near-instantaneous"
> 
> Same concern as before: everything is relative, but "near instantaneous"
> is probably going to attract the wrong response from people. Maybe, "very
> quick," or even, "very, very quick."

OK, will replace it with other words to show it is "very quick". 

> 5.2
> 
> s/something needs/something that needs/

OK, thanks. 

> 5.3
> 
> s/recipient noded/recipient node/

Thanks for catching the typo. 

> 5.4
> 
> OLD
>    The possible actions to the notification can be but not
>    limit to one or multiple of the following:
> NEW
>    The possible actions in response to the notification can be, but are
>    not limited to, one or more of the following:
> END
 
Looks better, thanks.

> I'm not sure Section 6 adds to the draft.

OK, we will consider merging the content with relevant sections.


_______________________________________________
rtgwg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
