Ketan,

Many thanks for your decisive leadership on this very important problem. 
(Thanks too to the folks who brought this to the IETF.) The first thing I’d 
like to say is that it is high time the IETF tackled the broader issues of 
building efficient, resilient, responsive, and performant HPC Clusters, and in 
particular, AI (ML) Clusters. These are multi-hop networks that operate at 
Layer 3, but many current attempts to solve the problem approach it at Layer 2, 
as an Ethernet problem. Some of the problems may indeed be fixable at the 
Ethernet layer, but many quite crucial ones require Layer 3 solutions.

I would like to see the IETF look at the problem more holistically, from an 
end-to-end point of view (workflow-wise as well as traffic-wise). I have a 
contribution to the RTGWG to get a higher-level conversation going. I 
understand this problem will be tackled in RTGWG until we have a better sense 
of the charter, and the IETF may then propose a WG-forming BoF if appropriate.

My contribution suggests that, in parallel with tackling the problem of 
congestion detection and notification, the IETF also focus on reducing the 
probability that congestion occurs in the first place; similarly for network 
failures. The approach proposes scheduling network resources in conjunction 
with compute resources; scheduling the network resources should reduce the 
chance of congestion, and can also put protection paths in place should a node 
or link fail.

Pavan Beeram will be presenting the elements of this idea in RTGWG. The hope is 
that the ensuing discussion will help the group come up with a holistic 
solution to optimizing network resources (in backend networks, in DCI networks 
and in the WAN) for ML workloads.

Cheers,
Kireeti.
_______________________________________________
rtgwg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
