Hi Robert,

Thanks for your comments.

The TE approach has been adopted by several hyperscale OTT companies in their AI networks. However, this semi-automated approach is not optimal, especially in AI cloud scenarios where multiple tenants share the same AI-related IaaS resources. Otherwise, it would make little sense for so many hyperscalers, including those that have adopted TE for their AI networks, to launch the UEC consortium (see https://www.nextplatform.com/2023/07/20/ethernet-consortium-shoots-for-1-million-node-clusters-that-beat-infiniband/).

As for how to propagate link capacity and even available capacity information, it is preferable to leverage the already defined OSPF or IS-IS TE metrics or extended TE metrics where possible.

Best regards,
Xiaohu

From: Robert Raszuk <[email protected]>
Date: Friday, November 24, 2023 01:42
To: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [Lsr] Fwd: New Version Notification for draft-xu-lsr-fare-00.txt

Hi Xiaohu,

I would be interested to learn why the described elephant flows couldn't instead be handled in typical TE fashion (choose your TE flavor), since we know the target AI training clusters at the CLOS edges. Such TE could steer the flows link by link or inject them into disjoint virtual topologies.

That said, I am quite sceptical when it comes to injecting lots of dynamic application performance data into any routing protocol. Yes, it has been done in the past (see RFC 7810), and you do need to explain why essentially similar additional data of this kind is needed.

Kind regards,
Robert

On Thu, Nov 23, 2023 at 5:27 PM [email protected] <[email protected]> wrote:

Hi all,

Any comments or suggestions are welcome.

Best regards,
Xiaohu

From: [email protected] <[email protected]>
Date: Friday, November 24, 2023 00:13
To: Xiaohu Xu <[email protected]>
Subject: New Version Notification for draft-xu-lsr-fare-00.txt

A new version of Internet-Draft draft-xu-lsr-fare-00.txt has been successfully submitted by Xiaohu Xu and posted to the IETF repository.

Name:     draft-xu-lsr-fare
Revision: 00
Title:    Fully Adaptive Routing Ethernet
Date:     2023-11-22
Group:    Individual Submission
Pages:    7
URL:      https://www.ietf.org/archive/id/draft-xu-lsr-fare-00.txt
Status:   https://datatracker.ietf.org/doc/draft-xu-lsr-fare/
HTMLized: https://datatracker.ietf.org/doc/html/draft-xu-lsr-fare

Abstract:

Large language models (LLMs) like ChatGPT have become increasingly popular in recent years due to their impressive performance in various natural language processing tasks. These models are built by training deep neural networks on massive amounts of text data, often consisting of billions or even trillions of parameters. However, the training process for these models can be extremely resource-intensive, requiring the deployment of thousands or even tens of thousands of GPUs in a single AI training cluster. Therefore, three-stage or even five-stage CLOS networks are commonly adopted for AI networks. The non-blocking nature of the network becomes increasingly critical for large-scale AI models. Therefore, adaptive routing is necessary to dynamically load balance traffic to the same destination over multiple ECMP paths, based on network capacity and even congestion information along those paths.

The IETF Secretariat

_______________________________________________
Lsr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/lsr
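
For readers following the thread, here is a rough sketch of what the abstract means by load balancing traffic over multiple ECMP paths based on network capacity. It is only an illustration under stated assumptions and is not the mechanism specified in draft-xu-lsr-fare: the function name `pick_path`, the path names, and the capacity figures are all hypothetical, and the capacities are assumed to have been learned already (for example from TE metrics flooded by the IGP).

```python
import hashlib


def pick_path(flow_id: str, capacities: dict[str, float]) -> str:
    """Map a flow to one equal-cost path, weighted by path capacity.

    capacities maps a hypothetical path name to its capacity (e.g. in Gb/s).
    The same flow_id always maps to the same path, so packets of one flow
    are not reordered, while flows spread roughly in proportion to capacity.
    """
    total = sum(capacities.values())
    if total <= 0:
        raise ValueError("at least one path must have positive capacity")

    # Hash the flow identifier to a point in [0, total].
    digest = hashlib.sha256(flow_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * total

    # Walk the paths in a fixed order and return the one whose
    # cumulative capacity range contains the point.
    names = sorted(capacities)
    cumulative = 0.0
    for name in names:
        cumulative += capacities[name]
        if point < cumulative:
            return name
    # Only reached if floating-point rounding puts the point at the top edge.
    return names[-1]
```

With assumed capacities such as `{"spine1": 400.0, "spine2": 100.0}`, about four out of five flows would be placed on `spine1`. Reacting to congestion, which the abstract also mentions, would additionally require updating the weights from available rather than nominal capacity; that part is not shown here.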
