Hesham,

A couple of points:

1. Most proponents of new solutions have never actually built or operated an ML cluster; operational complexity is vastly underestimated (if considered at all).

2. Given the focus on the highest possible performance, in most cases ML clusters are stand-alone fabrics with physically separate North/South (communication with the outside world) and East/West (GPU-GPU) networks, and our focus is specifically on the GPU fabrics. While the traffic patterns are different, the traffic itself is highly homogeneous: for IP it would be RoCEv2 (RDMA over UDP), so you don’t need to solve “TCP issues”. The traffic patterns are very well understood and usually heavily orchestrated.

3. They all rely heavily on the notion of a flow, which is not necessarily the correct approach. Don’t forget that RDMA (in most operations) is a memory block transfer and, within reasonable limits, packets can be placed directly into memory independently of the order in which they arrived.

> On Aug 13, 2023, at 10:44 AM, Hesham ElBakoury <[email protected]> wrote:
>
> In Sigmetrics 2022 the paper "Cerberus: The Power of Choices in Datacenter
> Topology Design" was published
> [https://people.csail.mit.edu/ghobadi/papers/cerberus_sigmetrics_2022.pdf].
>
> "This paper uncovered a potential of serving datacenter traffic with the
> switch technology that best matches its structure. By tapping into this
> potential, we developed a solution, Cerberus, which we showed to
> significantly improve throughput"
>
> Perhaps Cerberus is best suited for ML.
>
> Comments?
>
> Hesham
>
> On Fri, Aug 11, 2023, 2:56 AM Hesham ElBakoury <[email protected]
> <mailto:[email protected]>> wrote:
>> The need for new DC architecture has been the subject of research for quite
>> sometime.
>>
>> I recall that Prof. Brighten (UCIC) had a paper on spineless DC in HotNet
>> 2021 [Spineless Data Centers (acm.org)
>> <https://dl.acm.org/doi/pdf/10.1145/3422604.3425945>].
>> He presented it in ONUG research track in 2021 [the recording is available here:
>>
>> Spinelessness: The Future of Data Center Networks? - ONUG | ONUG
>> <https://onug.net/events/spinelessness-the-future-of-data-center-networks/>
>> The abstract of the ONUG talk:
>>
>> "Leaf-spine and Clos network topologies have become ubiquitous in modern
>> data centers to achieve high throughput for data-intensive applications. In
>> fact, such designs are not optimal: recent research has developed other
>> topologies, specifically expander graphs, that achieve higher throughput or
>> lower cost, along with potentially easier incremental expansion. In this
>> talk we’ll explore whether this theoretical performance efficiency can be
>> realized in a practical way to improve enterprise leaf-spine data centers.
>> This leads to a “spineless” data center, with a single type of switch rather
>> than having separate roles for leafs and spines. We find that such designs
>> can indeed be more efficient even at small to moderate scale, and we
>> introduce an efficient routing scheme for such networks that uses standard
>> hardware and protocols. This line of work opens new research directions in
>> topology and routing design that can have significant impact for the most
>> common data centers."
>>
>> Hesham
>>
>> On 8/9/2023 9:32 PM, Yingzhen Qu wrote:
>>> Hi Yao,
>>>
>>> We used the webex provided by IETF during the side meeting, and the Webex
>>> on Jeff's laptop crashed at the end of the meeting. We were told that the
>>> recording might be on the IETF chromebook, but haven't heard anything yet.
>>>
>>> Meanwhile you can have access to all the slides including the ones that
>>> didn't get time to present at: Yingzhen-ietf/AIDC-IETF117: This repository
>>> is for all the meeting materials.
>>> (github.com)
>>> <https://github.com/Yingzhen-ietf/AIDC-IETF117>
>>>
>>> Thanks,
>>> Yingzhen
>>>
>>> On Tue, Aug 8, 2023 at 6:08 PM <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>> Hi Yingzhen,
>>>>
>>>> Do we have any recording for this meeting?
>>>>
>>>> Thanks,
>>>>
>>>> Yao
>>>
>>> _______________________________________________
>>> rtgwg mailing list
>>> [email protected] <mailto:[email protected]>
>>> https://www.ietf.org/mailman/listinfo/rtgwg
> _______________________________________________
> rtgwg mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/rtgwg
