Thanks Jeff for your feedback.
I certainly agree with you about the operational complexity of such
solutions.

If there is a plan to hold another AIDC side meeting, we can invite the
proponents of such new solutions (e.g., Cerberus) to discuss their
solutions in detail, including their operational complexity.

It is worth noting that Prof. Torsten Hoefler gave a very interesting
keynote speech on this topic at the FCRC conference last June. You can
access the recording of his talk here:
https://www.youtube.com/watch?v=85p34eBTBWo

Torsten's talk and the two IEEE Computer papers he mentioned at the end of
the talk are really inspiring and insightful.

I strongly recommend inviting Torsten to the next AIDC meeting. He is a
very busy professor, so if there is a plan to hold another AIDC side meeting,
you had better book his time ASAP.

Thanks
Hesham

On Sun, Aug 13, 2023, 7:30 PM Jeff Tantsura <jefftant.i...@gmail.com> wrote:

> Hesham,
>
> Couple of points:
>
> 1. Most of the proponents of new solutions have never actually built or
> operated an ML cluster; operational complexity is vastly underestimated (if
> considered at all).
>
> 2. Given the focus on the highest possible performance, in most cases ML
> clusters are stand-alone fabrics with physically separated North/South
> (communication with the outside world) and East/West (GPU-GPU) networks, and
> our focus is specifically on the GPU fabrics.
> While the traffic patterns are different, the traffic itself is highly
> homogeneous; for IP it would be RoCEv2 (RDMA over UDP), so you don’t need to
> solve “TCP issues”. The traffic patterns are very well understood and usually
> heavily orchestrated.
> 3. They all heavily rely on the notion of a flow, which is not necessarily
> the correct approach. Don’t forget: RDMA (in most operations) is a memory
> block transfer, and within reasonable limits, packets can be placed
> directly into memory independently of the order in which they arrived (see
> the sketch below).
>
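To make points 2 and 3 concrete, here is a minimal sketch in Python. The fixed
RoCEv2 UDP destination port 4791 is real; the packet tuples, the source-port
entropy choice, and the tiny buffer sizes are illustrative assumptions of mine,
not anyone's actual implementation. Each segment of an RDMA WRITE carries the
offset it targets, so the receiver can place payloads directly into the
destination buffer regardless of arrival order:

    import random

    ROCEV2_UDP_DST_PORT = 4791  # RoCEv2 is RDMA carried over UDP to this fixed port

    def segment_rdma_write(message: bytes, mtu: int = 4):
        """Split one RDMA WRITE into (udp_src_port, offset, payload) segments.

        The UDP source port is varied per queue pair to give ECMP some entropy;
        the offset tells the receiver where each payload lands in the buffer.
        """
        src_port = random.randint(49152, 65535)   # illustrative entropy source
        return [(src_port, off, message[off:off + mtu])
                for off in range(0, len(message), mtu)]

    def place_segments(segments, buffer_len: int) -> bytes:
        """Direct placement: write each payload at its offset, order-independent."""
        buf = bytearray(buffer_len)
        for _src_port, offset, payload in segments:
            buf[offset:offset + len(payload)] = payload
        return bytes(buf)

    if __name__ == "__main__":
        msg = b"GPU gradient block"
        segs = segment_rdma_write(msg)
        random.shuffle(segs)                       # segments arrive out of order
        assert place_segments(segs, len(msg)) == msg
        print(f"reassembled despite out-of-order arrival "
              f"(all segments would share UDP dst port {ROCEV2_UDP_DST_PORT})")

The takeaway is that address-based placement, not per-flow in-order delivery,
is what the transfer actually needs.
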
> On Aug 13, 2023, at 10:44 AM, Hesham ElBakoury <helbako...@gmail.com>
> wrote:
>
> The paper "*Cerberus: The Power of Choices in Datacenter Topology
> Design*" was published at SIGMETRICS 2022 [
> https://people.csail.mit.edu/ghobadi/papers/cerberus_sigmetrics_2022.pdf].
>
> "This paper uncovered a potential of serving datacenter traffic with the
> switch technology that best matches its structure. By tapping into this
> potential, we developed a solution, Cerberus, which we showed to
> significantly improve throughput"
>
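As a rough illustration of what "the switch technology that best matches its
structure" could mean: as I read the paper, Cerberus combines static, rotor,
and demand-aware switch layers and steers each flow to the layer that suits
it. Below is a toy Python sketch of that steering decision; the byte
thresholds and the purely size-based rule are simplified placeholders of mine,
not the paper's actual algorithm:

    # Toy sketch of steering each flow to the switch layer that suits it.
    # The three layer names come from the Cerberus paper; the thresholds and
    # the purely size-based rule below are simplified assumptions of mine.

    SMALL_FLOW_BYTES = 1 * 2**20      # hypothetical cutoff for "small" flows
    LARGE_FLOW_BYTES = 100 * 2**20    # hypothetical cutoff for "large" flows

    def choose_switch_layer(flow_bytes: int) -> str:
        """Pick a switch layer for a flow based only on its size (illustrative)."""
        if flow_bytes < SMALL_FLOW_BYTES:
            return "static"        # demand-oblivious expander links: no setup delay
        if flow_bytes < LARGE_FLOW_BYTES:
            return "rotor"         # periodically rotating matchings: good bulk utilization
        return "demand-aware"      # circuit reconfigured for this flow: worth the setup cost

    if __name__ == "__main__":
        for size in (64 * 2**10, 10 * 2**20, 2 * 2**30):
            print(f"{size:>12} bytes -> {choose_switch_layer(size)}")
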
> Perhaps Cerberus is best suited for ML.
>
> Comments?
>
> Hesham
>
> On Fri, Aug 11, 2023, 2:56 AM Hesham ElBakoury <helbako...@gmail.com>
> wrote:
>
>> The need for new DC architectures has been the subject of research for
>> quite some time.
>>
>> I recall that Prof. Brighten Godfrey (UIUC) had a paper on spineless DCs at
>> HotNets 2020 [Spineless Data Centers (acm.org)
>> <https://dl.acm.org/doi/pdf/10.1145/3422604.3425945>]. He presented it
>> in the ONUG research track in 2021; the recording is available here:
>> Spinelessness: The Future of Data Center Networks? - ONUG | ONUG
>> <https://onug.net/events/spinelessness-the-future-of-data-center-networks/>.
>>
>> The abstract of the ONUG talk:
>>
>> "*Leaf-spine and Clos network topologies have become ubiquitous in
>> modern data centers to achieve high throughput for data-intensive
>> applications.  In fact, such designs are not optimal: recent research has
>> developed other topologies, specifically expander graphs, that achieve
>> higher throughput or lower cost, along with potentially easier incremental
>> expansion.  In this talk we’ll explore whether this theoretical performance
>> efficiency can be realized in a practical way to improve enterprise
>> leaf-spine data centers.  This leads to a “spineless” data center, with a
>> single type of switch rather than having separate roles for leafs and
>> spines.  We find that such designs can indeed be more efficient even at
>> small to moderate scale, and we introduce an efficient routing scheme for
>> such networks that uses standard hardware and protocols.  This line of work
>> opens new research directions in topology and routing design that can have
>> significant impact for the most common data centers*."
>>
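For anyone who wants to poke at the expander idea from that abstract, here is
a small Python sketch using networkx. The random regular graph is a standard
stand-in for an expander-like topology, and the switch counts and degrees are
arbitrary choices of mine, only meant to show the two shapes side by side:

    import networkx as nx

    def expander_like(num_switches: int, degree: int, seed: int = 1) -> nx.Graph:
        """Random regular graph: a standard stand-in for an expander-style fabric."""
        g = nx.random_regular_graph(degree, num_switches, seed=seed)
        assert nx.is_connected(g), "re-seed if the sampled graph is disconnected"
        return g

    def leaf_spine(num_leaves: int, num_spines: int) -> nx.Graph:
        """Two-tier leaf-spine: every leaf connects to every spine."""
        return nx.complete_bipartite_graph(num_leaves, num_spines)

    if __name__ == "__main__":
        # 20 switches wired as an expander vs. 16 leaves + 4 spines.
        topologies = {
            "expander-like": expander_like(num_switches=20, degree=4),
            "leaf-spine": leaf_spine(num_leaves=16, num_spines=4),
        }
        for name, g in topologies.items():
            print(f"{name:14s} switches={g.number_of_nodes():3d} "
                  f"links={g.number_of_edges():3d} "
                  f"avg-hops={nx.average_shortest_path_length(g):.2f}")
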
>> Hesham
>> On 8/9/2023 9:32 PM, Yingzhen Qu wrote:
>>
>> Hi Yao,
>>
>> We used the Webex provided by IETF during the side meeting, and the Webex
>> on Jeff's laptop crashed at the end of the meeting. We were told that the
>> recording might be on the IETF Chromebook, but we haven't heard anything yet.
>>
>> Meanwhile, you can access all the slides, including the ones there was no
>> time to present, at: Yingzhen-ietf/AIDC-IETF117: This repository is for
>> all the meeting materials. (github.com)
>> <https://github.com/Yingzhen-ietf/AIDC-IETF117>
>>
>> Thanks,
>> Yingzhen
>>
>> On Tue, Aug 8, 2023 at 6:08 PM <liu.ya...@zte.com.cn> wrote:
>>
>>> Hi Yingzhen,
>>>
>>>
>>> Do we have any recording for this meeting?
>>>
>>>
>>> Thanks,
>>>
>>> Yao
>>>
>>>
_______________________________________________
rtgwg mailing list
rtgwg@ietf.org
https://www.ietf.org/mailman/listinfo/rtgwg
