On Tue, Jan 13, 2026 at 10:22 AM Haoyu Song <[email protected]>
wrote:
>
>
>
> -----Original Message-----
> From: Tom Herbert <[email protected]>
> Sent: Monday, January 12, 2026 6:13 PM
> To: Haoyu Song <[email protected]>
> Cc: dave seddon <[email protected]>; int-area <[email protected]>
> Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network Header
(SUNH)
>
> On Mon, Jan 12, 2026 at 4:54 PM Haoyu Song <[email protected]>
wrote:
> >
> >
> >
> > inline
> >
> > From: Tom Herbert <[email protected]>
> > Sent: Monday, January 12, 2026 4:31 PM
> > To: Haoyu Song <[email protected]>
> > Cc: dave seddon <[email protected]>; int-area
> > <[email protected]>
> > Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network
> > Header (SUNH)
> >
> >
> >
> >
> >
> > On Mon, Jan 12, 2026, 3:47 PM Haoyu Song <[email protected]>
wrote:
> >
> > Hi Tom,
> >
> > Please see my response inline.
> >
> > Best regards,
> > Haoyu
> >
> > -----Original Message-----
> > From: Tom Herbert <[email protected]>
> > Sent: Saturday, January 10, 2026 7:55 AM
> > To: Haoyu Song <[email protected]>
> > Cc: dave seddon <[email protected]>; [email protected]
> > Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network
> > Header (SUNH)
> >
> > On Fri, Jan 9, 2026 at 4:15 PM Haoyu Song <[email protected]>
wrote:
> > >
> > > Hi Dave,
> > >
> > > Thank you for the comments. We are on the same page that a compact
header with just enough address bits is critical in AI DCN (I would argue
this also applies to the scale-out networks).
> > >
> > > I want to further discuss two points:
> >
> > Hi Haoyu, thanks for the discussion!
> >
> > >
> > > 1. The variable size address isn't that "scary" actually. We have
verified the scheme with P4 and it's doable. Once it's realized in switch
ASIC, there are no performance implications at all.
> >
> > The size of addresses in lookups will have performance implications and
cost effects as well. For instance, with 16 bit addresses a switch could do
route lookup with a simple array lookup in SRAM, for 32 bit addresses we
need a CAM or TCAM, for 128 bit addresses we need a CAM or TCAM 4x the size
of the one for IPv4.
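To make the array-lookup point concrete, here is a minimal sketch of a direct-indexed forwarding table for 16 bit addresses. The table layout, the `NO_ROUTE` marker, and the function names are assumptions for illustration only; nothing here comes from the SUNH draft.

```python
# Direct-indexed forwarding table: one slot per possible 16-bit address,
# so a route lookup is a single array read (no CAM/TCAM search needed).
ADDRESS_BITS = 16
NO_ROUTE = -1  # hypothetical marker meaning "no forwarding entry"

forwarding_table = [NO_ROUTE] * (1 << ADDRESS_BITS)  # 65536 slots


def add_route(dest: int, port: int) -> None:
    """Install an output port for a 16-bit destination address."""
    forwarding_table[dest] = port


def lookup(dest: int) -> int:
    """Return the output port for dest, or NO_ROUTE if none is installed."""
    return forwarding_table[dest]


add_route(0x0102, 7)
```

The same approach with 32 bit addresses would need 2^32 slots, which is why a CAM or TCAM comes into play at that size.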
> >
> > [HS] The hierarchical addressing scheme never looks up full addresses.
> > At each level, it only searches the prefix assigned to the level. For
> > example, each cluster has 1K nodes and we have 1K clusters in a DC.
> > The lowest level node in a cluster has a 10bit address and 10bit
> > prefix. In a cluster, the nodes only use 10bit addresses. If one needs
> > to talk to another node in another cluster, its address needs to be
> > augmented with the 10bit prefix (but the prefix is only stored in the
> > gateway switch, which is oblivious to the nodes in a cluster).
> > Finally, the data center gateway switch holds a 108bit prefix, which
> > can be used to augment all the addresses in the data center to 128bit
> > IPv6 addresses. A small TCAM or a small direct index table for lookups
> > is enough at each level. (details can be found at
> > https://www.researchgate.net/profile/Haoyu-Song/publication/347085487_Adaptive_Addresses_for_Next_Generation_IP_Protocol_in_Hierarchical_Networks/links/6070858da6fdcc5f77948ec2/Adaptive-Addresses-for-Next-Generation-IP-Protocol-in-Hierarchical-Networks.pdf)
> >
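The prefix augmentation in the example above can be sketched roughly as follows, using the same numbers (10 bit node address, 10 bit cluster prefix, 108 bit data center prefix). The function names and the sample prefix `2001:db8::` are hypothetical and chosen only for illustration; this is not the encoding from the paper or any draft.

```python
import ipaddress

NODE_BITS = 10      # node address inside a cluster (up to 1K nodes)
CLUSTER_BITS = 10   # cluster prefix kept at the cluster gateway (up to 1K clusters)
DC_BITS = 108       # data center prefix kept at the DC gateway
LOCAL_BITS = NODE_BITS + CLUSTER_BITS


def augment_to_dc_scope(cluster_prefix: int, node_addr: int) -> int:
    """Cluster gateway step: prepend the cluster prefix to a node address."""
    assert 0 <= node_addr < (1 << NODE_BITS)
    assert 0 <= cluster_prefix < (1 << CLUSTER_BITS)
    return (cluster_prefix << NODE_BITS) | node_addr


def augment_to_ipv6(dc_prefix: int, dc_local_addr: int) -> ipaddress.IPv6Address:
    """DC gateway step: prepend the 108-bit prefix to get a 128-bit address."""
    assert 0 <= dc_prefix < (1 << DC_BITS)
    assert 0 <= dc_local_addr < (1 << LOCAL_BITS)
    return ipaddress.IPv6Address((dc_prefix << LOCAL_BITS) | dc_local_addr)


# Hypothetical values: the top 108 bits of a documentation prefix.
dc_prefix = int(ipaddress.IPv6Address("2001:db8::")) >> LOCAL_BITS
local = augment_to_dc_scope(cluster_prefix=3, node_addr=5)
full = augment_to_ipv6(dc_prefix, local)
```

Inside the cluster only the 10 bit value is carried; each gateway adds its own prefix on the way out.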
> >
> > > On the other hand, supporting different lengths has many advantages:
it can scale with the cluster size without any waste, it supports
communication between clusters with different sizes, it doesn't need to
respin the chips in case the network scale changes, and the same standard
would be applied to any scenarios as laid out in our paper "Adaptive
Addresses for Next Generation IP Protocol in Hierarchical
Networks"(ICNP2020). Of course, there's a tradeoff on how fine the address
length step should be supported (e.g., 1 bit, 2 bits, 4 bits, or 8 bits).
This is subject to further study.
> >
> > Waste is relative and we get diminishing returns in both directions.
> > For instance, if we halve an IPv6 address then we save sixteen bytes
> > per packet. That's significant. But if we halve the sixteen bit
> > addresses of SUNH we'd save a whole two bytes per packet and that's
> > nothing to write home about. On the other hand, suppose we double the
> > sixteen bit addresses of SUNH then we have addresses the same size as
> > IPv4 addresses. Granted, IPv4 header has some other stuff in the
header, but at some point it starts to be a question of why not just use
IPv4?
> >
> > [HS] The benefits of the hierarchical addressing are twofold:
flexibility and the compatibility to IPv6. We make each node IPv6
accessible but internally, it avoids the IPv6 header overhead. We see that
in AI data center, the supernode size ranges from 8 to 1K. I don't know how
large the size can become in the future. A fine granularity can minimize
the waste yet be ready to scale to any size.
> >
> >
> >
> > Haoyu,
> >
> >
> >
> > As I said, you get diminishing returns in finer granularity. For
> > instance, if you halve sixteen bit addresses then the savings is only
> > two bytes per packet. That's not even 1% of a 256 byte packet. I think
> > it's going to be hard to justify those minuscule savings against the
> > complexity of supporting variable length addresses and headers.
> >
> >
> >
> > [HS] I think we can assume the minimum packet size is 64 bytes. And I’m
not arguing that we must have very fine length granularity. That can be
determined as a tradeoff. The reason to support variable length address is
to support inter-cluster communication while maintaining the header
efficiency.
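
The two packet sizes mentioned in this exchange give quite different relative savings, which a few lines of arithmetic make visible. This is only a sketch of the calculation; the two byte figure assumes one byte saved on each of the source and destination addresses, and the function name is made up for the example.

```python
def savings_fraction(bytes_saved: int, packet_size: int) -> float:
    """Fraction of the packet saved by removing bytes_saved header bytes."""
    return bytes_saved / packet_size


# Halving both 16-bit addresses (source and destination) saves 2 bytes.
BYTES_SAVED = 2

for packet_size in (64, 256):
    fraction = savings_fraction(BYTES_SAVED, packet_size)
    print(f"{packet_size}-byte packet: {fraction:.4f} of the packet saved")
```

For a 256 byte packet the saving is under 1%, as stated above; for a 64 byte packet it is a little over 3%.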
> >
> >
> >
> >
> >
> >
> >
> > As for the granularity, my strong preference is to first keep addresses
in units of eight bits. It's unpleasant for a lot of processors to deal
with anything smaller and is a long held convention in IP addresses, port
numbers, and Ethernet addresses. I'd also prefer maintaining four byte
alignment of the transport layer like IPv4 and IPv6, but I suppose
alignment is mostly historical at this time so maybe it's not super
critical.
> >
> > With all this in mind, if there were to be another address size in
addition to 16 bits, I might opt for 24 bits. It breaks four byte
alignment, but has the nice property that it maps 10/8 IPv4 addresses.
> > SUNH with 24 bit addresses is 10 bytes, compared to 20 bytes IPv4
header. I suppose that might be worth it.
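
One way the 10/8 mapping could work is to drop the fixed leading octet and carry only the low 24 bits. The sketch below assumes exactly that; the function names are hypothetical and the draft does not define this mapping.

```python
import ipaddress

TEN_SLASH_EIGHT = ipaddress.ip_network("10.0.0.0/8")


def ipv4_to_addr24(addr: str) -> int:
    """Map a 10/8 IPv4 address to a 24-bit value by dropping the first octet."""
    ip = ipaddress.IPv4Address(addr)
    if ip not in TEN_SLASH_EIGHT:
        raise ValueError(f"{addr} is not in 10.0.0.0/8")
    return int(ip) & 0xFFFFFF


def addr24_to_ipv4(addr24: int) -> ipaddress.IPv4Address:
    """Inverse mapping: restore the leading octet 10."""
    if not 0 <= addr24 <= 0xFFFFFF:
        raise ValueError("address does not fit in 24 bits")
    return ipaddress.IPv4Address((10 << 24) | addr24)
```

For example, `ipv4_to_addr24("10.1.2.3")` yields `0x010203`, and `addr24_to_ipv4(0x010203)` restores `10.1.2.3`.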
> >
> > >
> > > 2. If we assume the scale up network would take ethernet as the L2
technology, it can be envisioned that the scale up and scale out network
would eventually converge into a single network. Then we would consider
that the L3 should also have a common standard (strictly speaking, if we
only have a separate scale up network, we don't need L3 at all, because an
L2 fabric is enough).
> >
> > Strictly speaking, yes. But people also want network layer
functionality like TOS and Hop Limits so L3 enters the picture and we see
people go down the path to reinventing L3 like AHF does.
> >
> > [HS] Up to now most scale up networks are like full mesh point to point
fabric. If L3 is needed, and Ethernet is used, I think it makes sense to
converge the scale out and scale up network into one. Then we may want to
see any node IPv6 reachable but within the data center, we don't want the
IPv6 overhead. The hierarchical address provides a simple solution.
> >
> > > Thus, the variable size address can support a hierarchical network
naturally mapping to the DCN topology and, more importantly, it allows
seamless connection with the Internet, which runs IPv4/IPv6, so the
inter-DC communication can be supported without any modification to the
public network. I think this is a reason we need an IP-like L3 header which
can translate into IPv4/v6. Note the SUNH proposal supports this already,
the only issue is the 16-bit address is an overkill to the current cluster
size, and the fixed length is not flexible.
> >
> > Like I said, adding different address sizes could just be a matter of
getting different EtherTypes for SUNH. But, I would only want to add
support for more address sizes sparingly.
> >
> > [HS] Using the EtherType based solution, you will have a fixed header
which can only be used for intra-cluster communication. Using hierarchical
addresses, a node A in cluster X can use the same protocol header to
communicate with a node B in cluster Y. In this case, the source address
and the destination address are different in length because the destination
address needs to include B's cluster prefix.
> >
> >
> >
> > The different EtherTypes allow for different sized addresses. I can
imagine at most we ever need four sizes: 1, 2, 3, or 4 byte addresses.
Anything bigger just use IPv6, any odd number of bits just round up to the
nearest byte size. If nodes in two clusters want to talk then a gateway can
map addresses from one cluster to another, which is what anyone would need
to do when connecting domains with different address spaces.
> >
> >
> >
> > [HS] hmmm…how is that achieved? Assume we have two clusters and each
cluster uses 1 byte addresses (i.e., each cluster can have up to 256 nodes).
Now node 0 in cluster A wants to send a packet to node 0 in cluster B. What
does the packet header look like in this case?
>
> Just use 16 bit addresses. Cluster A's addresses are 0x0-0xff and cluster
B's addresses are 0x100-0x1ff. Packet header is just a SUNH header.
>
> Tom
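
The address plan in this answer (cluster A at 0x0-0xff, cluster B at 0x100-0x1ff) amounts to using the high byte as a cluster index and the low byte as the node number. A rough sketch under that reading follows; the dictionary and function names are invented for the example.

```python
# Base of each cluster's range in the shared 16-bit address space,
# taken from the ranges given above.
CLUSTER_BASE = {"A": 0x000, "B": 0x100}
NODES_PER_CLUSTER = 256


def sunh_address(cluster: str, node: int) -> int:
    """16-bit address of a node, given its cluster and node number."""
    if not 0 <= node < NODES_PER_CLUSTER:
        raise ValueError("node number does not fit in 8 bits")
    return CLUSTER_BASE[cluster] + node


def split_address(addr: int) -> tuple[int, int]:
    """Split a 16-bit address into (cluster index, node number)."""
    return addr >> 8, addr & 0xFF


src = sunh_address("A", 0)  # node 0 in cluster A
dst = sunh_address("B", 0)  # node 0 in cluster B
```

Here `src` is 0x0000 and `dst` is 0x0100, so both fit in the fixed 16 bit fields with no prefix added by a gateway.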
>
> [HS] There are potentially two issues: 1. The whole DC scale is limited
by the 16 bit address (up to 64K nodes), which may not be sufficient.

Hi Haoyu,

It's not the whole DC, a DC can have multiple clusters. For intra cluster
communications compressed headers and addresses can be used, for inter
cluster IPv4 or IPv6 can be used.

> 2. For intra-cluster communication, even though 8 bits are enough, full
16-bit addresses still need to be used.

As I said, the savings of just two bytes per packet doesn't justify the
complexity for supporting variable size addresses.

> I think the core issue is whether the switch hardware support for the flexible
addressing is simple and efficient. If that's true, the flexibility can
handle all future scenarios without needing further changes of the protocol
header.

That is the core issue. Flexibility and switch hardware is an oxymoron. All
the hardware engineers I've ever met have endlessly lamented variable
length headers. While flexibility is great on paper, asking hardware
vendors to support every possible address size in their implementation is
going to get significant pushback. Consider that it was a huge effort to
even get support for a second address size when IPv6 came along.

Tom


>
> >
> >
> >
> > Tom
> >
> >
> >
> >
> >
> > Tom
> > >
> > >
> > > Best regards,
> > > Haoyu
> > >
> > > -----Original Message-----
> > > From: dave seddon <[email protected]>
> > > Sent: Friday, January 9, 2026 2:03 PM
> > > To: [email protected]
> > > Subject: [Int-area] Regarding the draft: Scale-Up Network Header
> > > (SUNH)
> > >
> > > G'day Tom and Haoyu,
> > >
> > > I'm trying to join the discussion about "draft: Scale-Up Network
> > > Header (SUNH)", but I just joined the mail list, so I don't know if
> > > posting to the subject line will do it.  ( Apologies if this breaks
> > > threading )
> > >
> > > Drafts:
> > > https://datatracker.ietf.org/doc/draft-herbert-sunh/
> > > https://datatracker.ietf.org/doc/html/draft-song-ship-edge-05
> > >
> > > It seems like the discussion centers on the address length.
> > >
> > > The SUNH "1.1.  Problem statement" is very clear "
> > > 8% overhead in a 256 byte packet, and the forty bytes of IPv6 header
would be about 16% overhead "
> > >
> > > Absolutely minimizing overhead makes sense currently, but for how
long do we expect this to be true?  Tom, since you've been talking to
people who run the largest AI clusters in the world, do you expect this to
hold true for the foreseeable future?
> > >
> > >
> > > Tom - I wonder if draft-herbert-sunh would benefit from a small
summary, maybe with a table, that compares the proposed addressing to other
protocols that are common within data centers?
> > >
> > > For example, comparing protocols by their header, address lengths,
and "overhead"
> > > - PCIe ( IEEE have paywalls, so it's hard to find a good source.
> > > Maybe this:
> > > https://www.pearsonhighered.com/assets/samplechapter/0/3/2/1/0321156307.pdf
> > > )
> > > - Infiniband ( addressing scheme found here on page 625
> > > https://hjemmesider.diku.dk/~vinter/CC/Infinibandchap42.pdf )
> > > - Ethernet
> > > - Ethernet with 802.1q ( and qnq )
> > > - IPv4
> > > - IPv6
> > > - SUNH
> > > ...
> > >
> > > Now that the context is established, explain why 16 bits were chosen
for the source/destination address.  I guess, though it's not in the document,
that you were considering the number of hosts in the domain.
> > >
> > > Nit pick (sorry). "care must be taken to ensure the minimum packet
size is maintained".  Might help to explain why.
> > >
> > > Re section "TCP and UDP in SUNH".  I remember recently Stuart from
Apple saying something pretty interesting about UDP: "If IP had port
numbers, you wouldn't really need a UDP header at all."
> > >
> > > Multicast?  It might be worth mentioning multicast and explaining why
it isn't discussed.  e.g. No requirement for this, or it might be
considered in the future if a need arises.
> > >
> > >
> > >
> > > Haoyu - I really like your draft-song-ship-edge-05 Hierarchical
addressing stuff:
> > > a)
> > > This reminds me of good old Fibre Channel addressing, and I suppose
the more modern Infiniband/RDMA.
> > > b)
> > > The words "variable length" are scary because variability clearly
isn't ideal for hardware.  I guess when you say "variable length" you don't
actually mean the addresses would vary dynamically, but that there could be
a range of set fixed length addressing that could be selected for different
deployment scenarios?
> > > c)
> > > One core concept of draft-song-ship-edge-05, is that traffic destined
for IoT devices needs a long, unique address, while the traffic _sourced_
from these devices towards the data center can have a much smaller
destination address.
> > > I recall Geoff Huston discussing IPv6 at a recent NANOG, where he
commented that because of the pervasive use of anycast by a relatively
small number of CDNs, that the Internet might only need a /24 worth of
addresses for 99% of all traffic.
> > > Other network protocols with asymmetric addresses include:
> > > - PCIe (Requester vs Completer addressing)
> > > - In InfiniBand / RDMA, requests carry full destination addressing
(QPN + LID/GID + path), while responses omit it and are routed implicitly
using the established queue-pair and path state, making the addressing
directionally asymmetric.
> > > - QUIC has explicit directional asymmetry in connection IDs
> > >
> > >
> > > --
> > > Regards,
> > > Dave Seddon
> > >
> > > _______________________________________________
> > > Int-area mailing list -- [email protected] To unsubscribe send an
> > > email to [email protected]
> > >
