I don't think I can add much here to the FP and Trio specific questions, for 
obvious reasons... but ultimately it comes down to a set of tradeoffs where 
some of the big concerns are things like "how do I get the forwarding state I 
need back and forth to the things doing the processing work"  -- that's an 
insane level of oversimplification, as a huge amount of engineering time goes into
those choices.

I think the "revolutionary-ness" (to vocabulate a useful word?) of putting 
multiple cores or whatever onto a single package is somewhat in the eye of the 
beholder.  The vast majority of customers would never know nor care whether a 
chip on the inside was implemented as two parallel "cores" or whether it was 
just one bigger "core" that does twice the amount of work in the same time.  
But to the silicon designer, and to a somewhat lesser extent the people writing 
the forwarding and associated chip-management code, it's definitely a big big 
deal.  Also, having the ability to put two cores down on a given chip opens the 
door to eventually doing MORE than two cores, and if you really stretch your 
brain you get to where you might be able to put down "N" pipelines.

This is the story of integration: back in the day we built systems where 
everything was forwarded on a single CPU.  From a performance standpoint all we 
cared about was the clock rate and how much work was required to forward a 
packet.  Divide the second number by the first, and you get your answer.  In 
In the late 90s we built systems (the 7500, for me) that were distributed, so now
we had a bunch of CPUs on linecards running that code.  Horizontal scaling -- 
sort of.

In the early 2000s the GSR came along and now we're doing forwarding in 
hardware, which is an order of magnitude or two faster, but a whole bunch of 
features are
now too complex to do in hardware, so they go over the side and people have to 
adapt.  To the best of my knowledge, TCP intercept has never come back...

For a while, GSR and CRS type systems had linecards where each card had a bunch 
of chips that together built the forwarding pipeline.  You had chips for the 
L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing 
math, and chips for the fabric interfaces.  Over time, we integrated more and 
more of these things together until you (more or less) had a linecard where 
everything was done on one or two chips, instead of a half dozen or more.  Once 
we got here, the next step was to build linecards where you actually had 
multiple independent things doing forwarding -- on the ASR9k we called these 
"slices".  This again multiplies the performance you can get, but now both the 
software and the operators have to deal with the complexity of having multiple 
things running code where you used to only have one.

Now let's jump into the 2010s, where silicon integration allows you to put 
down multiple cores or pipelines on a single chip; each of these is now (more 
or less) its own forwarding entity.  So now you've got yet ANOTHER layer of 
abstraction.  If I
can attempt to draw out the tree, it looks like this now:
1) you have a chassis or a system, which has a bunch of linecards.
2) each of those linecards has a bunch of NPUs/ASICs
3) each of those NPUs has a bunch of cores/pipelines

And all of this stuff has to be managed and tracked by the software.  If I've 
got a system with 16 linecards, and each of those has 4 NPUs, and each of THOSE 
has 4 cores - I've got over *two hundred and fifty* separate things forwarding 
packets at the same time.  Now a lot of the info they're using is common (the 
FIB is probably the same for all these entities...) but some of it is NOT.  
There's no value in wasting memory for the encapsulation data to host XXX if I 
know that none of the ports on my given NPU/core are going to talk to that 
host, right?  So - figuring out how to manage the *state locality* becomes 
super important.  And yes, this code breaks like all code, but no one has 
figured out any better way to scale up the performance.  If you have a 
brilliant idea here that will get me the performance of 250+ things running in 
parallel but the simplicity of it looking and acting like a single thing to the 
rest of the world, please find an angel investor and we'll get phenomenally 
rich together.
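
If it helps to see that bookkeeping in code, here's a toy sketch (all names 
and numbers invented, no resemblance to any real platform):

    # Toy model: counting forwarding entities and splitting shared
    # state from local state.  Purely illustrative.
    LINECARDS, NPUS_PER_LC, CORES_PER_NPU = 16, 4, 4
    entities = LINECARDS * NPUS_PER_LC * CORES_PER_NPU
    print(f"{entities} things forwarding packets")   # -> 256

    # Shared state: the FIB is (roughly) the same everywhere.
    fib = {"203.0.113.0/24": "fabric-dest-42"}

    # Local state: only download a host's encap/rewrite to the NPUs
    # whose ports can actually reach that host.
    encaps = {"203.0.113.7": "rewrite-00:11:22:33:44:55"}

    def state_for(npu_reachable_hosts):
        return {"fib": fib,
                "encap": {h: e for h, e in encaps.items()
                          if h in npu_reachable_hosts}}

    print(state_for({"203.0.113.7"}))   # this NPU carries the rewrite
    print(state_for(set()))             # this one saves the memory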


--lj

-----Original Message-----
From: Saku Ytti <s...@ytti.fi> 
Sent: Saturday, August 6, 2022 1:38 AM
To: ljwob...@gmail.com
Cc: Jeff Tantsura <jefftant.i...@gmail.com>; NANOG <nanog@nanog.org>; Jeff 
Doyle <jdo...@juniper.net>
Subject: Re: 400G forwarding - how does it work?

On Fri, 5 Aug 2022 at 20:31, <ljwob...@gmail.com> wrote:

Hey LJ,

> Disclaimer:  I work for Cisco on a bunch of silicon.  I'm not intimately 
> familiar with any of these devices, but I'm familiar with the high level 
> tradeoffs.  There are also exceptions to almost EVERYTHING I'm about to say, 
> especially once you get into the second- and third-order implementation 
> details.  Your mileage will vary...   ;-)

I expect it may come to this; my question may be too specific to be answered 
without violating some NDA.

> If you have a model where one core/block does ALL of the processing, you 
> generally benefit from lower latency, simpler programming, etc.  A major 
> downside is that to do this, all of these cores have to have access to all of 
> the different memories used to forward said packet.  Conversely, if you break 
> up the processing into stages, you can only connect the FIB lookup memory to 
> the cores that are going to be doing the FIB lookup, and only connect the 
> encap memories to the cores/blocks that are doing the encapsulation work.  
> Those interconnects take up silicon space, which equates to higher cost and 
> power.

While an interesting answer -- that is, the statement that the cost of giving 
all cores access to memory versus having a more-complex-to-program pipeline of 
cores is a balanced tradeoff -- I don't think it applies to my specific 
question, though it may apply to generic ones. We can roughly think of FP as 
having a similar number of lines as Trio has PPEs; therefore a similar number 
of cores need access to memory, and possibly a higher number, as more than one 
core in a line will need memory access.
So the question is more: why many less performant cores, where performance is 
achieved by building a pipeline, rather than fewer, more performant cores, 
where an individual core works on a packet to completion -- given that the 
former has a similar number of core lines as the latter has cores?
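
To make the shape of my question concrete, a toy throughput model (all 
numbers invented):

    # Model A: N lines, each a pipeline of S weak stage-cores.
    # Model B: N strong run-to-completion cores, each S times faster.
    WORK = 1000               # cycles of forwarding work per packet
    N, S = 16, 8              # lines/cores, pipeline depth

    # A full pipeline finishes a packet every WORK/S cycles per line.
    pps_pipeline = N * S / WORK
    # A core S times faster also needs WORK/S cycles per packet.
    pps_run_to_completion = N / (WORK / S)
    assert pps_pipeline == pps_run_to_completion

On paper the throughput comes out the same, so the interesting differences 
must be elsewhere: memory connectivity, latency, and how hard the thing is 
to program.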

> Packaging two cores on a single device is beneficial in that you only 
> have one physical chip to work with instead of two.  This often 
> simplifies the board designers' job, and is often lower power than two 
> separate chips.  This starts to break down as you get to exceptionally 
> large chips as you bump into the various physical/reticle limitations 
> of how large a chip you can actually build.  With newer packaging 
> technology (2.5D chips, HBM and similar memories, chiplets down the 
> road, etc) this becomes even more complicated, but the answer to "why 
> would you put two XYZs on a package?" is that it's just cheaper and 
> lower power from a system standpoint (and often also from a pure 
> silicon standpoint...)

Thank you for this; it does confirm that the benefits perhaps aren't as 
revolutionary as the presentation in this thread proposed. The presentation 
divided Trio evolution into three phases, and multiple Trios on a package was 
presented as one of those big evolutions; perhaps some other division of 
generations would have been more communicative.

> Lots and lots of Smart People Time has gone into different memory designs 
> that attempt to optimize this problem, and it's a major part of the 
> intellectual property of various chip designs.

I choose to read this as 'where a lot of innovation happens, a lot of mistakes 
happen'. Hopefully we'll figure out a good answer here soon, as the answers 
vendors are ending up with are becoming increasingly visible compromises in the 
field. I suspect a large part of this is that cloudy shops represent, if not 
disproportionate revenue, disproportionate focus, and their networks tend to be 
a lot more static in config and traffic than access/SP networks. And when you 
have that quality, you can make increasingly broad assumptions, assumptions 
which don't play as well in SP networks.

--
  ++ytti
