As Lincoln said, all of us working directly with BCM and other silicon vendors 
have signed numerous NDAs.
However, if you ask a well-crafted question, there's always a way to talk about 
it ;-)

In general, if we look at the whole spectrum, at one end there are massively 
parallelized "many-core" run-to-completion (RTC) ASICs, such as Trio, Lightspeed, 
and similar (as the last gasp of the Redback/Ericsson venture, we built a 
1400-hardware-thread ASIC, Spider).
At the other end of the spectrum are fixed-pipeline ASICs, with BCM Tomahawk at 
the extreme (max speed/radix, min features), moving through BCM Trident, 
Innovium, Barefoot (quite a different animal with respect to programmability), 
etc. These usually have only a shallow on-chip buffer (on the order of 100-200 MB).

In between we have the so-called programmable-pipeline silicon; BCM DNX and 
Juniper Express are in this category. These are usually a combination of on-chip 
buffer (OCB) plus off-chip memory (most often HBM, 2-6 GB), and usually have 
line-rate, high-scale security and overlay encap/decap capabilities. They also 
usually have highly optimized RTC blocks within the pipeline (RTC within a macro). 
The way (and speed) these parts access databases and memories is evolving with 
each generation, and the number/speed of non-networking cores (usually ARM) keeps 
growing - OAM, INT, and local optimizations are the primary users of those.
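
To make the RTC-vs-pipeline distinction above concrete, here is a deliberately 
toy sketch (plain Python, nothing vendor-specific; the step names, FIB entries, 
and port names are invented for illustration) of the same forwarding work 
organized the two ways:

  FIB  = {"192.0.2.1": "et-0/0/1", "198.51.100.7": "et-0/0/2"}
  OUTQ = {}

  def parse(pkt):   pkt["dst"]    = pkt["hdr"].split(">")[1]
  def lookup(pkt):  pkt["egress"] = FIB.get(pkt["dst"], "discard")
  def rewrite(pkt): pkt["ttl"]    = pkt.get("ttl", 64) - 1
  def enqueue(pkt): OUTQ.setdefault(pkt["egress"], []).append(pkt)

  STEPS = (parse, lookup, rewrite, enqueue)

  # Run-to-completion: one of many parallel HW threads owns the packet and
  # runs the whole program on it, start to finish.
  def rtc(pkt):
      for step in STEPS:
          step(pkt)

  # Pipeline: the same steps laid out as stages; on any given clock each
  # stage works on a *different* packet, so one packet completes per clock
  # even though each packet spends several clocks in flight.
  def pipeline(pkts):
      n, depth = len(pkts), len(STEPS)
      for clock in range(n + depth - 1):
          for s, stage in enumerate(STEPS):
              i = clock - s            # which packet this stage sees now
              if 0 <= i < n:
                  stage(pkts[i])

  rtc({"hdr": "h1>192.0.2.1"})
  pipeline([{"hdr": "h2>198.51.100.7"}, {"hdr": "h3>192.0.2.1"}])

Real silicon obviously overlaps the stages in time rather than looping over 
them, and a programmable-pipeline part makes each stage a small RTC block 
running against its own tables.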

Cheers,
Jeff

> On Jul 25, 2022, at 15:59, Lincoln Dale <l...@interlink.com.au> wrote:
> 
> 
>> On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+na...@gmail.com> 
>> wrote:
> 
>> On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwob...@gmail.com> wrote:
>> > This is the parallelism part.  I can take multiple instances of these 
>> > memory/logic pipelines, and run them in parallel to increase the 
>> > throughput.
>> ...
>> > I work on/with a chip that can forward about 10B packets per second… so 
>> > if we go back to the order-of-magnitude number that I’m doing about “tens” 
>> > of memory lookups for every one of those packets, we’re talking about 
>> > something like a hundred BILLION total memory lookups… and since memory 
>> > does NOT give me answers in 1 picosecond… we get back to pipelining and 
>> > parallelism.
>> 
>> What level of parallelism is required to forward 10Bpps? Or 2Bpps like
>> my J2 example :)
> 
> I suspect many folks know the exact answer for J2, but the specific answer for 
> a given device is likely under NDA.
> 
> Without being platform- or device-specific: the core clock rate of many 
> network devices is often in a "goldilocks" zone of (today) 1 to 1.5 GHz, with a 
> goal of 1 packet forwarded per clock. As LJ described, the pipeline 
> doesn't mean a latency of 1 clock ingress-to-egress, but rather that every 
> clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS 
> packet rate is achieved by having enough pipelines in parallel to achieve 
> that.
> The packets-per-clock number here is often "1" or "0.5", so you can work the 
> number backwards (e.g. it emits a packet every clock, or every 2nd clock).
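> 
> As a rough worked example (the clock rate and packets-per-clock below are 
> purely illustrative, not any specific device):
> 
>   # pipelines needed = target pps / (core clock * packets per clock per pipeline)
>   clock_hz       = 1.25e9   # assume a 1.25 GHz core clock
>   pkts_per_clock = 1.0      # one forwarding decision per clock per pipeline
>   per_pipeline   = clock_hz * pkts_per_clock      # 1.25 Bpps per pipeline
>   print(2e9 / per_pipeline)    # 1.6 -> round up to 2 pipelines for 2 Bpps
>   print(10e9 / per_pipeline)   # 8.0 -> ~8 pipelines for 10 Bpps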
> 
> It's possible to build an ASIC/NPU to run at a faster clock rate, but that gets 
> back to what I'm hand-wavingly describing as "goldilocks". Look up power vs 
> frequency and you'll see it's non-linear.
> Just as CPUs can scale by adding more cores (vs increasing frequency), the same 
> holds true for network silicon: you can go wider, with multiple pipelines. But 
> it's not 10K parallel slices; there are some parallel parts, but there are 
> multiple 'stages' on each doing different things.
> 
> Using your CPU comparison, there are some analogies here that do work:
>  - you have multiple cpu cores that can do things in parallel -- analogous to 
> pipelines
>  - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some 
> DRAM or LLC)  -- maybe some lookup engines, or centralized buffer/memory
>  - most modern CPUs are out-of-order execution, where, under the covers, a 
> cache miss or DRAM fetch has a disproportionate hit on performance, so it's 
> hidden away from you as much as possible by speculative, out-of-order execution
>     -- no direct analogy to this one - it's unlikely most forwarding 
> pipelines do speculative execution like a general-purpose CPU does - but they 
> definitely do 'other work' while waiting for a lookup to happen
> 
> A common-or-garden x86 is unlikely to achieve such a rate for a few different 
> reasons:
>  - if packets-in or packets-out go via DRAM, then you need sufficient DRAM (page 
> opens/sec, DRAM bandwidth) to sustain at least one write and one read per 
> packet. Look closely at DRAM and its actual speed; pay attention to page 
> opens/sec, and what that consumes.
>  - one 'trick' is to not DMA packets to DRAM but instead have them go into SRAM 
> of some form - e.g. Intel DDIO, ARM Cache Stashing - which at least 
> potentially saves you that DRAM write+read per packet
>   - ... but then do e.g. an LPM lookup, and best case that is back to a memory 
> access per packet. Maybe it's in L1/L2/L3 cache, but at large table sizes it 
> likely isn't.
>  - ... do more things to the packet (uRPF lookups, counters) and it's yet 
> more lookups.
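> 
> To put rough numbers on that (purely illustrative - packet sizes, table sizes, 
> and DDR behaviour vary a lot by platform):
> 
>   # back-of-envelope DRAM work for a software forwarder at the 2 Bpps above
>   pps       = 2e9          # illustrative target (the J2-class number)
>   pkt_bytes = 64           # worst case, small packets
>   # one write + one read of every packet if packets transit DRAM:
>   print(2 * pps * pkt_bytes / 1e9, "GB/s just to move packets")   # 256.0
>   # plus per-packet table work (LPM, uRPF, counters) - call it 5 accesses:
>   print(5 * pps / 1e9, "billion table accesses/sec")              # 10.0
> 
> Every table access that misses the caches turns into another DRAM access, which 
> is why the DDIO/cache-stashing trick and cache behaviour at large table sizes 
> matter so much.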
> 
> Software can achieve high rates, but note that a typical ASIC/NPU does on the 
> order of >100 separate lookups per packet, and 100 counter updates per packet.
> Just as forwarding in an ASIC or NPU is a series of tradeoffs, forwarding in 
> software on generic CPUs is also a series of tradeoffs.
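> 
> To put a number on that gap (illustrative arithmetic only):
> 
>   # ~100 lookups plus ~100 counter updates per packet, at 10 Bpps:
>   print((100 + 100) * 10e9)   # about 2 trillion (2e12) memory accesses/sec
> 
> That sustained access rate is what all the on-chip parallelism, pipelining, and 
> specialized lookup memories exist to deliver.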
> 
> 
> cheers,
> 
> lincoln.
> 
