Short version:  Limits on using PC hardware for router FIBs.
                Fast SRAM is needed, not DRAM.

                                       DDR3 DRAM     QDR II+ SRAM

                Access time            13 nsec       0.5 nsec

                Cycle time            ~20? nsec      3.0 nsec

                Read/Write cycles     ~50?           333
                per microsecond

                (All in support of what Tony wrote about DFZ
                growth outpacing DRAM speed improvements.)

                Router RIBs with modern CPUs.
                Fundamental problems with too many prefixes.
                We need a solution to be fully deployed within
                about a decade.


Hi Christopher,

You wrote, in part:

> One point Tony made was that soon, perhaps, you won't be able to make
> a lookup across the memory device holding the FIB fast enough to
> service a packet on the fastest known interfaces today, presuming the
> FIB grows in memory at some set rate (look at the graphs Vince Fuller
> or Geoff Huston have for approximations of the rates).

While a COTS PC costing $2k or less can have plenty of RAM for
holding a software based FIB, its packet forwarding capacity is
limited by:

1 - Physical interface speeds, including PCI bus speed.

2 - The CPU and memory bandwidth required to implement the packet
    input and output processing.

3 - The CPU and memory bandwidth required to classify each incoming
    packet.

The last point is a major bottleneck for any PC-based approach.  PC
DRAM (Dynamic RAM) has a long access time, but then pumps out a block
of data to fill the CPU's cache.  It takes two clock cycles to
clock in the multiplexed address (row, then column); then the sense
amps swing into action and data becomes available 20 to 40 nsec
later.  Then the sense amps need to
recharge their reference capacitors before the array can be read
again.  (My understanding, which may be out of date, is that DRAM
sense amps compare a fully charged half-size capacitor with the
capacitance of a memory cell, which is a full-sized capacitor either
fully charged or not charged at all.)

Bill referred to higher data rates for DRAM: "400mhz to I think
1600mhz today".  That is the rate at which it pumps out the data
after the initial delays of multiplexing in the address and waiting
for the sense amps to settle.

Here are some recent DRAM spec sheets:

DDR3 DRAM 800 to 1600 Mbps per pin (!):

http://www.samsung.com/global/business/semiconductor/products/dram/Products_DDR3SDRAM.html

http://www.samsung.com/global/business/semiconductor/support/brochures/downloads/memory/ddr3_datasheet_200807.pdf

"Latency time of only 13 nsec".  13 nsec is way too long.  This
series goes up to 4G bits per chip.

http://www.samsung.com/global/business/semiconductor/products/dram/downloads/ddr3_device_operation_timing_diagram_may_08.pdf

68 pages of timing diagrams.  These are complex beasts.  I don't have
time to decipher all this stuff.

Suffice to say that a FIB needs to perform multiple short reads
from various locations in RAM in order to process each packet.
(The exception is IPv4, where a single lookup into a 16M-word
memory suffices, assuming a /24 prefix-length limit for most
packets.)

For longer IPv6 addresses, the FIB needs to chew its way through the
bits until it arrives at the FIB entry with the Forwarding
Equivalence Class, which is primarily which interface the packet
should be forwarded on, plus which output queue of that interface to
put the packet into.
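A minimal sketch of that bit-by-bit traversal, as a binary trie with longest-prefix match.  Each step down the trie is a *dependent* memory read, which is exactly why DRAM latency (not bandwidth) is the bottleneck.  The node structure and the example prefix are illustrative assumptions:

```python
# Sketch: longest-prefix match via a binary trie.  Each step is a
# dependent memory read; the chain of reads per packet is what
# makes DRAM latency the limiting factor.
# (Illustrative structure, not any particular router's layout.)

class Node:
    __slots__ = ("children", "fec")
    def __init__(self):
        self.children = [None, None]
        self.fec = None          # FEC: (interface, queue) if set

root = Node()

def insert(prefix_bits, fec):
    node = root
    for b in prefix_bits:
        if node.children[b] is None:
            node.children[b] = Node()
        node = node.children[b]
    node.fec = fec

def lookup(addr_bits):
    node, best = root, None
    for b in addr_bits:          # one dependent read per bit
        if node.fec is not None:
            best = node.fec      # remember longest match so far
        node = node.children[b]
        if node is None:
            break
    else:
        if node.fec is not None:
            best = node.fec
    return best

insert([0, 0, 1], ("if0", 2))    # hypothetical prefix -> (interface, queue)
```

Real FIBs use multi-bit strides (e.g. Tree Bitmap) to cut the number of reads, but the reads remain dependent: each one must complete before the next address is known.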

The delay time in getting those bytes from RAM is crucial.  It
doesn't help that these SDRAMs, which are made for filling CPU
caches, can pump out further data at some extraordinarily high speed.

The best type of RAM would be QDR II+ fast SRAM:
 
http://www.samsung.com/global/business/semiconductor/products/sram/Products_HighSpeedSRAM.html

These go up to 72M bits per chip.  The access time (address in to
data out) is as low as 0.5 nsec (26 times faster than the 13 nsec
noted above for SDRAM) and the cycle time is 3 nsec, so they can do
333 reads or writes per microsecond.  SDRAM could probably only do
about 50 such cycles per microsecond.
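The back-of-envelope arithmetic: the random-read rate divided by the number of dependent reads per lookup sets the forwarding ceiling.  The cycle figures are from the spec sheets above; the four-reads-per-lookup figure is an assumed illustrative value:

```python
# Back-of-envelope: random-read rate sets the forwarding ceiling.
# Reads-per-lookup is an assumed value for illustration.

def max_pps(reads_per_usec, reads_per_lookup):
    """Packets/sec if every packet needs N dependent reads."""
    return reads_per_usec * 1_000_000 // reads_per_lookup

sram_pps = max_pps(333, 4)   # QDR II+ SRAM, 3 nsec cycle
dram_pps = max_pps(50, 4)    # DDR3 DRAM, ~20 nsec random cycle
```

On those assumptions the SRAM path sustains roughly 83M packets/sec and the DRAM path roughly 12.5M, before counting any of the other per-packet work.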


To some extent, CPU caching might help with a FIB function, since
the data needed for a packet may already be cached from a recently
processed packet with the same destination address.  However, CPU
caches can only hold a limited number of blocks of data.  In a busy
router, there will surely be a greater diversity of destination
addresses than could ever be handled with cached data in the CPU.
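A toy demonstration of why: once the working set of destinations exceeds the cache, nearly every lookup falls through to slow memory.  Here `lru_cache` stands in for the CPU cache; the sizes are illustrative assumptions:

```python
# Toy demo: a destination cache only helps while the working set
# fits.  functools.lru_cache stands in for the CPU cache here;
# cache size and traffic mix are illustrative assumptions.
from functools import lru_cache

misses = 0

@lru_cache(maxsize=1024)             # "cache" of 1024 recent dests
def cached_lookup(dst):
    global misses
    misses += 1                      # body runs only on a cache miss
    return dst >> 8                  # stand-in for the slow DRAM read

# 10,000 distinct destinations, far more than the cache holds:
# with no repeats inside the cache window, every lookup misses.
for d in range(10_000):
    cached_lookup(d)
```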

I am thinking mainly about IPv6 addresses.  If it is purely IPv4,
with /24 length limits, then a FIB is pretty easy to implement in
specialised hardware with 16M words of fast static RAM, with a
single read cycle per packet (333 reads per microsecond).

Generally, it is onerous and slow to have to do multiple RAM lookups
in PC DRAM, as is required to navigate the FIB data in order to
process a single packet.

What is needed is very fast Static RAM: apply the address and then
get the data a few nanoseconds later.  There is no recovery time,
since the sense amps have no capacitors.

SRAM cells involve 6 transistors and are correspondingly less dense
per chip, more expensive and more power hungry than DRAM.  They also
have more pins per chip, since the address is presented in parallel.


I am not sure how practical it would be to string a bunch of PCs or
the like in parallel to form a higher capacity router.  To some
extent a PC could split up the incoming packets to multiple FIB PCs,
and combine their outputs, one such PC for every link to a neighbour.
However, that would mean half a dozen or dozens of PCs all strung
together in a fragile arrangement.

Except for low data rates, I think DFZ routers are always going to be
"big-iron" (high reliability, not based on consumer electronics - and
priced like mainframe computers) devices.  For the highest data
rates, there will always need to be FIB heroics such as Cisco's
USD$80k MSC board for the CRS-1:

  http://www.firstpr.com.au/ip/sram-ip-forwarding/router-fib/

The FIB of the MSC board is based on a single ASIC with 188 32-bit
CPUs (250 MHz), somehow sharing a large amount of (I assume) fast
SRAM, to implement the Tree Bitmap algorithm.  It dissipates as much
as 375 watts and handles 40Gbps, full duplex.

Implementing large ACLs in software, such as in a PC-based router, is
probably going to be prohibitively slow.  The CRS-1 MSC board uses
TCAM for that.
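To show why software ACLs scale so badly: a TCAM compares a packet against every (value, mask) entry in parallel in one cycle, while software must scan the entries one by one.  A minimal sketch, with illustrative entries of my own invention:

```python
# Sketch: TCAM-style ACL matching.  Each entry is a (value, mask)
# pair; hardware checks all entries in parallel in one cycle,
# software must scan them -- O(n) per packet.
# (Entries below are illustrative assumptions.)

acl = [
    (0x0A000000, 0xFF000000, "deny"),    # e.g. deny 10.0.0.0/8
    (0x00000000, 0x00000000, "permit"),  # match-all default
]

def classify(addr):
    """First matching entry wins, as in a real TCAM."""
    for value, mask, action in acl:
        if addr & mask == value:
            return action
```

With thousands of ACL entries, that linear scan (or even a cleverer software structure) must run for every packet, on top of the FIB lookup.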


I guess it would be possible to build a very high performance RIB
with modern PC technology: 4- and 8-core 64-bit CPUs from Intel and
AMD, with 32 GB or more of ECC DRAM.  This could probably be done in
a COTS PC, or in a suitably redesigned "big-iron" router.

Maybe a high capacity RIB (for 100 million prefixes and a dozen or
more neighbours) could be done with multiple CPU-RAM systems sharing
the load, but this sounds like a nightmare.

Even with a RIB of arbitrary capacity and great speed, ramping up
the number of prefixes to 10 million, 100 million or whatever becomes
objectionable since even if all DFZ routers could cope with these
numbers of prefixes, the flurry of BGP messages during an outage
would become a problem.  Any one outage is likely to affect some
proportion of the total number of prefixes, and if we have 1000 times
more prefixes, such as 250 million, this is 1000 times more BGP
messages per outage than today.  Surely the delays inherent in
sending and processing these bursts of messages will lead to slower
convergence times and greater chance of instabilities.
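The scaling argument in numbers: if any one outage affects a roughly fixed fraction of prefixes, BGP churn grows linearly with table size.  Today's ~250k DFZ prefixes are from the text; the 1% outage fraction is an assumed illustrative figure:

```python
# Rough scaling: churn per outage grows linearly with table size.
# The 1% affected fraction is an illustrative assumption.

def churn_msgs(prefixes, affected_fraction):
    return int(prefixes * affected_fraction)

today  = churn_msgs(250_000, 0.01)       # ~2,500 messages per outage
future = churn_msgs(250_000_000, 0.01)   # 1000x the prefixes ...
```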

Maybe this won't be such a problem with faster links between
routers, but the whole idea of simply throwing expensive hardware at
the problem, for every DFZ router, in order to accommodate the
required number of multihomed end-user networks (100 million, a
billion ..?) is ugly and unscalable.

So we need to add a new architectural element to the Net.

My guess is we need to get this fully operational by 2016 to 2018 or
so, which means finalising the design and making it available by 2015
or so - and by making it so attractive to end-user networks, ISPs and
the companies who will run the mapping distribution system, OITRDs
etc. that it will be rapidly adopted by the end of the next decade.

  - Robin              http://www.firstpr.com.au/ip/ivip/



_______________________________________________
rrg mailing list
[email protected]
https://www.irtf.org/mailman/listinfo/rrg