On Thu, Dec 20, 2012 at 3:17 AM, Hal Murray <hmur...@megapathdsl.net> wrote:
>
> If I was going to do something like that, I'd build a small/simple CPU and do
> the work in microcode.

There are two ppc 440 cpus already onboard the 10GigE device, I think.
It's a REALLY NICE fpga.

http://netfpga.org/10G_specs.html

http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf

If we really wanted to get a jump on the high end:

http://www.hitechglobal.com/boards/100gig.htm

>
>> implementing {n,e,s}fq_codel onboard looks very feasible
>
> How many lines of assembler code would it take?

I could do a dump of the current code into any given assembly language.
It's not a lot, but there are a lot of out of band functions.

> How many registers do you need?  Do you need any memory other than queues?
> Maybe counters?

The total overhead for fq_codel is presently 1024*64 bytes for 1024
flows, and 4-8k of pointer overhead (32 or 64 bit).

I would argue for such a device to hash to 64k flows, or heck, higher.
And the per-flow overhead can be reduced a lot in a dedicated device.

What of that needs to be on-board the fpga versus off-board is a
fairly good question. The sfq/codel queue management stuff sits nicely
in parallel with getting the packets, so that's an obvious second
bus/cache arch...

>> The only thing that is seriously serial about fq_codel is shooting the
>> biggest flow when the queue limit is exceeded, and that could be made
>> embarrassingly parallel with enough gates. There are no doubt other tricky
>> issues.
>
> Would it be better to do the fq work in the main CPU and let the FPGA grab

Well, there are a few things that would benefit from moving directly
into hardware - the 5 tuple hash, for example.

> packets from some shared data structure in memory?

The problem that I would like to beat is that TSO/GSO seem to be
necessary on the host processor to reduce the interrupt count to
sanity at 10GigE. A goal here would be to allow for TSO generation
(and GRO receive) to hand off to the board, but for the board to
interleave and aqm packets from there to the wire.
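
A rough sketch of the hand-off described above, in Python for brevity. It assumes the host passes the board one TSO burst per flow, and that the board cuts each burst into MSS-sized segments and sends one segment per flow in turn. The names `segment` and `interleave`, the 1448-byte MSS, and the plain round-robin are all assumptions for illustration; fq_codel itself uses a byte-based deficit round robin plus CoDel dropping, neither of which is shown here.

```python
from collections import deque

# Assumed TCP payload per packet on a 1500-byte MTU link with timestamps.
MSS = 1448


def segment(burst_len, mss=MSS):
    """Cut one TSO burst of burst_len bytes into per-packet payload sizes."""
    full, rest = divmod(burst_len, mss)
    return [mss] * full + ([rest] if rest else [])


def interleave(bursts, mss=MSS):
    """Round-robin one segment at a time across flows.

    bursts maps a (hypothetical) flow id to the burst length in bytes.
    Returns the wire order as a list of (flow_id, segment_len) pairs.
    """
    queues = {flow: deque(segment(n, mss)) for flow, n in bursts.items()}
    # Only flows that actually have something queued take part.
    active = deque(flow for flow in queues if queues[flow])
    wire = []
    while active:
        flow = active.popleft()
        wire.append((flow, queues[flow].popleft()))
        if queues[flow]:
            # Flow still has segments: send it to the back of the line.
            active.append(flow)
    return wire


if __name__ == "__main__":
    for flow, size in interleave({"a": 3000, "b": 1000}):
        print(flow, size)
```

With bursts of 3000 and 1000 bytes, the short flow's single packet goes out second rather than waiting behind the whole 3000-byte burst, which is the behaviour a plain FIFO tx ring fed with TSO bursts cannot give you.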

Rather than a tx descriptor ring you'd have a tx descriptor list and tx
completion ring so that you could send streams out of order.

> Can you work out a
> memory structure that doesn't need locks?

The enqueue and dequeue algorithms are entirely decoupled, with the
exception of the error-handling phase (out of queue space).

One thought would be to track packet count on enqueue (this is more
"sfq"-like than fq_codel-like), which still has a tiny lock... :grumble:

> --
> These are my opinions.  I hate spam.

-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html
_______________________________________________
Codel mailing list
Codel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/codel