Hi all,

In an experiment I conducted about one week ago, I found that ifnet serializer contention (between if_output and the NIC's txeof) had a negative effect on network forwarding performance, so I created the following patch:
http://leaf.dragonflybsd.org/~sephe/ifq_spinlock.diff3

The ideas behind this patch are:
1) altq is protected by its own spinlock instead of the ifnet serializer.
2) The ifnet serializer is pushed down into each ifnet.if_output implementation, i.e. if_output is called without the ifnet serializer being held.
3) ifnet.if_start is dispatched to the CPU where the NIC's interrupt is handled or where polling(4) is going to happen.
4) ifnet.if_start's IPI sending is avoided as much as possible, using the same mechanism that avoids real IPI sending in the ipiq implementation.

I considered dispatching the outgoing mbuf to the NIC's interrupt/polling CPU to do the enqueue and if_start, thus avoiding the spinlock, but upper layers, like TCP, process the ifq_handoff() error (e.g. ENOBUFS); dispatching the outgoing mbuf would break the original semantics. However, the only source of error in ifq_handoff() is ifq_enqueue(), i.e. only ifq_enqueue() must be called directly on the output path. A spinlock is chosen to protect ifnet.if_snd, so that ifq_enqueue() and ifnet.if_start() can be decoupled.

'1)' has one implication: the ifq_poll->driver_encap->ifq_dequeue sequence no longer works, but I think drivers can easily be converted to do ifq_dequeue+driver_encap without packet loss (see the sketch below). '1)' is the precondition for making '2)' and '3)' work. '2)' and '3)' together avoid ifnet serializer contention. '4)' is added because another experiment of mine showed that serious ipiq overflow can have a very bad impact on overall performance.
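To make the conversion for '1)' concrete, here is a rough sketch of what a driver's if_start loop could look like before and after. This is not code from the patch: the driver name "xxx", xxx_encap(), xxx_txd_free and XXX_TXD_RESERVE are made up for illustration; only the ifq_poll()/ifq_dequeue() calls are meant to match the existing ifq API. The point is that if the driver reserves enough free TX descriptors before dequeueing, encap cannot fail afterwards, so nothing is lost even though the mbuf is already off the queue.

/*
 * Minimal sketch only, not taken from the patch.  The driver "xxx",
 * xxx_encap(), xxx_txd_free and XXX_TXD_RESERVE are hypothetical.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/ifq_var.h>

#define XXX_TXD_RESERVE 4       /* worst-case TX descriptors per packet */

struct xxx_softc {
        int     xxx_txd_free;   /* free TX descriptors */
};

static int
xxx_encap(struct xxx_softc *sc, struct mbuf *m)
{
        /*
         * Real driver: load the DMA map and set up TX descriptors here.
         * The mbuf is owned by the TX ring on success and freed in txeof.
         */
        sc->xxx_txd_free--;
        return (0);
}

/* Old style: peek with ifq_poll(), dequeue only after encap succeeds. */
static void
xxx_start_poll(struct ifnet *ifp)
{
        struct xxx_softc *sc = ifp->if_softc;
        struct mbuf *m;

        while ((m = ifq_poll(&ifp->if_snd)) != NULL) {
                if (xxx_encap(sc, m) != 0) {
                        /* No TX descriptors left; mbuf stays queued. */
                        ifp->if_flags |= IFF_OACTIVE;
                        break;
                }
                ifq_dequeue(&ifp->if_snd, m);
        }
}

/*
 * New style: reserve enough TX descriptors up front, then dequeue and
 * encap; encap can no longer fail for lack of descriptors, so no packet
 * is dropped.
 */
static void
xxx_start_dequeue(struct ifnet *ifp)
{
        struct xxx_softc *sc = ifp->if_softc;
        struct mbuf *m;

        while (sc->xxx_txd_free >= XXX_TXD_RESERVE) {
                m = ifq_dequeue(&ifp->if_snd, NULL);
                if (m == NULL)
                        break;
                xxx_encap(sc, m);
        }
        if (sc->xxx_txd_free < XXX_TXD_RESERVE)
                ifp->if_flags |= IFF_OACTIVE;
}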
I have gathered the following data before and after the patch:
http://leaf.dragonflybsd.org/~sephe/data.txt
http://leaf.dragonflybsd.org/~sephe/data_patched.txt

boxa --+----> em0 -Routing box- em1 ----> msk0
boxb --+

boxa (2 x PIII 1G) and boxb (AthlonXP 2100+) each have one 32-bit PCI 82540 em. The fastest stream I could generate from these two boxes using pktgen is ~720Kpps (~370Kpps on boxb and ~350Kpps on boxa). The routing box (it has some problem with ppc, which generates interrupts @~4000/s :P) is an AthlonX2 3600+ with a 1000PT; it has no problem outputting @1400Kpps on a single interface using pktgen. msk0 has monitor turned on and has no problem accepting a stream @1400Kpps.

FF -- fast forwarding
"stream target cpu1" -- stream generated to be dispatched to CPU1 on the routing box
"stream target cpu0" -- stream generated to be dispatched to CPU0 on the routing box
"stream target cpu0/cpu1" -- stream generated to be evenly dispatched to CPU0 and CPU1 on the routing box

The stats are generated by:
netstat -w 1 -I msk0

Fast forwarding is improved a lot in the BGL case, probably because the time consumed on the input path is greatly reduced by if_start dispatching. This patch does introduce a regression in the MP-safe case when em0/em1 is on a different CPU than the packet's target CPU on the routing box, which may be caused by IPIs bouncing between the two CPUs (I haven't found the source of the problem yet). Fast forwarding performance drops a lot in the MP-safe case if em0 and em1 are on different CPUs; reducing serializer contention does help (~40Kpps improvement), but something still needs to be figured out.

So please review the patch. It is not finished yet, but the major part is done and I want to call for review before I stray too far.

Best Regards,
sephe

--
Live Free or Die