On Wed, Mar 11, 2009 at 8:09 AM, Matthew Dillon <dil...@apollo.backplane.com> wrote:
>
> :On my Phenom9550 (2GB memory) w/ dual port 82571EB, one direction
> :forwarding, packets are evenly spread to each core.  INVARIANTS is
> :turned on in the kernel config (I don't think it makes much sense to
> :run a system without INVARIANTS).
> :
> :...
> :
> :For pure forwarding (i.e. no firewalling), the major bottleneck is on
> :the transmit path (as I have measured, if you use a spinlock on the
> :ifq, the whole forwarding can be choked).  Hopefully the ongoing
> :multi-queue work will make the situation much better.
> :
> :Best Regards,
> :sephe
>
> I wonder, would it make sense to make all ifq interactions within
> the kernel multi-cpu (one per cpu) even if the particular hardware
> does not have multiple queues?  It seems to me it would be fairly easy
> to do using bsfl() on a cpumask_t to locate cpus with non-empty queues.
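If I read this right, you mean something roughly like the toy below:
each ifq grows a per-cpu queue plus a cpumask of cpus that have packets
pending, and whichever cpu gets into the driver drains the queues with a
bsfl() scan.  All the names here (pending_mask, ifq_drain() and friends)
are made up just for illustration, it is a user-space toy, and the
locking/atomic ops are hand-waved away:

/*
 * User-space sketch of the per-cpu ifq idea: one staging queue per cpu
 * plus a bitmask of cpus whose queue is non-empty; the cpu that owns
 * the driver scans the mask with a bsfl()-style find-first-set to
 * drain them.  All names are invented for illustration only.
 */
#include <stdint.h>
#include <stdio.h>

#define NCPU            4
#define PKTS_PER_CPU    8

typedef uint32_t cpumask_t;

struct pcpu_ifq {
        int     head, tail;
        int     pkts[PKTS_PER_CPU];     /* stand-in for mbufs */
};

struct ifq {
        cpumask_t       pending_mask;   /* cpus with queued packets */
        struct pcpu_ifq pcpu[NCPU];
};

/* bsfl(): index of the lowest set bit, like the x86 instruction. */
static int
bsfl(cpumask_t mask)
{
        return (__builtin_ctz(mask));
}

/* Enqueue on the local cpu's queue; no cross-cpu lock on this path. */
static void
ifq_enqueue_pcpu(struct ifq *ifq, int cpuid, int pkt)
{
        struct pcpu_ifq *q = &ifq->pcpu[cpuid];

        q->pkts[q->tail++ % PKTS_PER_CPU] = pkt;
        ifq->pending_mask |= 1u << cpuid;  /* would need atomics or ipi */
}

/* Driver side: walk the mask with bsfl() until no cpu has packets. */
static void
ifq_drain(struct ifq *ifq)
{
        while (ifq->pending_mask != 0) {
                int cpuid = bsfl(ifq->pending_mask);
                struct pcpu_ifq *q = &ifq->pcpu[cpuid];

                while (q->head != q->tail)
                        printf("xmit pkt %d from cpu%d\n",
                            q->pkts[q->head++ % PKTS_PER_CPU], cpuid);
                ifq->pending_mask &= ~(1u << cpuid);
        }
}

int
main(void)
{
        struct ifq ifq = { 0 };

        ifq_enqueue_pcpu(&ifq, 2, 100);
        ifq_enqueue_pcpu(&ifq, 0, 101);
        ifq_drain(&ifq);
        return (0);
}

The appealing part would be that the enqueue side never touches a
shared lock; only the mask update (and waking up the driver) needs an
atomic op or an ipi.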
Yes, it is absolutely doable with the plain classic ifq.  IMHO, for the
other types of ifq the situation is different: even if we made the
internal mbuf queues per-cpu, updating the rest of the ifq's internal
state still needs some kind of protection.  The quickest ways that come
to mind are a spinlock or a serializer, but either one could eat up the
benefit we gain from the per-cpu mbuf queues in the first place.

>
> The kernel could then implement a protected entry point into the device
> driver to run the queue(s).  On any given packet heading out the interface
> the kernel would poll the spin lock and enter the device driver if it
> is able to get it.  If the spin lock cannot be acquired the device
> driver has already been entered into by another cpu and will pick up the
> packet on the per-cpu queue.

Aside from the per-cpu ifq, our current way is only slightly different
from what you describe.  In the current implementation, when a CPU
enqueues a packet to the ifq it first checks whether NIC TX has already
been started; if it has, the thread just enqueues the packet and keeps
going.  If NIC TX has not been started, the CPU marks it started and
tries to enter the NIC's serializer.  If the serializer try fails,
which indicates contention with the interrupt or polling code, the CPU
sends an IPI to the NIC's interrupt handling CPU or its polling CPU.
So besides the cost of the ifq serialization, the IPI sending I mention
above also has some cost under certain workloads.

I originally planned the following way to avoid the ifq serializer cost
and to amortize the IPI sending cost:  we add a small per-cpu, per-ifq
mbuf queue (holding 32 or 64 mbufs at most); the transmitting thread
just enqueues packets onto this queue, and once the queue overflows or
the current thread is going to sleep, one IPI is sent to the NIC's
interrupt CPU or its polling CPU.  The ifq enqueueing happens there, as
does the call to if_start().  (A rough sketch is at the very end of
this mail, below my signature.)  This may also help when the number of
hardware TX queues < ncpus2.  However, it has one requirement: all
network output must happen in the network threads (I have to say this
is also one of the major reasons I put the main parts of the TCP
callouts into the TCP threads).  Well, it is a vague idea currently ...

Best Regards,
sephe

--
Live Free or Die
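P.S.  The sketch referred to above: a user-space toy of the per-cpu,
per-ifq staging queue I have in mind.  Everything in it (the names,
STAGE_MAX, the printf standing in for the IPI and for if_start()) is
made up purely for illustration; nothing like this exists in the tree.

/*
 * Toy of the per-cpu staging queue: the transmit path stages mbufs
 * locally and only hands them over (one "ipi") when the staging queue
 * overflows or the network thread is about to sleep.  All names are
 * invented; the real ifq enqueue and if_start() are just printf here.
 */
#include <stdint.h>
#include <stdio.h>

#define STAGE_MAX       32              /* 32 (or 64) mbufs at most */

struct ifq_stage {
        int     npkts;
        int     pkts[STAGE_MAX];        /* stand-in for staged mbufs */
};

/*
 * Stand-in for sending an ipi to the NIC's interrupt/polling cpu.  In
 * the real thing the target cpu would do the actual ifq enqueue and
 * call if_start(); here we just print what would be handed over.
 */
static void
stage_flush_ipi(struct ifq_stage *st)
{
        int i;

        if (st->npkts == 0)
                return;
        printf("ipi -> NIC cpu: enqueue %d pkts to ifq, call if_start()\n",
            st->npkts);
        for (i = 0; i < st->npkts; ++i)
                printf("  ifq_enqueue(pkt %d)\n", st->pkts[i]);
        st->npkts = 0;
}

/*
 * Per-cpu transmit path: stage locally, flush only on overflow.  No
 * ifq serializer is touched here.
 */
static void
stage_enqueue(struct ifq_stage *st, int pkt)
{
        if (st->npkts == STAGE_MAX)
                stage_flush_ipi(st);
        st->pkts[st->npkts++] = pkt;
}

/* Called when the network thread is about to sleep. */
static void
stage_thread_sleep(struct ifq_stage *st)
{
        stage_flush_ipi(st);
}

int
main(void)
{
        struct ifq_stage st = { 0 };
        int i;

        for (i = 0; i < 40; ++i)        /* overflows once at 32 */
                stage_enqueue(&st, i);
        stage_thread_sleep(&st);        /* flush the remaining 8 */
        return (0);
}

The flush-on-sleep hook is why the scheme needs all network output to
happen in the network threads: that is the only well-defined point
where a cpu's staging queue is guaranteed to get drained.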