On Tue, Jul 07, 2015 at 02:15:25PM +0530, Krishna Kumar (Engineering) wrote:
> > That's normal if you pinned the IRQs to cpus 0-47, you'd like to pin IRQs
> > only to the CPUs of the first socket (ie: 0, 2, 4, 6, ..., 46 if I
> > understand it right).
>
> I think that will still not solve the issue since packets can come on irq1
> and directly wired to cpu0, but the rx receive code path on cpu0 decides
> which cpu this is meant for, and gets that lock and still does ipi.
Unless I'm wrong, for me ixgbe does its own RSS and uses the Rx queues. Each
Rx queue is bound to an IRQ, and each IRQ may be delivered to a set of CPUs.
I'd say that I've never seen even an iota of CPU usage on the wrong CPU when
using ixgbe, so I'm quite sure that some of your IRQs are not properly bound,
which could be verified by simply running "grep eth" in /proc/interrupts,
though I can easily understand that it's not easy to read with 48 columns!

> I expected that packets for a flow will go to the correct cpu, and thought
> this could be related to the size of the intel flow director?
>
> root@1098366dea41:~# ethtool -S em1 | grep fdir_
>      fdir_match: 1596533665 (29%)
>      fdir_miss: 3908617542 (71%)
>      fdir_overflow: 25408
>
> The Intel document says: "And the Intel(R) Ethernet Flow Director
> Perfect-Match Filter Table has to be large enough to capture the
> unique flows a given controller would typically see. For that reason,
> Intel Ethernet Flow Director's Perfect-Match Filter Table has 8k entries."
>
> http://www.intel.in/content/dam/www/public/us/en/documents/white-papers/intel-ethernet-flow-director.pdf
>
> 8K is very small for haproxy serving many more connections.

No it's not that small because a flow is something ephemeral. Even when you
run with hundreds of thousands of concurrent connections, the time between
an outgoing ACK and the next data packet is short enough for the flow to be
matched, especially on a local network when it's a matter of hundreds of
microseconds at worst.

> While trying to simulate in a small lab setup, I ran 4 wrk's to haproxy,
> top showed 3 haproxy's running on cpus 2,4 and 8, and tx/rx counters of
> only 2,4,8 incremented during the run, and always consistently. So it
> worked as expected, but the load on the system was very low.
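By the way, the 29%/71% split you pasted can be recomputed directly from the
two counters. A minimal sketch, assuming the fdir_match/fdir_miss values
from "ethtool -S" are passed as arguments (fdir_hit_rate is just an
illustrative name, not an existing tool):

```shell
#!/bin/sh
# Sketch: hit rate of the Flow Director perfect-match table, computed
# from the fdir_match and fdir_miss counters reported by "ethtool -S".
# fdir_hit_rate is a made-up helper name for illustration.
fdir_hit_rate() {
    match=$1
    miss=$2
    # awk handles the large counter values and the division
    awk -v m="$match" -v n="$miss" \
        'BEGIN { printf "%.0f\n", 100 * m / (m + n) }'
}

# With the counters from the report above, this prints 29
fdir_hit_rate 1596533665 3908617542
```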
> In our production testing, thousand client VM's are doing 50K connections
> each to haproxy (running on 60 servers), and here, I noticed that haproxy
> on every server runs on 0,2,4..22, but rx/tx of all queues increment.

Just out of curiosity, why not configure the lab to reproduce the same setup
as the prod, ie 12 processes as well ?

> Assuming this is the issue, which I am not sure of, do you have any ideas
> how to get around this, or any other suggestions?

At the very least, check the interrupts per CPU. There I'm sure you'll find
that you missed some of them and that they're still delivered to the wrong
socket. Also keep in mind that a down-up cycle on an interface unbinds the
IRQs, so you have to rebind them by hand. This is something which easily
happens by mistake during troubleshooting sessions.

Regards,
Willy
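For the rebinding itself, something along these lines usually does the job.
This is only a sketch: cpus_to_mask is a made-up helper, and the final loop
(left commented out, it needs root and a live NIC) assumes your interface's
IRQs show up as "eth" lines in /proc/interrupts:

```shell
#!/bin/sh
# Sketch: build the hexadecimal CPU bitmask that /proc/irq/<n>/smp_affinity
# expects, from a plain list of CPU numbers. cpus_to_mask is an
# illustrative name, not a standard tool.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        # set the bit corresponding to each CPU number
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Even CPUs 0-46, i.e. the first socket in the layout discussed above:
# prints 555555555555
cpus_to_mask 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46

# Then, as root, apply the mask to every eth IRQ (or write a distinct
# per-queue mask if you want one queue per CPU):
#   for irq in $(awk -F: '/eth/ {gsub(/ /,"",$1); print $1}' /proc/interrupts)
#   do
#       echo 555555555555 > /proc/irq/$irq/smp_affinity
#   done
```

Remember this has to be redone after every down-up cycle on the interface,
since that unbinds the IRQs as mentioned above.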

