Hi Willy,

Thank you very much for the in-depth analysis and configuration
suggestions. I believe I have got the 3 key items to continue with,
based on your mail:
1. Thread pinning
2. Fix system IRQ pinning accordingly
3. Listen on all threads

I will post the configuration changes and the results.

Regards,
- Krishna

On Wed, Oct 10, 2018 at 6:39 PM Willy Tarreau <[email protected]> wrote:
> Hi Krishna,
>
> On Tue, Oct 02, 2018 at 09:18:19PM +0530, Krishna Kumar (Engineering) wrote:
> (...)
> > 1. HAProxy system:
> >    Kernel: 4.17.13
> >    CPU: 48 core E5-2670 v3
> >    Memory: 128GB
> >    NIC: Mellanox 40g with IRQ pinning
> >
> > 2. Client: 48 cores, similar to the server. Test command line:
> >    wrk -c 4800 -t 48 -d 30s http://<IP:80>/128
> >
> > 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git
> >    checkout as of Oct 2nd).
> >    # haproxy-git -vv
> >    HA-Proxy version 1.9-dev3 2018/09/29
> (...)
> > 4. HAProxy results for #processes and #threads:
> >
> >     #  Threads-RPS  Procs-RPS
> >     1        20903      19280
> >     2        46400      51045
> >     4        96587     142801
> >     8       172224     254720
> >    12       210451     437488
> >    16       173034     437375
> >    24        79069     519367
> >    32        55607     586367
> >    48        31739     596148
>
> Our largest thread test was on 12 cores, and it happens that in your
> case it's also the optimal count.
>
> However, I do have some comments about your config before going back
> to the real thread issues:
>
> > # cpu-map auto:1/1-48 0-39
>
> => You must absolutely pin your processes, and they must be pinned to
>    cores *not* shared with the network card. That's critical.
>    Moreover, it's also important that threads are not split across
>    multiple physical CPUs, because the remote L3 cache access time
>    over QPI/UPI is terrible. When you run 12 threads with two
>    12-core/24-thread CPUs, you could very well have haproxy using 12
>    threads from 6 cores, and the NIC using 12 threads from the other 6
>    cores of the same physical CPU. The second socket is, as usual,
>    useless for anything requiring low latency; however, it's perfect
>    for running SSL.
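The IRQ-pinning side of this advice can be sketched as follows. This is only an illustration: the interface name `eth0`, the IRQ number, and the choice of cores 0-11 for the NIC are assumptions, not values from the thread.

```shell
# Sketch: reserve cores 0-11 (ideally on the NIC's NUMA node) for the
# network card's IRQs, leaving the remaining cores for haproxy.
# eth0 and the IRQ number below are hypothetical examples.

# Which NUMA node are the NIC's PCIe lanes attached to?
# cat /sys/class/net/eth0/device/numa_node

# Build a CPU affinity mask covering cores 0-11:
mask=0
for cpu in $(seq 0 11); do
    mask=$(( mask | (1 << cpu) ))
done
printf '%x\n' "$mask"    # fff = cores 0-11

# Then, for each of the NIC's IRQs (listed in /proc/interrupts), as root:
# echo fff > /proc/irq/<irq-number>/smp_affinity
```

With the NIC's IRQs confined to one set of cores, haproxy's `cpu-map` can then target the disjoint set so the two never compete for the same L1/L2 caches.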
> So you could be interested in testing whether running the NIC on one
> socket (try to figure out which NUMA node the PCIe lanes are
> physically connected to) and haproxy on the other helps. It *could*
> be possible that you get more performance from 12 cores of each, but
> I strongly doubt it based on a number of tests. If you use SSL,
> however, it's different, as you will benefit from lots of cores much
> more than from low latency.
>
> > bind <VIP>:80 process 1/1-48
>
> => It's also critical for scalability to have individual bind lines.
>    Here you have a single socket accessed from all 48 threads, so
>    there is no efficient thread load balancing. By having this:
>
>      bind <VIP>:80 process 1/1
>      bind <VIP>:80 process 1/2
>      ...
>      bind <VIP>:80 process 1/47
>      bind <VIP>:80 process 1/48
>
>    you will let the kernel perform the load balancing and distribute
>    a fair load to all threads. This way none of them risks picking a
>    larger share of the incoming connections than optimal. I know it's
>    annoying to configure at the moment; I've been thinking about a
>    way to automatically iterate from a single config line (like the
>    "auto" feature of cpu-map), but for now it's not done.
>
> Now back to the thread measurements:
>
> > 5. Lock stats for 1.9-dev3: some write locks took a lot more time
> >    on average to acquire, e.g. "POOL" and "TASK_WQ". For 48
> >    threads, I get:
> >
> >    Stats about Lock FD:
> >    # write lock  : 143933900
> >    # write unlock: 143933895 (-5)
> >    # wait time for write     : 11370.245 msec
>
> This one definitely is huge. We know some work is still needed on
> this lock, and that there are still a few low-hanging fruits, but not
> much savings to expect short term. This output is very revealing,
> however, of the importance of this lock.
>
> >    # wait time for write/lock: 78.996 nsec
>
> That's roughly the time it takes to access the other CPU's cache, so
> using your two sockets for the same process definitely hurts a lot
> here.
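Putting the two suggestions together, the corresponding configuration could look roughly like this (a sketch only: the 12-thread count, the core range 12-23, and the `fe_http`/`be_servers` names are illustrative assumptions; `<VIP>` is a placeholder as in the mail, using the 1.8/1.9 `process`/`cpu-map` syntax):

```
global
    nbproc 1
    nbthread 12
    # Pin the 12 threads to 12 cores of one socket, away from the
    # cores handling the NIC's IRQs (core numbers are examples):
    cpu-map auto:1/1-12 12-23

frontend fe_http
    # One bind line per thread, so the kernel load-balances accepts
    # across threads instead of all threads contending on one socket:
    bind <VIP>:80 process 1/1
    bind <VIP>:80 process 1/2
    ...
    bind <VIP>:80 process 1/12
    default_backend be_servers
```

Each `bind` line creates its own listening socket for a single thread, which is what lets the kernel spread incoming connections fairly.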
> >    Stats about Lock TASK_RQ:
> >    # write lock  : 2062874
> >    # write unlock: 2062875 (1)
> >    # wait time for write     : 7820.234 msec
>
> This one is still far too large for what we'd hope, even though it
> has significantly shrunk since 1.8. It could be related to the poor
> distribution of the incoming connections across threads.
>
> >    # wait time for write/lock: 3790.941 nsec
>
> Wow, 3.8 microseconds to acquire the write lock is a lot! I'm
> starting to suspect some longer tree walks than expected. Again, lack
> of fairness between threads can make this significantly worse than it
> should be.
>
> >    Stats about Lock TASK_WQ:
> >    # write lock  : 2601227
> >    # write unlock: 2601227 (0)
> >    # wait time for write     : 5019.811 msec
> >    # wait time for write/lock: 1929.786 nsec
>
> For this one I had an idea that could significantly improve the
> situation. It's unlikely to make it into 1.9, though.
>
> >    Stats about Lock POOL:
> >    # write lock  : 2823393
> >    # write unlock: 2823393 (0)
> >    # wait time for write     : 11984.706 msec
>
> Interesting. Very interesting... I thought we had merged the
> lock-free pool code. Maybe I was mistaken, or I misunderstood
> something about the nature of the changes that went in early during
> the 1.9-dev cycle. This one is not expected to be hard to address,
> and I even see how we could go further with a small thread-local
> cache, since I developed such a thing a few years ago (though it was
> over-engineered) that could easily be retrofitted here.
>
> >    # wait time for write/lock: 4244.788 nsec
>
> Looks like a malloc() time. Maybe the measurement was huge during the
> ramp-up and amortized later.
>
> >    Stats about Lock PROXY:
> >    # write lock  : 291557
> >    # write unlock: 291557 (0)
> >    # wait time for write     : 109.694 msec
> >    # wait time for write/lock: 376.235 nsec
>
> Good to know; this could still be improved.
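As a sanity check, the per-lock figures in these stats are simply the total wait time divided by the number of acquisitions; for example, for the TASK_RQ lock quoted above:

```shell
# 7820.234 msec of total write-lock wait over 2062874 acquisitions,
# converted to nanoseconds per acquisition:
awk 'BEGIN { printf "%.3f nsec\n", 7820.234e6 / 2062874 }'
# prints: 3790.941 nsec
```

This matches the reported "wait time for write/lock" line, and the same arithmetic reproduces the 78.996 nsec figure for the FD lock.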
> >    Stats about Lock SERVER:
> >    # write lock  : 1188511
> >    # write unlock: 1188511 (0)
> >    # wait time for write     : 854.171 msec
> >    # wait time for write/lock: 718.690 nsec
>
> Same here!
>
> >    Stats about Lock LBPRM:
> >    # write lock  : 1184709
> >    # write unlock: 1184709 (0)
> >    # wait time for write     : 778.947 msec
> >    # wait time for write/lock: 657.501 nsec
>
> Here it will be less easy, because it depends on the LB algorithm.
>
> >    Stats about Lock BUF_WQ:
> >    # write lock  : 669247
> >    # write unlock: 669247 (0)
> >    # wait time for write     : 252.265 msec
> >    # wait time for write/lock: 376.939 nsec
>
> Post-1.9, I think we could make this one thread-local for the vast
> majority of buffer allocations. Short term, we could possibly improve
> the situation by avoiding taking the lock on all successes. It's
> tricky, but I think it's possible.
>
> >    Stats about Lock VARS:
> >    # write lock  : 901947
> >    # write unlock: 901947 (0)
> >    # wait time for write     : 299.224 msec
> >    # wait time for write/lock: 331.753 nsec
>
> This one is used by variables.
>
> > 6. CPU utilization after test for processes/threads: haproxy-1.9-dev3
> >    runs at 4800% (48 cpus) for 30 seconds after the test is done.
> >    For 1.8.14, this behavior was not seen.
>
> We had the exact opposite situation yesterday on another case. It
> really depends on the scheduling, but it's possible you've hit a bug
> in 1.9-dev3, though.
>
> (...)
> > 1.9-dev3 after test:
> >    2185 CLOSE-WAIT
> >    2187 ESTAB
> > All connections that were in CLOSE-WAIT were from the client.
>
> I suspect we still have a few cases where some client connections may
> only be checked upon client timeout.
>
> Many thanks, your test is very useful at this stage of the
> development! There definitely are some performance points we won't
> address for 1.9, but it confirms some observations and/or completes
> them.
>
> Best regards,
> Willy
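For reference, socket-state counts like the "2185 CLOSE-WAIT / 2187 ESTAB" figures quoted above are typically gathered with something along these lines (a sketch using iproute2's `ss`; the exact command used in the thread isn't shown):

```shell
# Count TCP sockets by state: list all TCP sockets numerically, skip
# the header line that `ss` prints, then tally the State column.
ss -tan | awk 'NR > 1 { print $1 }' | sort | uniq -c | sort -rn
```

Running this right after (and periodically during) a benchmark makes lingering CLOSE-WAIT connections like the ones above easy to spot.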

