Hi Willy,

Thank you very much for the in-depth analysis and configuration
suggestions. I believe I have the three key items to continue with,
based on your mail:

1. Thread pinning
2. Fix system IRQ pinning accordingly
3. Listen on all threads

I will post the configuration changes and the results.

Regards,
- Krishna


On Wed, Oct 10, 2018 at 6:39 PM Willy Tarreau <[email protected]> wrote:

> Hi Krishna,
>
> On Tue, Oct 02, 2018 at 09:18:19PM +0530, Krishna Kumar (Engineering)
> wrote:
> (...)
> > 1. HAProxy system:
> > Kernel: 4.17.13,
> > CPU: 48 core E5-2670 v3
> > Memory: 128GB
> > NIC: Mellanox 40g with IRQ pinning
> >
> > 2. Client: 48 cores, similar to the server. Test command line:
> > wrk -c 4800 -t 48 -d 30s http://<IP:80>/128
> >
> > 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout
> >    as of Oct 2nd).
> > # haproxy-git -vv
> > HA-Proxy version 1.9-dev3 2018/09/29
> (...)
> > 4. HAProxy results for #processes and #threads
> >  #    Threads-RPS    Procs-RPS
> >  1          20903        19280
> >  2          46400        51045
> >  4          96587       142801
> >  8         172224       254720
> > 12         210451       437488
> > 16         173034       437375
> > 24          79069       519367
> > 32          55607       586367
> > 48          31739       596148
>
> Our largest thread test was on 12 cores, and it happens that in your case
> that's also the optimal count.
>
> However I do have some comments about your config, before going back to
> real thread issues:
>
> > # cpu-map auto:1/1-48 0-39
>   => you must absolutely pin your processes, and they must be pinned
>      to cores *not* shared with the network card. That's critical.
>      Moreover, it's important that threads are not split across
>      multiple physical CPUs, because the remote L3 cache access time
>      over QPI/UPI is terrible. When you run 12 threads on two
>      12-core/24-thread CPUs, you could very well have haproxy using
>      12 hardware threads from 6 cores, and the NIC using 12 hardware
>      threads from the other 6 cores of the same physical CPU. The
>      second socket is, as usual, useless for anything requiring low
>      latency; however, it's perfect for running SSL. So you could be
>      interested in testing with the NIC on one socket (try to figure
>      out which NUMA node its PCIe lanes are physically connected to)
>      and haproxy on the other one. It *could* be that you get more
>      performance from 12 cores of each, but I strongly doubt it based
>      on a number of tests. If you use SSL, however, it's different,
>      as you will benefit from lots of cores much more than from low
>      latency.
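The pinning described above can be sketched as a config fragment. This is an illustration only, not from the mail: the CPU numbers are assumptions for a dual-socket 12-core/24-thread box, so check the actual topology with "lscpu -e" before copying anything.

```
# Sketch only: 24 threads pinned 1:1 to the 24 hardware threads of
# socket 0 (assumed here to be CPUs 0-11 plus HT siblings 24-35),
# leaving socket 1 free for the NIC IRQs or SSL work.
nbproc 1
nbthread 24
cpu-map auto:1/1-24 0-11 24-35
```

Which NUMA node the NIC's PCIe lanes hang off can usually be read from /sys/class/net/&lt;iface&gt;/device/numa_node, so its IRQs can then be pinned to cores haproxy does not use.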
>
> > bind <VIP>:80 process 1/1-48
>   => it's also critical for scalability to have individual bind lines.
>      Here you have a single socket accessed from all 48 threads, so
>      there's no efficient thread load balancing. By having this:
>
>      bind <VIP>:80 process 1/1
>      bind <VIP>:80 process 1/2
>      ...
>      bind <VIP>:80 process 1/47
>      bind <VIP>:80 process 1/48
>
>      you will let the kernel perform the load balancing and distribute a
>      fair share of the load to all threads. This way, none of them risks
>      picking a larger share of the incoming connections than is optimal.
>      I know it's annoying to configure at the moment; I've been thinking
>      about a way to automatically iterate from a single config line (like
>      the "auto" feature of cpu-map), but for now it's not done.
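Until such an "auto" iteration exists, the 48 per-thread bind lines don't have to be typed by hand; a short shell loop can generate them (a sketch, keeping the <VIP> placeholder from the mail):

```shell
# Emit one "bind" line per thread so each thread gets its own listening
# socket and the kernel load-balances incoming connections between them.
for t in $(seq 1 48); do
  printf 'bind <VIP>:80 process 1/%d\n' "$t"
done
```

The output can then be pasted into (or included from) the frontend section of the configuration.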
>
> Now back to the thread measurements:
>
> > 5. Lock stats for 1.9-dev3: Some write locks on average took a lot more
> >    time to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
> > Stats about Lock FD:
> > # write lock  : 143933900
> > # write unlock: 143933895 (-5)
> > # wait time for write     : 11370.245 msec
>
> This one definitely is huge. We know some work is still needed on this
> lock, and while there are still a few low-hanging fruits, there are not
> many savings to expect short term. This output is very revealing, however,
> of the importance of this lock.
>
> > # wait time for write/lock: 78.996 nsec
>
> That's roughly the time it takes to access the other CPU's cache, so using
> your two sockets for the same process definitely hurts a lot here.
>
> > Stats about Lock TASK_RQ:
> > # write lock  : 2062874
> > # write unlock: 2062875 (1)
> > # wait time for write     : 7820.234 msec
>
> This one is still far too large for what we'd hope, even though it
> has significantly shrunk since 1.8. It could be related to the poor
> distribution of the incoming connections across threads.
>
> > # wait time for write/lock: 3790.941 nsec
>
> Wow, 3.8 microseconds to acquire the write lock is a lot! I'm starting
> to suspect some longer tree walks than expected. Again, lack of fairness
> between threads can make this significantly worse than it should be.
>
> > Stats about Lock TASK_WQ:
> > # write lock  : 2601227
> > # write unlock: 2601227 (0)
> > # wait time for write     : 5019.811 msec
> > # wait time for write/lock: 1929.786 nsec
>
> For this one I had an idea that could significantly improve the
> situation, but it's unlikely to make it into 1.9.
>
> > Stats about Lock POOL:
> > # write lock  : 2823393
> > # write unlock: 2823393 (0)
> > # wait time for write     : 11984.706 msec
>
> Interesting. Very interesting... I thought we had merged the lock-free
> pool code; maybe I was mistaken, or I misunderstood something about the
> nature of the changes that went in early during the 1.9-dev cycle. This
> one is not expected to be hard to address, and I even see how we could go
> further with a small thread-local cache: I developed such a thing a few
> years ago (though it was over-engineered), and it could easily be
> retrofitted here.
>
> > # wait time for write/lock: 4244.788 nsec
>
> Looks like a malloc() time. Maybe the measurement was huge during the
> ramp-up and amortized later.
>
> > Stats about Lock PROXY:
> > # write lock  : 291557
> > # write unlock: 291557 (0)
> > # wait time for write     : 109.694 msec
> > # wait time for write/lock: 376.235 nsec
>
> Good to know; it could still be improved.
>
> > Stats about Lock SERVER:
> > # write lock  : 1188511
> > # write unlock: 1188511 (0)
> > # wait time for write     : 854.171 msec
> > # wait time for write/lock: 718.690 nsec
>
> Same here!
>
> > Stats about Lock LBPRM:
> > # write lock  : 1184709
> > # write unlock: 1184709 (0)
> > # wait time for write     : 778.947 msec
> > # wait time for write/lock: 657.501 nsec
>
> Here it will be less easy because it depends on the LB
> algorithm.
>
> > Stats about Lock BUF_WQ:
> > # write lock  : 669247
> > # write unlock: 669247 (0)
> > # wait time for write     : 252.265 msec
> > # wait time for write/lock: 376.939 nsec
>
> Post-1.9, I think we could make this one thread-local for the vast
> majority of buffer allocations. Short term, we could possibly improve
> the situation by avoiding the lock for successful allocations. It's
> tricky, but I think it's possible.
>
> > Stats about Lock VARS:
> > # write lock  : 901947
> > # write unlock: 901947 (0)
> > # wait time for write     : 299.224 msec
> > # wait time for write/lock: 331.753 nsec
>
> This one is used by variables.
>
> > 6. CPU utilization after the test for processes/threads:
> >    haproxy-1.9-dev3 runs at 4800% (48 CPUs) for 30 seconds after the
> >    test is done. For 1.8.14, this behavior was not seen.
>
> We had the exact opposite situation yesterday in another case. It really
> depends on the scheduling, but it's possible you've hit a bug in
> 1.9-dev3.
>
> (...)
> > 1.9-dev3 after test:
> > 2185 CLOSE-WAIT
> > 2187 ESTAB
> >     All connections that were in CLOSE-WAIT were from the client,
>
> I suspect we still have a few cases where some client connections may
> only be checked upon client timeout.
>
> Many thanks, your test is very useful at this stage of development!
> There definitely are some performance points we won't address for 1.9,
> but it confirms and/or completes some of our observations.
>
> Best regards,
> Willy
>
