Hi Krishna,

On Tue, Oct 02, 2018 at 09:18:19PM +0530, Krishna Kumar (Engineering) wrote:
(...)
> 1. HAProxy system:
> Kernel: 4.17.13,
> CPU: 48 core E5-2670 v3
> Memory: 128GB memory
> NIC: Mellanox 40g with IRQ pinning
> 
> 2. Client, 48 core similar to server. Test command line:
> wrk -c 4800 -t 48 -d 30s http://<IP:80>/128
> 
> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout
>     as of Oct 2nd).
> # haproxy-git -vv
> HA-Proxy version 1.9-dev3 2018/09/29
(...)
> 4. HAProxy results for #processes and #threads
> #    Threads-RPS  Procs-RPS
> 1          20903      19280
> 2          46400      51045
> 4          96587     142801
> 8         172224     254720
> 12        210451     437488
> 16        173034     437375
> 24         79069     519367
> 32         55607     586367
> 48         31739     596148

Our largest thread test was on 12 cores and it happens that in your case
it's also the optimal one.

However I do have some comments about your config, before going back to
real thread issues :

> # cpu-map auto:1/1-48 0-39
  => you must absolutely pin your processes, and they must be pinned
     to cores *not* shared with the network card. That's critical.
     Moreover it's also important that threads are not split across
     multiple physical CPUs because the remote L3 cache access time
     over QPI/UPI is terrible. When you run on 12 threads with two
     12-cores/24-threads CPUs, you could very well have haproxy using
     12 threads from 6 cores, and the NIC using 12 threads from the
     other 6 cores of the same physical CPU. The second socket is,
     as usual, useless for anything requiring low latency. However
     it's perfect for running SSL. So it could be worth testing with the
     NIC on one socket (try to figure out which NUMA node its PCIe lanes
     are physically connected to) and haproxy on the other one. It *could*
     be that you get more performance by using 12 cores from each socket,
     but I strongly doubt it based on a number of past tests. If you use
     SSL however, the picture is different, since you then benefit much
     more from a large number of cores than from low latency.
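
     Just as an illustration (the logical CPU numbering and the socket the
     NIC is attached to are assumptions you would have to verify on your
     machine), pinning 12 threads to one socket while leaving that socket's
     other cores to the NIC could look like this :

     global
         nbthread 12
         # threads 1-12 pinned one-to-one to CPUs 0-11, assumed here to
         # all belong to socket 0 (check "lscpu" for the real layout)
         cpu-map auto:1/1-12 0-11

     The NIC's IRQs would then be steered, outside of haproxy (e.g. via
     /proc/irq/*/smp_affinity or the vendor's IRQ affinity script), to the
     remaining CPUs of that same socket; on Linux,
     /sys/class/net/<iface>/device/numa_node usually reports which NUMA
     node the NIC's PCIe lanes are connected to.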

> bind <VIP>:80 process 1/1-48
  => it's also essential for scalability to have individual bind lines. Here
     you have a single listening socket accessed by all 48 threads, so there
     is no efficient load balancing between threads. By having this :

     bind <VIP>:80 process 1/1
     bind <VIP>:80 process 1/2
     ...
     bind <VIP>:80 process 1/47
     bind <VIP>:80 process 1/48

     You will let the kernel perform the load balancing and distribute a
     fair share of the load to each thread. This way none of them risks
     picking a larger share of the incoming connections than optimal. I
     know it's annoying to configure at the moment; I've been thinking
     about a way to automatically iterate from a single config line (like
     the "auto" feature of cpu-map), but for now it's not done.

Now back to the thread measurements :

> 5. Lock stats for 1.9-dev3: Some write locks on average took a lot more time
>    to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
> Stats about Lock FD:
> # write lock  : 143933900
> # write unlock: 143933895 (-5)
> # wait time for write     : 11370.245 msec

This one definitely is huge. We know some work is still needed on this
lock; there are still a few low-hanging fruits, but not much more in
savings to expect short term. This output is however very revealing of
how important this lock is.

> # wait time for write/lock: 78.996 nsec

That's roughly the time it takes to access the other CPU's cache, so using
your two sockets for the same process definitely hurts a lot here.

> Stats about Lock TASK_RQ:
> # write lock  : 2062874
> # write unlock: 2062875 (1)
> # wait time for write     : 7820.234 msec

This one is still far too large for what we'd hope, even though it
has significantly shrunk since 1.8. It could be related to the poor
distribution of the incoming connections across threads.

> # wait time for write/lock: 3790.941 nsec

Wow, 3.8 microseconds to acquire the write lock is a lot! I'm starting
to suspect some longer tree walks than expected. Again, lack of fairness
between threads can make this significantly worse than it should be.

> Stats about Lock TASK_WQ:
> # write lock  : 2601227
> # write unlock: 2601227 (0)
> # wait time for write     : 5019.811 msec
> # wait time for write/lock: 1929.786 nsec

For this one I had an idea that could significantly improve the situation.
It's unlikely to make it into 1.9 though.

> Stats about Lock POOL:
> # write lock  : 2823393
> # write unlock: 2823393 (0)
> # wait time for write     : 11984.706 msec

Interesting. Very interesting... I thought we had merged the lock-free
pool code. Maybe I was mistaken, or I misunderstood something about the
nature of the changes that went in early during the 1.9-dev cycle. This
one is not expected to be hard to address, and I can even see how we
could go further with a small thread-local cache: I developed such a
thing a few years ago (though it was over-engineered) and it could
easily be retrofitted here.

> # wait time for write/lock: 4244.788 nsec

Looks like a malloc() time. Maybe the measurement was huge during the
ramp-up and amortized later.

> Stats about Lock PROXY:
> # write lock  : 291557
> # write unlock: 291557 (0)
> # wait time for write     : 109.694 msec
> # wait time for write/lock: 376.235 nsec

Good to know, could still be improved.

> Stats about Lock SERVER:
> # write lock  : 1188511
> # write unlock: 1188511 (0)
> # wait time for write     : 854.171 msec
> # wait time for write/lock: 718.690 nsec

Same here!

> Stats about Lock LBPRM:
> # write lock  : 1184709
> # write unlock: 1184709 (0)
> # wait time for write     : 778.947 msec
> # wait time for write/lock: 657.501 nsec

Here it will be less easy because it depends on the LB
algorithm.

> Stats about Lock BUF_WQ:
> # write lock  : 669247
> # write unlock: 669247 (0)
> # wait time for write     : 252.265 msec
> # wait time for write/lock: 376.939 nsec

Post-1.9 I think we could make this one thread-local for the vast majority
of buffer allocations. Short term, we could possibly improve the situation
by avoiding taking the lock on successful allocations. It's tricky but I
think it's possible.

> Stats about Lock VARS:
> # write lock  : 901947
> # write unlock: 901947 (0)
> # wait time for write     : 299.224 msec
> # wait time for write/lock: 331.753 nsec

This one is used by variables.

> 6. CPU utilization after test for processes/threads: haproxy-1.9-dev3 runs
> at 4800% (48 cpus) for 30 seconds after the test is done. For 1.8.14,
> this behavior was not seen.

We had the exact opposite situation yesterday on another case. It really
depends on the scheduling, but it's possible that you've hit a bug in
1.9-dev3.

(...)
> 1.9-dev3 after test:
> 2185 CLOSE-WAIT
> 2187 ESTAB
>     All connections that were in CLOSE-WAIT were from the client,

I suspect we still have a few cases where some client connections may
only be checked upon client timeout.

Many thanks, your test is very useful at this stage of the development!
There are definitely some performance points we won't address for 1.9,
but your test confirms some of our observations and completes others.

Best regards,
Willy
