1. haproxy config: same as given above (both the process and thread
configurations were included in the mail).
2. nginx: default, no changes.
3. sysctls: nothing set. All other changes are as described earlier
(e.g. irqbalance killed, IRQ pinning, etc.).
4. nf_conntrack: disabled.
5. dmesg: no messages.

With the same system and settings, thread mode gives 18x lower RPS than
process mode, along with the other two issues described in my mail today.
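The checklist items above can be re-verified quickly from a shell. A minimal sketch, assuming a stock Linux host (tool availability and the irqbalance service name may differ by distro):

```shell
#!/bin/sh
# Sketch: re-verify checklist items on the haproxy box.
# Assumption: a stock Linux host; adjust for your distro.

# nf_conntrack should not be loaded (it was disabled for the test).
if lsmod 2>/dev/null | grep -q '^nf_conntrack'; then
    echo "nf_conntrack LOADED, max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
else
    echo "nf_conntrack not loaded"
fi

# dmesg should show no new messages around the test window.
dmesg 2>/dev/null | tail -n 5

# irqbalance should be killed when IRQs are pinned by hand, otherwise
# it will rewrite the affinity masks behind your back.
if pgrep -x irqbalance >/dev/null 2>&1; then
    echo "irqbalance still RUNNING"
else
    echo "irqbalance not running"
fi
```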


On Thu, Oct 4, 2018 at 12:09 PM Илья Шипицин <[email protected]> wrote:

> haproxy config, nginx config
> non default sysctl (if any)
>
> as a side note, can you have a look at "dmesg" output ? do you have nf
> conntrack enabled ? what are its limits ?
>
> On Thu, Oct 4, 2018 at 9:59, Krishna Kumar (Engineering) <
> [email protected]>:
>
>> Sure.
>>
>> 1. Client: Use one of the following two setups:
>>         - a single baremetal (48 core, 40g) system.
>>           Run: "wrk -c 4800 -t 48 -d 30s http://<IP>:80/128", or
>>         - 100 two-core VMs.
>>           Run "wrk -c 16 -t 2 -d 30s http://<IP>:80/128" from
>>           each VM and summarize the results using some
>>           parallel-ssh setup.
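The "summarize the results" step for the 100-VM variant can be sketched as below. The out/ directory, the vms.txt host list, and the sample numbers are hypothetical; on the real setup each file would be one VM's wrk output, collected with something like pssh -h vms.txt -o out/ '<wrk command>'. The awk pass then sums wrk's "Requests/sec" lines:

```shell
# Sketch: aggregate per-VM wrk results from an output directory.
# The two sample files below stand in for real per-VM wrk outputs.
mkdir -p out
printf 'Requests/sec:  1200.50\n' > out/vm1
printf 'Requests/sec:  1350.25\n' > out/vm2
awk '/^Requests\/sec:/ { total += $2 }
     END { printf "aggregate RPS: %.2f\n", total }' out/*
# prints: aggregate RPS: 2550.75
```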
>>
>> 2. HAProxy running on a single baremetal server (same system config
>>     as the client: 48 core, 40g, 4.17.13 kernel, IRQs pinned to
>>     different cores of the same NUMA node, irqbalance killed), with
>>     the haproxy configuration file as given in my first mail. Around
>>     60 backend servers are configured in haproxy.
>>
>> 3. Backend servers are two-core VMs running nginx and serving
>>     a file called "/128", which is 128 bytes in size.
>>
>> Let me know if you need more information.
>>
>> Thanks,
>> - Krishna
>>
>>
>> On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин <[email protected]>
>> wrote:
>>
>>> Load testing like this is useful.
>>> Can you describe the overall setup? (I want to reproduce and play
>>> with it.)
>>>
>>> On Thu, Oct 4, 2018 at 8:16, Krishna Kumar (Engineering) <
>>> [email protected]>:
>>>
>>>> Re-sending in case this mail was missed. To summarize the 3 issues seen:
>>>>
>>>> 1. Performance drops 18x with a higher number of nbthread as
>>>>     compared to nbproc.
>>>> 2. CPU utilization remains at 100% for 30 seconds after wrk finishes
>>>>     (on 1.9-dev3, for both nbproc and nbthread).
>>>> 3. Sockets on the client remain in FIN-WAIT-2, while on HAProxy they
>>>>     remain in either CLOSE-WAIT (towards clients) or ESTAB (towards
>>>>     the backend servers) until the server/client timeout expires.
>>>>
>>>> The tests for threads and processes were done on the same systems, so
>>>> there is
>>>> no difference in system parameters.
>>>>
>>>> Thanks,
>>>> - Krishna
>>>>
>>>>
>>>> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Willy, and community developers,
>>>>>
>>>>> I am not sure if I am doing something wrong, but wanted to report
>>>>> some issues that I am seeing. Please let me know if this is a problem.
>>>>>
>>>>> 1. HAProxy system:
>>>>> Kernel: 4.17.13,
>>>>> CPU: 48 core E5-2670 v3
>>>>> Memory: 128GB memory
>>>>> NIC: Mellanox 40g with IRQ pinning
>>>>>
>>>>> 2. Client: 48 cores, similar to the server. Test command line:
>>>>> wrk -c 4800 -t 48 -d 30s http://<IP>:80/128
>>>>>
>>>>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3
>>>>>     (git checkout as of Oct 2nd).
>>>>> # haproxy-git -vv
>>>>> HA-Proxy version 1.9-dev3 2018/09/29
>>>>> Copyright 2000-2018 Willy Tarreau <[email protected]>
>>>>>
>>>>> Build options :
>>>>>   TARGET  = linux2628
>>>>>   CPU     = generic
>>>>>   CC      = gcc
>>>>>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
>>>>> -fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare
>>>>> -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
>>>>> -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
>>>>>   OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>>>>>
>>>>> Default settings :
>>>>>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents =
>>>>> 200
>>>>>
>>>>> Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>>>> Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>>>> OpenSSL library supports TLS extensions : yes
>>>>> OpenSSL library supports SNI : yes
>>>>> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
>>>>> Built with transparent proxy support using: IP_TRANSPARENT
>>>>> IPV6_TRANSPARENT IP_FREEBIND
>>>>> Encrypted password support via crypt(3): yes
>>>>> Built with multi-threading support.
>>>>> Built with PCRE version : 8.38 2015-11-23
>>>>> Running on PCRE version : 8.38 2015-11-23
>>>>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>>>> Built with zlib version : 1.2.8
>>>>> Running on zlib version : 1.2.8
>>>>> Compression algorithms supported : identity("identity"),
>>>>> deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
>>>>> Built with network namespace support.
>>>>>
>>>>> Available polling systems :
>>>>>       epoll : pref=300,  test result OK
>>>>>        poll : pref=200,  test result OK
>>>>>      select : pref=150,  test result OK
>>>>> Total: 3 (3 usable), will use epoll.
>>>>>
>>>>> Available multiplexer protocols :
>>>>> (protocols markes as <default> cannot be specified using 'proto'
>>>>> keyword)
>>>>>               h2 : mode=HTTP       side=FE
>>>>>        <default> : mode=TCP|HTTP   side=FE|BE
>>>>>
>>>>> Available filters :
>>>>> [SPOE] spoe
>>>>> [COMP] compression
>>>>> [TRACE] trace
>>>>>
>>>>> 4. HAProxy results for #processes and #threads (RPS by worker count):
>>>>>     #     Threads-RPS    Procs-RPS
>>>>>     1           20903        19280
>>>>>     2           46400        51045
>>>>>     4           96587       142801
>>>>>     8          172224       254720
>>>>>     12         210451       437488
>>>>>     16         173034       437375
>>>>>     24          79069       519367
>>>>>     32          55607       586367
>>>>>     48          31739       596148
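The 18x drop mentioned at the top of the thread can be read straight off the last row of this table:

```shell
# Process-mode vs thread-mode RPS at 48 workers, from the table above.
awk 'BEGIN { printf "procs/threads at 48: %.1fx\n", 596148 / 31739 }'
# prints: procs/threads at 48: 18.8x
```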
>>>>>
>>>>> 5. Lock stats for 1.9-dev3: some write locks took much longer on
>>>>>     average to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads,
>>>>>     I get:
>>>>> Stats about Lock FD:
>>>>> # write lock  : 143933900
>>>>> # write unlock: 143933895 (-5)
>>>>> # wait time for write     : 11370.245 msec
>>>>> # wait time for write/lock: 78.996 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock TASK_RQ:
>>>>> # write lock  : 2062874
>>>>> # write unlock: 2062875 (1)
>>>>> # wait time for write     : 7820.234 msec
>>>>> # wait time for write/lock: 3790.941 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock TASK_WQ:
>>>>> # write lock  : 2601227
>>>>> # write unlock: 2601227 (0)
>>>>> # wait time for write     : 5019.811 msec
>>>>> # wait time for write/lock: 1929.786 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock POOL:
>>>>> # write lock  : 2823393
>>>>> # write unlock: 2823393 (0)
>>>>> # wait time for write     : 11984.706 msec
>>>>> # wait time for write/lock: 4244.788 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock LISTENER:
>>>>> # write lock  : 184
>>>>> # write unlock: 184 (0)
>>>>> # wait time for write     : 0.011 msec
>>>>> # wait time for write/lock: 60.554 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock PROXY:
>>>>> # write lock  : 291557
>>>>> # write unlock: 291557 (0)
>>>>> # wait time for write     : 109.694 msec
>>>>> # wait time for write/lock: 376.235 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock SERVER:
>>>>> # write lock  : 1188511
>>>>> # write unlock: 1188511 (0)
>>>>> # wait time for write     : 854.171 msec
>>>>> # wait time for write/lock: 718.690 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock LBPRM:
>>>>> # write lock  : 1184709
>>>>> # write unlock: 1184709 (0)
>>>>> # wait time for write     : 778.947 msec
>>>>> # wait time for write/lock: 657.501 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock BUF_WQ:
>>>>> # write lock  : 669247
>>>>> # write unlock: 669247 (0)
>>>>> # wait time for write     : 252.265 msec
>>>>> # wait time for write/lock: 376.939 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock STRMS:
>>>>> # write lock  : 9335
>>>>> # write unlock: 9335 (0)
>>>>> # wait time for write     : 0.910 msec
>>>>> # wait time for write/lock: 97.492 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
>>>>> Stats about Lock VARS:
>>>>> # write lock  : 901947
>>>>> # write unlock: 901947 (0)
>>>>> # wait time for write     : 299.224 msec
>>>>> # wait time for write/lock: 331.753 nsec
>>>>> # read lock   : 0
>>>>> # read unlock : 0 (0)
>>>>> # wait time for read      : 0.000 msec
>>>>> # wait time for read/lock : 0.000 nsec
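As a sanity check on how these counters relate: the "wait time for write/lock" lines are simply the total write-wait time divided by the write-lock count, e.g. for the FD lock above:

```shell
# per-lock wait = total write-wait time / number of write locks
# (FD lock figures from the stats above: 11370.245 msec over 143933900 locks)
awk 'BEGIN { printf "%.3f nsec\n", 11370.245e6 / 143933900 }'
# prints: 78.996 nsec
```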
>>>>>
>>>>> 6. CPU utilization after test for processes/threads:
>>>>>     haproxy-1.9-dev3 runs at 4800% (48 cpus) for 30 seconds after
>>>>>     the test is done. For 1.8.14, this behavior was not seen. Ran
>>>>>     the following command for both:
>>>>>         "ss -tnp | awk '{print $1}' | sort | uniq -c | sort -n"
>>>>>     1.8.14 during test:
>>>>>          451 SYN-SENT
>>>>>         9166 ESTAB
>>>>>     1.8.14 after test:
>>>>>            2 ESTAB
>>>>>
>>>>>     1.9-dev3 during test:
>>>>>          109 SYN-SENT
>>>>>         9400 ESTAB
>>>>>     1.9-dev3 after test:
>>>>>         2185 CLOSE-WAIT
>>>>>         2187 ESTAB
>>>>>     All connections in CLOSE-WAIT were from the client, while all
>>>>>     connections in ESTAB state were to the server. This lasted for
>>>>>     30 seconds. On the client system, all sockets were in
>>>>>     FIN-WAIT-2 state:
>>>>>         2186 FIN-WAIT-2
>>>>>     This (2185/2186) seems to imply that the client closed the
>>>>>     connection but haproxy did not close the socket for 30 seconds.
>>>>>     This also results in high CPU utilization on haproxy for some
>>>>>     reason (100% for each process for 30 seconds), which is also
>>>>>     unexpected as the remote side has closed the socket.
>>>>>
>>>>> 7. Configuration file for process mode:
>>>>> global
>>>>>     daemon
>>>>>     maxconn 26000
>>>>>     nbproc 48
>>>>>     stats socket /var/run/ha-1-admin.sock mode 600 level admin process 1
>>>>>     # (and so on for 48 processes).
>>>>>
>>>>> defaults
>>>>>     option http-keep-alive
>>>>>     balance leastconn
>>>>>     retries 2
>>>>>     option redispatch
>>>>>     maxconn 25000
>>>>>     option splice-response
>>>>>     option tcp-smart-accept
>>>>>     option tcp-smart-connect
>>>>>     option splice-auto
>>>>>     timeout connect 5000ms
>>>>>     timeout client 30000ms
>>>>>     timeout server 30000ms
>>>>>     timeout client-fin 30000ms
>>>>>     timeout http-request 10000ms
>>>>>     timeout http-keep-alive 75000ms
>>>>>     timeout queue 10000ms
>>>>>     timeout tarpit 15000ms
>>>>>
>>>>> frontend fk-fe-upgrade-80
>>>>>     mode http
>>>>>     default_backend fk-be-upgrade
>>>>>     bind <VIP>:80 process 1
>>>>>     # (and so on for 48 processes).
>>>>>
>>>>> backend fk-be-upgrade
>>>>>     mode http
>>>>>     default-server maxconn 2000 slowstart
>>>>>     # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>
>>>>> 8. Configuration file for thread mode:
>>>>> global
>>>>>     daemon
>>>>>     maxconn 26000
>>>>>     stats socket /var/run/ha-1-admin.sock mode 600 level admin
>>>>>     nbproc 1
>>>>>     nbthread 48
>>>>>     # cpu-map auto:1/1-48 0-39
>>>>>
>>>>> defaults
>>>>>     option http-keep-alive
>>>>>     balance leastconn
>>>>>     retries 2
>>>>>     option redispatch
>>>>>     maxconn 25000
>>>>>     option splice-response
>>>>>     option tcp-smart-accept
>>>>>     option tcp-smart-connect
>>>>>     option splice-auto
>>>>>     timeout connect 5000ms
>>>>>     timeout client 30000ms
>>>>>     timeout server 30000ms
>>>>>     timeout client-fin 30000ms
>>>>>     timeout http-request 10000ms
>>>>>     timeout http-keep-alive 75000ms
>>>>>     timeout queue 10000ms
>>>>>     timeout tarpit 15000ms
>>>>>
>>>>> frontend fk-fe-upgrade-80
>>>>>     mode http
>>>>>     bind <VIP>:80 process 1/1-48
>>>>>     default_backend fk-be-upgrade
>>>>>
>>>>> backend fk-be-upgrade
>>>>>     mode http
>>>>>     default-server maxconn 2000 slowstart
>>>>>     # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>
>>>>> I had also captured 'perf' output for the system for thread vs.
>>>>> process mode; I can send it later if required.
>>>>>
>>>>> Thanks,
>>>>> - Krishna
>>>>>
>>>>
