Sure.

1. Client: Use one of the following two setups:
        - a single bare-metal system (48 cores, 40G NIC).
          Run: "wrk -c 4800 -t 48 -d 30s http://<IP>:80/128", or,
        - 100 2-core VMs.
          Run "wrk -c 16 -t 2 -d 30s http://<IP>:80/128" from
          each VM and sum the results using a parallel-ssh setup
          (see the pssh sketch below).

2. HAProxy running on a single bare-metal system (same config as the
    client: 48 cores, 40G NIC, 4.17.13 kernel), with irqbalance killed
    and each NIC IRQ pinned to a different core of the same NUMA node
    (see the IRQ sketch below), and with the haproxy configuration file
    given in my first mail. Around 60 backend servers are configured
    in haproxy.

3. Backend servers are 2-core VMs running nginx, each serving
    a file called "/128", which is 128 bytes in size (see the dd
    command below).
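
For the 100-VM option in (1), the summary step is roughly the
following (the host file name and output directory are just what I
use locally, and passwordless ssh to the VMs is assumed):

        # run wrk on all 100 VMs in parallel, saving each VM's output
        pssh -h vms.txt -t 120 -o /tmp/wrk-out \
            "wrk -c 16 -t 2 -d 30s http://<IP>:80/128"
        # add up the per-VM Requests/sec figures
        grep -h "Requests/sec" /tmp/wrk-out/* | awk '{s += $2} END {print s}'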
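
For the IRQ tuning in (2), it is essentially the following (the
"mlx5" match and the 12-core wrap are specific to my setup; the cores
used should belong to the NIC's NUMA node):

        systemctl stop irqbalance
        # pin each NIC queue interrupt to its own core
        core=0
        for irq in $(awk '/mlx5/ {sub(":", "", $1); print $1}' /proc/interrupts); do
            echo $core > /proc/irq/$irq/smp_affinity_list
            core=$(( (core + 1) % 12 ))
        done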
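
The 128-byte object in (3) is just a file in the nginx document root
(the path below is the default docroot; adjust to your nginx config):

        dd if=/dev/zero of=/usr/share/nginx/html/128 bs=128 count=1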

Let me know if you need more information.

Thanks,
- Krishna


On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин <chipits...@gmail.com> wrote:

> Load testing like this is quite useful.
> Can you describe the overall setup? (I want to reproduce it and play with it)
>
> On Thu, Oct 4, 2018 at 8:16, Krishna Kumar (Engineering) <
> krishna...@flipkart.com>:
>
>> Re-sending in case this mail was missed. To summarise the 3 issues seen:
>>
>> 1. Performance drops 18x with a high number of nbthread as compared to
>>     nbproc.
>> 2. CPU utilisation remains at 100% for 30 seconds after wrk finishes (for
>>     1.9-dev3, with both nbproc and nbthread).
>> 3. Sockets on the client remain in FIN-WAIT-2, while on HAProxy they
>>     remain in CLOSE-WAIT (towards the clients) and ESTAB (towards the
>>     backend servers), till the server/client timeout expires.
>>
>> The tests for threads and processes were done on the same systems, so
>> there is no difference in system parameters.
>>
>> Thanks,
>> - Krishna
>>
>>
>> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) <
>> krishna...@flipkart.com> wrote:
>>
>>> Hi Willy, and community developers,
>>>
>>> I am not sure if I am doing something wrong, but wanted to report
>>> some issues that I am seeing. Please let me know if this is a problem.
>>>
>>> 1. HAProxy system:
>>> Kernel: 4.17.13
>>> CPU: 48-core E5-2670 v3
>>> Memory: 128GB
>>> NIC: Mellanox 40G with IRQ pinning
>>>
>>> 2. Client: 48 cores, similar to the server. Test command line:
>>> wrk -c 4800 -t 48 -d 30s http://<IP>:80/128
>>>
>>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout
>>>     as of Oct 2nd).
>>> # haproxy-git -vv
>>> HA-Proxy version 1.9-dev3 2018/09/29
>>> Copyright 2000-2018 Willy Tarreau <wi...@haproxy.org>
>>>
>>> Build options :
>>>   TARGET  = linux2628
>>>   CPU     = generic
>>>   CC      = gcc
>>>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
>>> -fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare
>>> -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
>>> -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
>>>   OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>>>
>>> Default settings :
>>>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
>>>
>>> Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>> Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>> OpenSSL library supports TLS extensions : yes
>>> OpenSSL library supports SNI : yes
>>> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
>>> Built with transparent proxy support using: IP_TRANSPARENT
>>> IPV6_TRANSPARENT IP_FREEBIND
>>> Encrypted password support via crypt(3): yes
>>> Built with multi-threading support.
>>> Built with PCRE version : 8.38 2015-11-23
>>> Running on PCRE version : 8.38 2015-11-23
>>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>> Built with zlib version : 1.2.8
>>> Running on zlib version : 1.2.8
>>> Compression algorithms supported : identity("identity"),
>>> deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
>>> Built with network namespace support.
>>>
>>> Available polling systems :
>>>       epoll : pref=300,  test result OK
>>>        poll : pref=200,  test result OK
>>>      select : pref=150,  test result OK
>>> Total: 3 (3 usable), will use epoll.
>>>
>>> Available multiplexer protocols :
>>> (protocols markes as <default> cannot be specified using 'proto' keyword)
>>>               h2 : mode=HTTP       side=FE
>>>        <default> : mode=TCP|HTTP   side=FE|BE
>>>
>>> Available filters :
>>> [SPOE] spoe
>>> [COMP] compression
>>> [TRACE] trace
>>>
>>> 4. HAProxy results for #processes and #threads:
>>>  #   Threads-RPS   Procs-RPS
>>>  1         20903       19280
>>>  2         46400       51045
>>>  4         96587      142801
>>>  8        172224      254720
>>> 12        210451      437488
>>> 16        173034      437375
>>> 24         79069      519367
>>> 32         55607      586367
>>> 48         31739      596148
>>>
>>> 5. Lock stats for 1.9-dev3: some write locks took a lot more time on
>>>     average to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
>>> Stats about Lock FD:
>>> # write lock  : 143933900
>>> # write unlock: 143933895 (-5)
>>> # wait time for write     : 11370.245 msec
>>> # wait time for write/lock: 78.996 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock TASK_RQ:
>>> # write lock  : 2062874
>>> # write unlock: 2062875 (1)
>>> # wait time for write     : 7820.234 msec
>>> # wait time for write/lock: 3790.941 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock TASK_WQ:
>>> # write lock  : 2601227
>>> # write unlock: 2601227 (0)
>>> # wait time for write     : 5019.811 msec
>>> # wait time for write/lock: 1929.786 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock POOL:
>>> # write lock  : 2823393
>>> # write unlock: 2823393 (0)
>>> # wait time for write     : 11984.706 msec
>>> # wait time for write/lock: 4244.788 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock LISTENER:
>>> # write lock  : 184
>>> # write unlock: 184 (0)
>>> # wait time for write     : 0.011 msec
>>> # wait time for write/lock: 60.554 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock PROXY:
>>> # write lock  : 291557
>>> # write unlock: 291557 (0)
>>> # wait time for write     : 109.694 msec
>>> # wait time for write/lock: 376.235 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock SERVER:
>>> # write lock  : 1188511
>>> # write unlock: 1188511 (0)
>>> # wait time for write     : 854.171 msec
>>> # wait time for write/lock: 718.690 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock LBPRM:
>>> # write lock  : 1184709
>>> # write unlock: 1184709 (0)
>>> # wait time for write     : 778.947 msec
>>> # wait time for write/lock: 657.501 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock BUF_WQ:
>>> # write lock  : 669247
>>> # write unlock: 669247 (0)
>>> # wait time for write     : 252.265 msec
>>> # wait time for write/lock: 376.939 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock STRMS:
>>> # write lock  : 9335
>>> # write unlock: 9335 (0)
>>> # wait time for write     : 0.910 msec
>>> # wait time for write/lock: 97.492 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
>>> Stats about Lock VARS:
>>> # write lock  : 901947
>>> # write unlock: 901947 (0)
>>> # wait time for write     : 299.224 msec
>>> # wait time for write/lock: 331.753 nsec
>>> # read lock   : 0
>>> # read unlock : 0 (0)
>>> # wait time for read      : 0.000 msec
>>> # wait time for read/lock : 0.000 nsec
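>>>
>>> (These per-lock statistics are only printed by a haproxy binary built
>>> with lock debugging; the build was along the lines of the following,
>>> with the USE_* options matching the -vv output above, though the exact
>>> command line is approximate:)
>>>
>>> # DEBUG_THREAD enables the lock instrumentation that produces the
>>> # "Stats about Lock ..." blocks above
>>> make TARGET=linux2628 USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1 \
>>>      DEBUG="-DDEBUG_THREAD"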
>>>
>>> 6. CPU utilization after the test, for processes/threads: haproxy-1.9-dev3
>>>     runs at 4800% (48 cpus) for 30 seconds after the test is done. For
>>>     1.8.14, this behavior was not seen. Ran the following command for both:
>>> "ss -tnp | awk '{print $1}' | sort | uniq -c | sort -n"
>>>     1.8.14 during test:
>>>       451 SYN-SENT
>>>      9166 ESTAB
>>>     1.8.14 after test:
>>>         2 ESTAB
>>>
>>>     1.9-dev3 during test:
>>>       109 SYN-SENT
>>>      9400 ESTAB
>>>     1.9-dev3 after test:
>>>      2185 CLOSE-WAIT
>>>      2187 ESTAB
>>>     All connections that were in CLOSE-WAIT were from the client, while
>>>     all connections in ESTAB state were to the servers. This lasted for
>>>     30 seconds. On the client system, all sockets were in FIN-WAIT-2:
>>>      2186 FIN-WAIT-2
>>>     This (2185/2186) seems to imply that the client closed the connection
>>>     but haproxy did not close the socket for 30 seconds. This also results
>>>     in high CPU utilization on haproxy for some reason (100% for each
>>>     process for 30 seconds), which is also unexpected as the remote side
>>>     has closed the socket.
>>>
>>> 7. Configuration file for process mode:
>>> global
>>>     daemon
>>>     maxconn 26000
>>>     nbproc 48
>>>     stats socket /var/run/ha-1-admin.sock mode 600 level admin process 1
>>>     # (and so on for 48 processes).
>>>
>>> defaults
>>>     option http-keep-alive
>>>     balance leastconn
>>>     retries 2
>>>     option redispatch
>>>     maxconn 25000
>>>     option splice-response
>>>     option tcp-smart-accept
>>>     option tcp-smart-connect
>>>     option splice-auto
>>>     timeout connect 5000ms
>>>     timeout client 30000ms
>>>     timeout server 30000ms
>>>     timeout client-fin 30000ms
>>>     timeout http-request 10000ms
>>>     timeout http-keep-alive 75000ms
>>>     timeout queue 10000ms
>>>     timeout tarpit 15000ms
>>>
>>> frontend fk-fe-upgrade-80
>>>     mode http
>>>     default_backend fk-be-upgrade
>>>     bind <VIP>:80 process 1
>>>     # (and so on for 48 processes).
>>>
>>> backend fk-be-upgrade
>>>     mode http
>>>     default-server maxconn 2000 slowstart
>>>     # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>
>>> 8. Configuration file for thread mode:
>>> global
>>>     daemon
>>>     maxconn 26000
>>>     stats socket /var/run/ha-1-admin.sock mode 600 level admin
>>>     nbproc 1
>>>     nbthread 48
>>>     # cpu-map auto:1/1-48 0-39
>>>
>>> defaults
>>>     option http-keep-alive
>>>     balance leastconn
>>>     retries 2
>>>     option redispatch
>>>     maxconn 25000
>>>     option splice-response
>>>     option tcp-smart-accept
>>>     option tcp-smart-connect
>>>     option splice-auto
>>>     timeout connect 5000ms
>>>     timeout client 30000ms
>>>     timeout server 30000ms
>>>     timeout client-fin 30000ms
>>>     timeout http-request 10000ms
>>>     timeout http-keep-alive 75000ms
>>>     timeout queue 10000ms
>>>     timeout tarpit 15000ms
>>>
>>> frontend fk-fe-upgrade-80
>>>     mode http
>>>     bind <VIP>:80 process 1/1-48
>>>     default_backend fk-be-upgrade
>>>
>>> backend fk-be-upgrade
>>>     mode http
>>>     default-server maxconn 2000 slowstart
>>>     # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>
>>> I had also captured 'perf' output for the system for threads vs processes;
>>> I can send it later if required.
>>>
>>> Thanks,
>>> - Krishna
>>>
>>
