What I'm going to try (when I have some spare time) is sampling with
Google perftools:

https://github.com/gperftools/gperftools

They are great for CPU profiling.
You can try them yourself if you have the time/inclination :)
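A minimal sketch of what such a sampling run could look like (the libprofiler.so path, the haproxy invocation, and the output paths are all assumptions; this requires gperftools' libprofiler and the pprof tool to be installed):

```shell
# Hypothetical example: profile haproxy's CPU usage with gperftools.
# The libprofiler.so path varies by distro; adjust as needed.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so \
CPUPROFILE=/tmp/haproxy.prof \
haproxy -db -f /etc/haproxy/haproxy.cfg &
HAPROXY_PID=$!

# ... drive the load (e.g. the wrk command) against it here ...

# Stopping haproxy makes the profiler flush /tmp/haproxy.prof on exit
kill "$HAPROXY_PID"

# Show the top CPU consumers from the collected samples
pprof --text "$(command -v haproxy)" /tmp/haproxy.prof | head -20
```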


On Thu, Oct 4, 2018 at 11:53 AM Krishna Kumar (Engineering) <
[email protected]>:

> 1. haproxy config: Same as given above (both processes and threads were
> given in the mail)
> 2. nginx: default, no changes.
> 3. sysctl's: nothing set. All changes as described earlier (e.g.
> irqbalance, irq pinning, etc).
> 4. nf_conntrack: disabled
> 5. dmesg: no messages.
>
> With the same system and settings, threads give 18x lower RPS than
> processes, along with the other two issues described in my mail today.
>
>
> On Thu, Oct 4, 2018 at 12:09 PM Илья Шипицин <[email protected]> wrote:
>
>> haproxy config, nginx config
>> non default sysctl (if any)
>>
>> as a side note, can you have a look at the "dmesg" output? Do you have
>> nf_conntrack enabled? What are its limits?
>>
>> On Thu, Oct 4, 2018 at 9:59 AM Krishna Kumar (Engineering) <
>> [email protected]>:
>>
>>> Sure.
>>>
>>> 1. Client: Use one of the following two setups:
>>>         - a single baremetal (48 core, 40g) system
>>>           Run: "wrk -c 4800 -t 48 -d 30s http://<IP>:80/128", or,
>>>         - 100 2-core VMs.
>>>           Run "wrk -c 16 -t 2 -d 30s http://<IP>:80/128" from
>>>           each VM and summarize the results using some
>>>           parallel-ssh setup.
>>>
>>> 2. HAProxy running on a single baremetal (same system config
>>>     as the client: 48 core, 40g, 4.17.13 kernel, irq tuned to use
>>>     different cores of the same NUMA node for each irq, irqbalance
>>>     killed), with the haproxy configuration file as given in my first
>>>     mail. Around 60 backend servers are configured in haproxy.
>>>
>>> 3. Backend servers are 2-core VMs running nginx and serving
>>>     a file called "/128", which is 128 bytes in size.
>>>
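For the multi-VM variant in item 1, the per-VM results can be summed with a one-liner like this (a sketch; the sample data below stands in for the file that a parallel-ssh run would actually collect):

```shell
# Sum the "Requests/sec" lines from collected per-VM wrk outputs.
# /tmp/wrk-results.txt and its contents are placeholder sample data.
cat > /tmp/wrk-results.txt <<'EOF'
Requests/sec:  12000.50
Requests/sec:  11850.25
Requests/sec:  12100.00
EOF
awk '/^Requests\/sec/ { sum += $2 } END { printf "Total RPS: %.2f\n", sum }' /tmp/wrk-results.txt
```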
>>> Let me know if you need more information.
>>>
>>> Thanks,
>>> - Krishna
>>>
>>>
>>> On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин <[email protected]>
>>> wrote:
>>>
>>>> The load testing looks good.
>>>> Can you describe the overall setup? (I want to reproduce and play
>>>> with it)
>>>>
>>>> On Thu, Oct 4, 2018 at 8:16 AM Krishna Kumar (Engineering) <
>>>> [email protected]>:
>>>>
>>>>> Re-sending in case this mail was missed. To summarise, the 3 issues
>>>>> seen are:
>>>>>
>>>>> 1. Performance drops 18x with a higher number of nbthread as
>>>>>     compared to nbproc.
>>>>> 2. CPU utilisation remains at 100% for 30 seconds after wrk finishes
>>>>>     (for 1.9-dev3, with both nbproc and nbthread).
>>>>> 3. Sockets on the client remain in FIN-WAIT-2, while on HAProxy they
>>>>>     remain in either CLOSE-WAIT (towards clients) or ESTAB (towards
>>>>>     the backend servers), until the server/client timeout expires.
>>>>>
>>>>> The tests for threads and processes were done on the same systems, so
>>>>> there is
>>>>> no difference in system parameters.
>>>>>
>>>>> Thanks,
>>>>> - Krishna
>>>>>
>>>>>
>>>>> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Willy, and community developers,
>>>>>>
>>>>>> I am not sure if I am doing something wrong, but wanted to report
>>>>>> some issues that I am seeing. Please let me know if this is a problem.
>>>>>>
>>>>>> 1. HAProxy system:
>>>>>> Kernel: 4.17.13,
>>>>>> CPU: 48 core E5-2670 v3
>>>>>> Memory: 128GB memory
>>>>>> NIC: Mellanox 40g with IRQ pinning
>>>>>>
>>>>>> 2. Client, 48 core similar to server. Test command line:
>>>>>> wrk -c 4800 -t 48 -d 30s http://<IP:80>/128
>>>>>>
>>>>>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git
>>>>>> checkout as of
>>>>>>     Oct 2nd).
>>>>>> # haproxy-git -vv
>>>>>> HA-Proxy version 1.9-dev3 2018/09/29
>>>>>> Copyright 2000-2018 Willy Tarreau <[email protected]>
>>>>>>
>>>>>> Build options :
>>>>>>   TARGET  = linux2628
>>>>>>   CPU     = generic
>>>>>>   CC      = gcc
>>>>>>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
>>>>>> -fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare
>>>>>> -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
>>>>>> -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
>>>>>>   OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>>>>>>
>>>>>> Default settings :
>>>>>>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents =
>>>>>> 200
>>>>>>
>>>>>> Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>>>>> Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>>>>> OpenSSL library supports TLS extensions : yes
>>>>>> OpenSSL library supports SNI : yes
>>>>>> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
>>>>>> Built with transparent proxy support using: IP_TRANSPARENT
>>>>>> IPV6_TRANSPARENT IP_FREEBIND
>>>>>> Encrypted password support via crypt(3): yes
>>>>>> Built with multi-threading support.
>>>>>> Built with PCRE version : 8.38 2015-11-23
>>>>>> Running on PCRE version : 8.38 2015-11-23
>>>>>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>>>>> Built with zlib version : 1.2.8
>>>>>> Running on zlib version : 1.2.8
>>>>>> Compression algorithms supported : identity("identity"),
>>>>>> deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
>>>>>> Built with network namespace support.
>>>>>>
>>>>>> Available polling systems :
>>>>>>       epoll : pref=300,  test result OK
>>>>>>        poll : pref=200,  test result OK
>>>>>>      select : pref=150,  test result OK
>>>>>> Total: 3 (3 usable), will use epoll.
>>>>>>
>>>>>> Available multiplexer protocols :
>>>>>> (protocols markes as <default> cannot be specified using 'proto'
>>>>>> keyword)
>>>>>>               h2 : mode=HTTP       side=FE
>>>>>>        <default> : mode=TCP|HTTP   side=FE|BE
>>>>>>
>>>>>> Available filters :
>>>>>> [SPOE] spoe
>>>>>> [COMP] compression
>>>>>> [TRACE] trace
>>>>>>
>>>>>> 4. HAProxy results for #processes and #threads:
>>>>>>
>>>>>> #Workers   Threads-RPS   Procs-RPS
>>>>>>        1         20903       19280
>>>>>>        2         46400       51045
>>>>>>        4         96587      142801
>>>>>>        8        172224      254720
>>>>>>       12        210451      437488
>>>>>>       16        173034      437375
>>>>>>       24         79069      519367
>>>>>>       32         55607      586367
>>>>>>       48         31739      596148
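The 18x figure quoted earlier falls out of the 48-worker row above (plain arithmetic, numbers taken from the table):

```shell
# 48-worker RPS: processes (596148) vs threads (31739), from the table
awk 'BEGIN { printf "%.1fx\n", 596148 / 31739 }'
```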
>>>>>>
>>>>>> 5. Lock stats for 1.9-dev3: Some write locks on average took a lot
>>>>>> more time
>>>>>>    to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
>>>>>> Stats about Lock FD:
>>>>>> # write lock  : 143933900
>>>>>> # write unlock: 143933895 (-5)
>>>>>> # wait time for write     : 11370.245 msec
>>>>>> # wait time for write/lock: 78.996 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock TASK_RQ:
>>>>>> # write lock  : 2062874
>>>>>> # write unlock: 2062875 (1)
>>>>>> # wait time for write     : 7820.234 msec
>>>>>> # wait time for write/lock: 3790.941 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock TASK_WQ:
>>>>>> # write lock  : 2601227
>>>>>> # write unlock: 2601227 (0)
>>>>>> # wait time for write     : 5019.811 msec
>>>>>> # wait time for write/lock: 1929.786 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock POOL:
>>>>>> # write lock  : 2823393
>>>>>> # write unlock: 2823393 (0)
>>>>>> # wait time for write     : 11984.706 msec
>>>>>> # wait time for write/lock: 4244.788 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock LISTENER:
>>>>>> # write lock  : 184
>>>>>> # write unlock: 184 (0)
>>>>>> # wait time for write     : 0.011 msec
>>>>>> # wait time for write/lock: 60.554 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock PROXY:
>>>>>> # write lock  : 291557
>>>>>> # write unlock: 291557 (0)
>>>>>> # wait time for write     : 109.694 msec
>>>>>> # wait time for write/lock: 376.235 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock SERVER:
>>>>>> # write lock  : 1188511
>>>>>> # write unlock: 1188511 (0)
>>>>>> # wait time for write     : 854.171 msec
>>>>>> # wait time for write/lock: 718.690 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock LBPRM:
>>>>>> # write lock  : 1184709
>>>>>> # write unlock: 1184709 (0)
>>>>>> # wait time for write     : 778.947 msec
>>>>>> # wait time for write/lock: 657.501 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock BUF_WQ:
>>>>>> # write lock  : 669247
>>>>>> # write unlock: 669247 (0)
>>>>>> # wait time for write     : 252.265 msec
>>>>>> # wait time for write/lock: 376.939 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock STRMS:
>>>>>> # write lock  : 9335
>>>>>> # write unlock: 9335 (0)
>>>>>> # wait time for write     : 0.910 msec
>>>>>> # wait time for write/lock: 97.492 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>> Stats about Lock VARS:
>>>>>> # write lock  : 901947
>>>>>> # write unlock: 901947 (0)
>>>>>> # wait time for write     : 299.224 msec
>>>>>> # wait time for write/lock: 331.753 nsec
>>>>>> # read lock   : 0
>>>>>> # read unlock : 0 (0)
>>>>>> # wait time for read      : 0.000 msec
>>>>>> # wait time for read/lock : 0.000 nsec
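As a sanity check, the per-lock averages above are simply total wait time divided by lock count, e.g. for POOL (11984.706 msec over 2823393 write locks):

```shell
# POOL lock: total write-wait converted to nsec, divided by lock count
awk 'BEGIN { printf "%.0f nsec/lock\n", 11984.706e6 / 2823393 }'
```

which matches the 4244.788 nsec reported, and is roughly 50x the FD lock's 78.996 nsec average.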
>>>>>>
>>>>>> 6. CPU utilization after the test, for processes/threads:
>>>>>> haproxy-1.9-dev3 runs at 4800% (48 cpus) for 30 seconds after the
>>>>>> test is done; this behavior was not seen for 1.8.14. Ran the
>>>>>> following command for both:
>>>>>> "ss -tnp | awk '{print $1}' | sort | uniq -c | sort -n"
>>>>>>     1.8.14 during test:
>>>>>> 451 SYN-SENT
>>>>>> 9166 ESTAB
>>>>>>     1.8.14 after test:
>>>>>> 2 ESTAB
>>>>>>
>>>>>>     1.9-dev3 during test:
>>>>>> 109 SYN-SENT
>>>>>> 9400 ESTAB
>>>>>>     1.9-dev3 after test:
>>>>>> 2185 CLOSE-WAIT
>>>>>> 2187 ESTAB
>>>>>>     All connections in CLOSE-WAIT were from the client, while all
>>>>>>     connections in ESTAB state were to the server. This lasted for
>>>>>>     30 seconds. On the client system, all sockets were in
>>>>>>     FIN-WAIT-2 state:
>>>>>>     2186 FIN-WAIT-2
>>>>>>     This (2185/2186) seems to imply that the client closed the
>>>>>>     connection but haproxy did not close the socket for 30 seconds.
>>>>>>     It also results in high CPU utilization on haproxy for some
>>>>>>     reason (100% for each process for 30 seconds), which is
>>>>>>     unexpected as the remote side has closed the socket.
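The state-counting pipeline used above is generic; on canned input it behaves like this (the printf'd lines are placeholder samples standing in for real `ss -tnp` output):

```shell
# Count socket states by the first column of ss output (canned sample)
printf 'CLOSE-WAIT 0 0 a:80 b:5000\nESTAB 0 0 a:80 c:5001\nESTAB 0 0 a:80 d:5002\n' \
  | awk '{print $1}' | sort | uniq -c | sort -n
```

(On real `ss -tnp` output the header line also gets counted as one "State" entry, which is why it is harmless noise in the tallies above.)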
>>>>>>
>>>>>> 7. Configuration file for process mode:
>>>>>>
>>>>>> global
>>>>>>     daemon
>>>>>>     maxconn 26000
>>>>>>     nbproc 48
>>>>>>     stats socket /var/run/ha-1-admin.sock mode 600 level admin process 1
>>>>>>     # (and so on for 48 processes).
>>>>>>
>>>>>> defaults
>>>>>>     option http-keep-alive
>>>>>>     balance leastconn
>>>>>>     retries 2
>>>>>>     option redispatch
>>>>>>     maxconn 25000
>>>>>>     option splice-response
>>>>>>     option tcp-smart-accept
>>>>>>     option tcp-smart-connect
>>>>>>     option splice-auto
>>>>>>     timeout connect 5000ms
>>>>>>     timeout client 30000ms
>>>>>>     timeout server 30000ms
>>>>>>     timeout client-fin 30000ms
>>>>>>     timeout http-request 10000ms
>>>>>>     timeout http-keep-alive 75000ms
>>>>>>     timeout queue 10000ms
>>>>>>     timeout tarpit 15000ms
>>>>>>
>>>>>> frontend fk-fe-upgrade-80
>>>>>>     mode http
>>>>>>     default_backend fk-be-upgrade
>>>>>>     bind <VIP>:80 process 1
>>>>>>     # (and so on for 48 processes).
>>>>>>
>>>>>> backend fk-be-upgrade
>>>>>>     mode http
>>>>>>     default-server maxconn 2000 slowstart
>>>>>>     # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>>
>>>>>> 8. Configuration file for thread mode:
>>>>>>
>>>>>> global
>>>>>>     daemon
>>>>>>     maxconn 26000
>>>>>>     stats socket /var/run/ha-1-admin.sock mode 600 level admin
>>>>>>     nbproc 1
>>>>>>     nbthread 48
>>>>>>     # cpu-map auto:1/1-48 0-39
>>>>>>
>>>>>> defaults
>>>>>>     option http-keep-alive
>>>>>>     balance leastconn
>>>>>>     retries 2
>>>>>>     option redispatch
>>>>>>     maxconn 25000
>>>>>>     option splice-response
>>>>>>     option tcp-smart-accept
>>>>>>     option tcp-smart-connect
>>>>>>     option splice-auto
>>>>>>     timeout connect 5000ms
>>>>>>     timeout client 30000ms
>>>>>>     timeout server 30000ms
>>>>>>     timeout client-fin 30000ms
>>>>>>     timeout http-request 10000ms
>>>>>>     timeout http-keep-alive 75000ms
>>>>>>     timeout queue 10000ms
>>>>>>     timeout tarpit 15000ms
>>>>>>
>>>>>> frontend fk-fe-upgrade-80
>>>>>>     mode http
>>>>>>     bind <VIP>:80 process 1/1-48
>>>>>>     default_backend fk-be-upgrade
>>>>>>
>>>>>> backend fk-be-upgrade
>>>>>>     mode http
>>>>>>     default-server maxconn 2000 slowstart
>>>>>>     # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>>
>>>>>> I had also captured 'perf' output on the system for threads vs
>>>>>> processes; I can send it later if required.
>>>>>>
>>>>>> Thanks,
>>>>>> - Krishna
>>>>>>
>>>>>
