What I am going to try (when I have some spare time) is sampling with google perftools:
https://github.com/gperftools/gperftools
They are great at CPU profiling. You can try them yourself if you have the time and the inclination :)
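For example, a run could look roughly like this (a sketch only; the library path and binary locations are assumptions about the install):

    # Preload the gperftools sampling CPU profiler into haproxy;
    # CPUPROFILE names the output file. Run in the foreground (-db)
    # while wrk drives load, then stop haproxy cleanly so the
    # profile is flushed on exit.
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so \
    CPUPROFILE=/tmp/haproxy.prof haproxy -db -f /etc/haproxy/haproxy.cfg

    # Summarize the hottest functions from the collected samples.
    pprof --text /usr/sbin/haproxy /tmp/haproxy.prof | head -30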
On Thu, 4 Oct 2018 at 11:53, Krishna Kumar (Engineering) <[email protected]> wrote:

> 1. haproxy config: same as given above (both the process and thread
>    configurations were given in the mail).
> 2. nginx: default, no changes.
> 3. sysctls: nothing set. All changes as described earlier (e.g.
>    irqbalance, irq pinning, etc.).
> 4. nf_conntrack: disabled.
> 5. dmesg: no messages.
>
> With the same system and settings, threads give 18x lower RPS than
> processes, along with the other 2 issues described in my mail today.
>
> On Thu, Oct 4, 2018 at 12:09 PM Илья Шипицин <[email protected]> wrote:
>
>> haproxy config, nginx config
>> non-default sysctls (if any)
>>
>> As a side note, can you have a look at the "dmesg" output? Do you
>> have nf_conntrack enabled? What are its limits?
>>
>> On Thu, 4 Oct 2018 at 9:59, Krishna Kumar (Engineering)
>> <[email protected]> wrote:
>>
>>> Sure.
>>>
>>> 1. Client: use one of the following two setups:
>>>    - a single baremetal (48 core, 40g) system;
>>>      run: "wrk -c 4800 -t 48 -d 30s http://<IP>:80/128", or
>>>    - 100 2-core VMs;
>>>      run "wrk -c 16 -t 2 -d 30s http://<IP>:80/128" from each VM
>>>      and summarize the results using some parallel-ssh setup.
>>>
>>> 2. HAProxy running on a single baremetal (same system config as the
>>>    client: 48 core, 40g, 4.17.13 kernel, IRQs tuned to use different
>>>    cores of the same NUMA node, irqbalance killed; a sketch follows
>>>    below), with the haproxy configuration file as given in my first
>>>    mail. Around 60 backend servers are configured in haproxy.
>>>
>>> 3. Backend servers are 2-core VMs running nginx and serving a file
>>>    called "/128", which is 128 bytes in size.
>>>
>>> Let me know if you need more information.
>>>
>>> Thanks,
>>> - Krishna
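>>> A minimal sketch of the IRQ pinning mentioned in (2), purely
>>> illustrative: it assumes the NIC's queue interrupts show up as
>>> "mlx" entries in /proc/interrupts and that cores 0-11 belong to
>>> one NUMA node.
>>>
>>>     # Stop irqbalance first so it cannot override the manual
>>>     # affinity settings below (run as root).
>>>     systemctl stop irqbalance
>>>
>>>     # Round-robin each NIC queue interrupt over the cores of the
>>>     # chosen NUMA node.
>>>     cpu=0
>>>     for irq in $(awk '/mlx/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
>>>         echo "$cpu" > "/proc/irq/$irq/smp_affinity_list"
>>>         cpu=$(( (cpu + 1) % 12 ))
>>>     done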
>>> On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин <[email protected]> wrote:
>>>
>>>> Load testing is a reasonable approach.
>>>> Can you describe the overall setup? (I want to reproduce it and
>>>> play with it.)
>>>>
>>>> On Thu, 4 Oct 2018 at 8:16, Krishna Kumar (Engineering)
>>>> <[email protected]> wrote:
>>>>
>>>>> Re-sending in case this mail was missed. To summarise the 3 issues
>>>>> seen:
>>>>>
>>>>> 1. Performance drops 18x with a higher nbthread count as compared
>>>>>    to nbproc.
>>>>> 2. CPU utilisation remains at 100% for 30 seconds after wrk
>>>>>    finishes (on 1.9-dev3, for both nbproc and nbthread).
>>>>> 3. Sockets on the client remain in FIN-WAIT-2, while on HAProxy
>>>>>    they remain in either CLOSE-WAIT (towards clients) or ESTAB
>>>>>    (towards the backend servers) until the server/client timeout
>>>>>    expires.
>>>>>
>>>>> The tests for threads and processes were done on the same systems,
>>>>> so there is no difference in system parameters.
>>>>>
>>>>> Thanks,
>>>>> - Krishna
>>>>>
>>>>> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering)
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Willy, and community developers,
>>>>>>
>>>>>> I am not sure if I am doing something wrong, but I wanted to
>>>>>> report some issues that I am seeing. Please let me know if this
>>>>>> is a problem.
>>>>>>
>>>>>> 1. HAProxy system:
>>>>>>    Kernel: 4.17.13
>>>>>>    CPU: 48 core E5-2670 v3
>>>>>>    Memory: 128GB
>>>>>>    NIC: Mellanox 40g with IRQ pinning
>>>>>>
>>>>>> 2. Client: 48 core, similar to the server. Test command line:
>>>>>>    wrk -c 4800 -t 48 -d 30s http://<IP>:80/128
>>>>>>
>>>>>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3
>>>>>>    (git checkout as of Oct 2nd).
>>>>>>
>>>>>>    # haproxy-git -vv
>>>>>>    HA-Proxy version 1.9-dev3 2018/09/29
>>>>>>    Copyright 2000-2018 Willy Tarreau <[email protected]>
>>>>>>
>>>>>>    Build options :
>>>>>>      TARGET  = linux2628
>>>>>>      CPU     = generic
>>>>>>      CC      = gcc
>>>>>>      CFLAGS  = -O2 -g -fno-strict-aliasing
>>>>>>                -Wdeclaration-after-statement -fwrapv
>>>>>>                -fno-strict-overflow -Wno-unused-label
>>>>>>                -Wno-sign-compare -Wno-unused-parameter
>>>>>>                -Wno-old-style-declaration -Wno-ignored-qualifiers
>>>>>>                -Wno-clobbered -Wno-missing-field-initializers
>>>>>>                -Wtype-limits
>>>>>>      OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>>>>>>
>>>>>>    Default settings :
>>>>>>      maxconn = 2000, bufsize = 16384, maxrewrite = 1024,
>>>>>>      maxpollevents = 200
>>>>>>
>>>>>>    Built with OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
>>>>>>    Running on OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
>>>>>>    OpenSSL library supports TLS extensions : yes
>>>>>>    OpenSSL library supports SNI : yes
>>>>>>    OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
>>>>>>    Built with transparent proxy support using: IP_TRANSPARENT
>>>>>>    IPV6_TRANSPARENT IP_FREEBIND
>>>>>>    Encrypted password support via crypt(3): yes
>>>>>>    Built with multi-threading support.
>>>>>>    Built with PCRE version : 8.38 2015-11-23
>>>>>>    Running on PCRE version : 8.38 2015-11-23
>>>>>>    PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>>>>>    Built with zlib version : 1.2.8
>>>>>>    Running on zlib version : 1.2.8
>>>>>>    Compression algorithms supported : identity("identity"),
>>>>>>    deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
>>>>>>    Built with network namespace support.
>>>>>>
>>>>>>    Available polling systems :
>>>>>>          epoll : pref=300, test result OK
>>>>>>           poll : pref=200, test result OK
>>>>>>         select : pref=150, test result OK
>>>>>>    Total: 3 (3 usable), will use epoll.
>>>>>>
>>>>>>    Available multiplexer protocols :
>>>>>>    (protocols marked as <default> cannot be specified using
>>>>>>    'proto' keyword)
>>>>>>             h2 : mode=HTTP      side=FE
>>>>>>      <default> : mode=TCP|HTTP  side=FE|BE
>>>>>>
>>>>>>    Available filters :
>>>>>>        [SPOE] spoe
>>>>>>        [COMP] compression
>>>>>>        [TRACE] trace
>>>>>>
>>>>>> 4. HAProxy results for #processes vs. #threads:
>>>>>>
>>>>>>    #    Threads-RPS    Procs-RPS
>>>>>>    1          20903        19280
>>>>>>    2          46400        51045
>>>>>>    4          96587       142801
>>>>>>    8         172224       254720
>>>>>>    12        210451       437488
>>>>>>    16        173034       437375
>>>>>>    24         79069       519367
>>>>>>    32         55607       586367
>>>>>>    48         31739       596148
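>>>>>>    (The sweep above can be scripted roughly as follows; this is
>>>>>>    a sketch only, and the config path and the sed edit of
>>>>>>    nbthread are assumptions about this setup:)
>>>>>>
>>>>>>        # For each thread count: set nbthread, restart haproxy,
>>>>>>        # run wrk, and print the Requests/sec figure it reports.
>>>>>>        for n in 1 2 4 8 12 16 24 32 48; do
>>>>>>            sed -i "s/^\( *nbthread\).*/\1 $n/" /etc/haproxy/haproxy.cfg
>>>>>>            systemctl restart haproxy
>>>>>>            wrk -c 4800 -t 48 -d 30s http://<IP>:80/128 |
>>>>>>                awk -v n="$n" '/Requests\/sec/ { print n, $2 }'
>>>>>>        done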
>>>>>> 5. Lock stats for 1.9-dev3: some write locks took a lot more time
>>>>>>    on average to acquire, e.g. "POOL" and "TASK_WQ". For 48
>>>>>>    threads, I get:
>>>>>>
>>>>>>    Lock       write lock    write unlock     wait (msec)   wait/lock (nsec)
>>>>>>    FD          143933900   143933895 (-5)      11370.245             78.996
>>>>>>    TASK_RQ       2062874      2062875 (1)       7820.234           3790.941
>>>>>>    TASK_WQ       2601227      2601227 (0)       5019.811           1929.786
>>>>>>    POOL          2823393      2823393 (0)      11984.706           4244.788
>>>>>>    LISTENER          184          184 (0)          0.011             60.554
>>>>>>    PROXY          291557       291557 (0)        109.694            376.235
>>>>>>    SERVER        1188511      1188511 (0)        854.171            718.690
>>>>>>    LBPRM         1184709      1184709 (0)        778.947            657.501
>>>>>>    BUF_WQ         669247       669247 (0)        252.265            376.939
>>>>>>    STRMS            9335         9335 (0)          0.910             97.492
>>>>>>    VARS           901947       901947 (0)        299.224            331.753
>>>>>>
>>>>>>    (Read lock/unlock counts and read wait times are 0 for every
>>>>>>    lock and are omitted above.)
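>>>>>>    (Sanity check: the wait/lock column is simply total wait
>>>>>>    divided by the lock count, e.g. for FD: 11370.245 msec * 1e6 /
>>>>>>    143933900 locks = ~79 nsec. A throwaway awk over the table
>>>>>>    above, saved as "lockstats.txt" (a hypothetical file name),
>>>>>>    recomputes the last column:)
>>>>>>
>>>>>>        # Data rows have 6 fields: name, write locks, write
>>>>>>        # unlocks, delta, wait in msec, wait/lock in nsec.
>>>>>>        awk 'NF == 6 && $2 ~ /^[0-9]+$/ {
>>>>>>                 printf "%-9s %10.3f nsec/lock\n", $1, $5 * 1e6 / $2
>>>>>>             }' lockstats.txt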
>>>>>> 6. CPU utilization after the test, processes vs. threads:
>>>>>>    haproxy-1.9-dev3 runs at 4800% (48 CPUs) for 30 seconds after
>>>>>>    the test is done. For 1.8.14, this behaviour was not seen. Ran
>>>>>>    the following command for both:
>>>>>>        ss -tnp | awk '{print $1}' | sort | uniq -c | sort -n
>>>>>>
>>>>>>    1.8.14 during the test:
>>>>>>         451 SYN-SENT
>>>>>>        9166 ESTAB
>>>>>>    1.8.14 after the test:
>>>>>>           2 ESTAB
>>>>>>
>>>>>>    1.9-dev3 during the test:
>>>>>>         109 SYN-SENT
>>>>>>        9400 ESTAB
>>>>>>    1.9-dev3 after the test:
>>>>>>        2185 CLOSE-WAIT
>>>>>>        2187 ESTAB
>>>>>>
>>>>>>    All connections that were in CLOSE-WAIT were from the client,
>>>>>>    while all connections in the ESTAB state were to the server.
>>>>>>    This lasted for 30 seconds. On the client system, all sockets
>>>>>>    were in the FIN-WAIT-2 state:
>>>>>>        2186 FIN-WAIT-2
>>>>>>    This (2185/2186) seems to imply that the client closed the
>>>>>>    connection but haproxy did not close its socket for 30 seconds.
>>>>>>    It also results in high CPU utilization on haproxy for some
>>>>>>    reason (100% for each process for 30 seconds), which is
>>>>>>    unexpected as the remote side has closed the socket.
>>>>>>
>>>>>> 7. Configuration file for process mode:
>>>>>>
>>>>>>    global
>>>>>>        daemon
>>>>>>        maxconn 26000
>>>>>>        nbproc 48
>>>>>>        stats socket /var/run/ha-1-admin.sock mode 600 level admin process 1
>>>>>>        # (and so on for 48 processes).
>>>>>>
>>>>>>    defaults
>>>>>>        option http-keep-alive
>>>>>>        balance leastconn
>>>>>>        retries 2
>>>>>>        option redispatch
>>>>>>        maxconn 25000
>>>>>>        option splice-response
>>>>>>        option tcp-smart-accept
>>>>>>        option tcp-smart-connect
>>>>>>        option splice-auto
>>>>>>        timeout connect 5000ms
>>>>>>        timeout client 30000ms
>>>>>>        timeout server 30000ms
>>>>>>        timeout client-fin 30000ms
>>>>>>        timeout http-request 10000ms
>>>>>>        timeout http-keep-alive 75000ms
>>>>>>        timeout queue 10000ms
>>>>>>        timeout tarpit 15000ms
>>>>>>
>>>>>>    frontend fk-fe-upgrade-80
>>>>>>        mode http
>>>>>>        default_backend fk-be-upgrade
>>>>>>        bind <VIP>:80 process 1
>>>>>>        # (and so on for 48 processes).
>>>>>>
>>>>>>    backend fk-be-upgrade
>>>>>>        mode http
>>>>>>        default-server maxconn 2000 slowstart
>>>>>>        # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>>
>>>>>> 8. Configuration file for thread mode:
>>>>>>
>>>>>>    global
>>>>>>        daemon
>>>>>>        maxconn 26000
>>>>>>        stats socket /var/run/ha-1-admin.sock mode 600 level admin
>>>>>>        nbproc 1
>>>>>>        nbthread 48
>>>>>>        # cpu-map auto:1/1-48 0-39
>>>>>>
>>>>>>    defaults
>>>>>>        option http-keep-alive
>>>>>>        balance leastconn
>>>>>>        retries 2
>>>>>>        option redispatch
>>>>>>        maxconn 25000
>>>>>>        option splice-response
>>>>>>        option tcp-smart-accept
>>>>>>        option tcp-smart-connect
>>>>>>        option splice-auto
>>>>>>        timeout connect 5000ms
>>>>>>        timeout client 30000ms
>>>>>>        timeout server 30000ms
>>>>>>        timeout client-fin 30000ms
>>>>>>        timeout http-request 10000ms
>>>>>>        timeout http-keep-alive 75000ms
>>>>>>        timeout queue 10000ms
>>>>>>        timeout tarpit 15000ms
>>>>>>
>>>>>>    frontend fk-fe-upgrade-80
>>>>>>        mode http
>>>>>>        bind <VIP>:80 process 1/1-48
>>>>>>        default_backend fk-be-upgrade
>>>>>>
>>>>>>    backend fk-be-upgrade
>>>>>>        mode http
>>>>>>        default-server maxconn 2000 slowstart
>>>>>>        # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>>
>>>>>> I had also captured 'perf' output for the system for threads vs.
>>>>>> processes; I can send it later if required.
>>>>>>
>>>>>> Thanks,
>>>>>> - Krishna
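>>>>>> P.S. One way to capture such a profile (illustrative; not
>>>>>> necessarily the exact commands used):
>>>>>>
>>>>>>     # Sample all CPUs with call graphs for the 30-second test
>>>>>>     # window, then print a flat summary of the hottest symbols.
>>>>>>     perf record -a -g -- sleep 30
>>>>>>     perf report --stdio | head -40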

