Thanks, will take a look!

On Thu, Oct 4, 2018 at 12:58 PM Илья Шипицин <chipits...@gmail.com> wrote:
> what I'm going to try (when I have some spare time) is sampling with
> google perftools
>
> https://github.com/gperftools/gperftools
>
> they are great at cpu profiling.
> you can try them yourself if you have the time/wish :)
>
> Thu, 4 Oct 2018 at 11:53, Krishna Kumar (Engineering)
> <krishna...@flipkart.com>:
>
>> 1. haproxy config: Same as given above (both processes and threads were
>>    given in the mail)
>> 2. nginx: default, no changes.
>> 3. sysctls: nothing set. All changes as described earlier (e.g.
>>    irqbalance, irq pinning, etc).
>> 4. nf_conntrack: disabled
>> 5. dmesg: no messages.
>>
>> With the same system and settings, threads give 18x lower RPS than
>> processes, along with the other 2 issues given in my mail today.
>>
>> On Thu, Oct 4, 2018 at 12:09 PM Илья Шипицин <chipits...@gmail.com>
>> wrote:
>>
>>> haproxy config, nginx config
>>> non-default sysctls (if any)
>>>
>>> as a side note, can you have a look at "dmesg" output? do you have nf
>>> conntrack enabled? what are its limits?
>>>
>>> Thu, 4 Oct 2018 at 9:59, Krishna Kumar (Engineering)
>>> <krishna...@flipkart.com>:
>>>
>>>> Sure.
>>>>
>>>> 1. Client: Use one of the following two setups:
>>>>    - a single baremetal (48 core, 40g) system.
>>>>      Run: "wrk -c 4800 -t 48 -d 30s http://<IP>:80/128", or,
>>>>    - 100 2-core VMs.
>>>>      Run "wrk -c 16 -t 2 -d 30s http://<IP>:80/128" from each VM
>>>>      and summarize the results using some parallel-ssh setup.
>>>>
>>>> 2. HAProxy running on a single baremetal (same system config as the
>>>>    client: 48 core, 40g, 4.17.13 kernel, irqs tuned to use different
>>>>    cores of the same NUMA node for each irq, irqbalance killed), with
>>>>    the haproxy configuration file as given in my first mail. Around 60
>>>>    backend servers are configured in haproxy.
>>>>
>>>> 3. Backend servers are 2-core VMs running nginx and serving a file
>>>>    called "/128", which is 128 bytes in size.
>>>>
>>>> Let me know if you need more information.
>>>> Thanks,
>>>> - Krishna
>>>>
>>>> On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин <chipits...@gmail.com>
>>>> wrote:
>>>>
>>>>> load testing is somewhat good.
>>>>> can you describe the overall setup? (I want to reproduce and play
>>>>> with it)
>>>>>
>>>>> Thu, 4 Oct 2018 at 8:16, Krishna Kumar (Engineering)
>>>>> <krishna...@flipkart.com>:
>>>>>
>>>>>> Re-sending in case this mail was missed. To summarise the 3 issues
>>>>>> seen:
>>>>>>
>>>>>> 1. Performance drops 18x with a higher number of nbthread as
>>>>>>    compared to nbproc.
>>>>>> 2. CPU utilisation remains at 100% for 30 seconds after wrk
>>>>>>    finishes (for 1.9-dev3, with both nbproc and nbthread).
>>>>>> 3. Sockets on the client remain in FIN-WAIT-2, while on HAProxy
>>>>>>    they remain in either CLOSE-WAIT (towards clients) or ESTAB
>>>>>>    (towards the backend servers), till the server/client timeout
>>>>>>    expires.
>>>>>>
>>>>>> The tests for threads and processes were done on the same systems,
>>>>>> so there is no difference in system parameters.
>>>>>>
>>>>>> Thanks,
>>>>>> - Krishna
>>>>>>
>>>>>> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering)
>>>>>> <krishna...@flipkart.com> wrote:
>>>>>>
>>>>>>> Hi Willy, and community developers,
>>>>>>>
>>>>>>> I am not sure if I am doing something wrong, but wanted to report
>>>>>>> some issues that I am seeing. Please let me know if this is a
>>>>>>> problem.
>>>>>>>
>>>>>>> 1. HAProxy system:
>>>>>>>    Kernel: 4.17.13
>>>>>>>    CPU: 48 core E5-2670 v3
>>>>>>>    Memory: 128GB
>>>>>>>    NIC: Mellanox 40g with IRQ pinning
>>>>>>>
>>>>>>> 2. Client: 48 core, similar to the server. Test command line:
>>>>>>>    wrk -c 4800 -t 48 -d 30s http://<IP>:80/128
>>>>>>>
>>>>>>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git
>>>>>>>    checkout as of Oct 2nd).
>>>>>>> # haproxy-git -vv
>>>>>>> HA-Proxy version 1.9-dev3 2018/09/29
>>>>>>> Copyright 2000-2018 Willy Tarreau <wi...@haproxy.org>
>>>>>>>
>>>>>>> Build options :
>>>>>>>   TARGET  = linux2628
>>>>>>>   CPU     = generic
>>>>>>>   CC      = gcc
>>>>>>>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
>>>>>>>             -fwrapv -fno-strict-overflow -Wno-unused-label
>>>>>>>             -Wno-sign-compare -Wno-unused-parameter
>>>>>>>             -Wno-old-style-declaration -Wno-ignored-qualifiers
>>>>>>>             -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
>>>>>>>   OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>>>>>>>
>>>>>>> Default settings :
>>>>>>>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
>>>>>>>
>>>>>>> Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>>>>>> Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>>>>>> OpenSSL library supports TLS extensions : yes
>>>>>>> OpenSSL library supports SNI : yes
>>>>>>> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
>>>>>>> Built with transparent proxy support using: IP_TRANSPARENT
>>>>>>>   IPV6_TRANSPARENT IP_FREEBIND
>>>>>>> Encrypted password support via crypt(3): yes
>>>>>>> Built with multi-threading support.
>>>>>>> Built with PCRE version : 8.38 2015-11-23
>>>>>>> Running on PCRE version : 8.38 2015-11-23
>>>>>>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>>>>>> Built with zlib version : 1.2.8
>>>>>>> Running on zlib version : 1.2.8
>>>>>>> Compression algorithms supported : identity("identity"),
>>>>>>>   deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
>>>>>>> Built with network namespace support.
>>>>>>>
>>>>>>> Available polling systems :
>>>>>>>       epoll : pref=300,  test result OK
>>>>>>>        poll : pref=200,  test result OK
>>>>>>>      select : pref=150,  test result OK
>>>>>>> Total: 3 (3 usable), will use epoll.
>>>>>>>
>>>>>>> Available multiplexer protocols :
>>>>>>> (protocols marked as <default> cannot be specified using 'proto' keyword)
>>>>>>>               h2 : mode=HTTP       side=FE
>>>>>>>        <default> : mode=TCP|HTTP   side=FE|BE
>>>>>>>
>>>>>>> Available filters :
>>>>>>>         [SPOE] spoe
>>>>>>>         [COMP] compression
>>>>>>>         [TRACE] trace
>>>>>>>
>>>>>>> 4. HAProxy results for #processes and #threads:
>>>>>>>    #    Threads-RPS   Procs-RPS
>>>>>>>    1        20903        19280
>>>>>>>    2        46400        51045
>>>>>>>    4        96587       142801
>>>>>>>    8       172224       254720
>>>>>>>    12      210451       437488
>>>>>>>    16      173034       437375
>>>>>>>    24       79069       519367
>>>>>>>    32       55607       586367
>>>>>>>    48       31739       596148
>>>>>>>
>>>>>>> 5. Lock stats for 1.9-dev3: Some write locks on average took a lot
>>>>>>>    more time to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads,
>>>>>>>    I get:
>>>>>>>    Stats about Lock FD:
>>>>>>>      # write lock  : 143933900
>>>>>>>      # write unlock: 143933895 (-5)
>>>>>>>      # wait time for write     : 11370.245 msec
>>>>>>>      # wait time for write/lock: 78.996 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock TASK_RQ:
>>>>>>>      # write lock  : 2062874
>>>>>>>      # write unlock: 2062875 (1)
>>>>>>>      # wait time for write     : 7820.234 msec
>>>>>>>      # wait time for write/lock: 3790.941 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock TASK_WQ:
>>>>>>>      # write lock  : 2601227
>>>>>>>      # write unlock: 2601227 (0)
>>>>>>>      # wait time for write     : 5019.811 msec
>>>>>>>      # wait time for write/lock: 1929.786 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock POOL:
>>>>>>>      # write lock  : 2823393
>>>>>>>      # write unlock: 2823393 (0)
>>>>>>>      # wait time for write     : 11984.706 msec
>>>>>>>      # wait time for write/lock: 4244.788 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock LISTENER:
>>>>>>>      # write lock  : 184
>>>>>>>      # write unlock: 184 (0)
>>>>>>>      # wait time for write     : 0.011 msec
>>>>>>>      # wait time for write/lock: 60.554 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock PROXY:
>>>>>>>      # write lock  : 291557
>>>>>>>      # write unlock: 291557 (0)
>>>>>>>      # wait time for write     : 109.694 msec
>>>>>>>      # wait time for write/lock: 376.235 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock SERVER:
>>>>>>>      # write lock  : 1188511
>>>>>>>      # write unlock: 1188511 (0)
>>>>>>>      # wait time for write     : 854.171 msec
>>>>>>>      # wait time for write/lock: 718.690 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock LBPRM:
>>>>>>>      # write lock  : 1184709
>>>>>>>      # write unlock: 1184709 (0)
>>>>>>>      # wait time for write     : 778.947 msec
>>>>>>>      # wait time for write/lock: 657.501 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock BUF_WQ:
>>>>>>>      # write lock  : 669247
>>>>>>>      # write unlock: 669247 (0)
>>>>>>>      # wait time for write     : 252.265 msec
>>>>>>>      # wait time for write/lock: 376.939 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock STRMS:
>>>>>>>      # write lock  : 9335
>>>>>>>      # write unlock: 9335 (0)
>>>>>>>      # wait time for write     : 0.910 msec
>>>>>>>      # wait time for write/lock: 97.492 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>    Stats about Lock VARS:
>>>>>>>      # write lock  : 901947
>>>>>>>      # write unlock: 901947 (0)
>>>>>>>      # wait time for write     : 299.224 msec
>>>>>>>      # wait time for write/lock: 331.753 nsec
>>>>>>>      # read lock   : 0
>>>>>>>      # read unlock : 0 (0)
>>>>>>>      # wait time for read      : 0.000 msec
>>>>>>>      # wait time for read/lock : 0.000 nsec
>>>>>>>
>>>>>>> 6. CPU utilization after test for processes/threads: haproxy-1.9-dev3
>>>>>>>    runs at 4800% (48 cpus) for 30 seconds after the test is done.
>>>>>>>    For 1.8.14, this behavior was not seen. Ran the following command
>>>>>>>    for both:
>>>>>>>      "ss -tnp | awk '{print $1}' | sort | uniq -c | sort -n"
>>>>>>>    1.8.14 during test:
>>>>>>>       451 SYN-SENT
>>>>>>>      9166 ESTAB
>>>>>>>    1.8.14 after test:
>>>>>>>         2 ESTAB
>>>>>>>
>>>>>>>    1.9-dev3 during test:
>>>>>>>       109 SYN-SENT
>>>>>>>      9400 ESTAB
>>>>>>>    1.9-dev3 after test:
>>>>>>>      2185 CLOSE-WAIT
>>>>>>>      2187 ESTAB
>>>>>>>    All connections that were in CLOSE-WAIT were from the client,
>>>>>>>    while all connections in the ESTAB state were to the server. This
>>>>>>>    lasted for 30 seconds. On the client system, all sockets were in
>>>>>>>    the FIN-WAIT-2 state:
>>>>>>>      2186 FIN-WAIT-2
>>>>>>>    This (2185/2186) seems to imply that the client closed the
>>>>>>>    connection but haproxy did not close the socket for 30 seconds.
>>>>>>>    This also results in high CPU utilization on haproxy for some
>>>>>>>    reason (100% for each process for 30 seconds), which is also
>>>>>>>    unexpected as the remote side has closed the socket.
>>>>>>>
>>>>>>> 7. Configuration file for process mode:
>>>>>>>    global
>>>>>>>        daemon
>>>>>>>        maxconn 26000
>>>>>>>        nbproc 48
>>>>>>>        stats socket /var/run/ha-1-admin.sock mode 600 level admin process 1
>>>>>>>        # (and so on for 48 processes).
>>>>>>>
>>>>>>>    defaults
>>>>>>>        option http-keep-alive
>>>>>>>        balance leastconn
>>>>>>>        retries 2
>>>>>>>        option redispatch
>>>>>>>        maxconn 25000
>>>>>>>        option splice-response
>>>>>>>        option tcp-smart-accept
>>>>>>>        option tcp-smart-connect
>>>>>>>        option splice-auto
>>>>>>>        timeout connect 5000ms
>>>>>>>        timeout client 30000ms
>>>>>>>        timeout server 30000ms
>>>>>>>        timeout client-fin 30000ms
>>>>>>>        timeout http-request 10000ms
>>>>>>>        timeout http-keep-alive 75000ms
>>>>>>>        timeout queue 10000ms
>>>>>>>        timeout tarpit 15000ms
>>>>>>>
>>>>>>>    frontend fk-fe-upgrade-80
>>>>>>>        mode http
>>>>>>>        default_backend fk-be-upgrade
>>>>>>>        bind <VIP>:80 process 1
>>>>>>>        # (and so on for 48 processes).
>>>>>>>
>>>>>>>    backend fk-be-upgrade
>>>>>>>        mode http
>>>>>>>        default-server maxconn 2000 slowstart
>>>>>>>        # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>>>
>>>>>>> 8. Configuration file for thread mode:
>>>>>>>    global
>>>>>>>        daemon
>>>>>>>        maxconn 26000
>>>>>>>        stats socket /var/run/ha-1-admin.sock mode 600 level admin
>>>>>>>        nbproc 1
>>>>>>>        nbthread 48
>>>>>>>        # cpu-map auto:1/1-48 0-39
>>>>>>>
>>>>>>>    defaults
>>>>>>>        option http-keep-alive
>>>>>>>        balance leastconn
>>>>>>>        retries 2
>>>>>>>        option redispatch
>>>>>>>        maxconn 25000
>>>>>>>        option splice-response
>>>>>>>        option tcp-smart-accept
>>>>>>>        option tcp-smart-connect
>>>>>>>        option splice-auto
>>>>>>>        timeout connect 5000ms
>>>>>>>        timeout client 30000ms
>>>>>>>        timeout server 30000ms
>>>>>>>        timeout client-fin 30000ms
>>>>>>>        timeout http-request 10000ms
>>>>>>>        timeout http-keep-alive 75000ms
>>>>>>>        timeout queue 10000ms
>>>>>>>        timeout tarpit 15000ms
>>>>>>>
>>>>>>>    frontend fk-fe-upgrade-80
>>>>>>>        mode http
>>>>>>>        bind <VIP>:80 process 1/1-48
>>>>>>>        default_backend fk-be-upgrade
>>>>>>>
>>>>>>>    backend fk-be-upgrade
>>>>>>>        mode http
>>>>>>>        default-server maxconn 2000 slowstart
>>>>>>>        # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>>>
>>>>>>> I had also captured 'perf' output for the system for threads vs
>>>>>>> processes; I can send it later if required.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> - Krishna
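
[Editor's sketch] For anyone reproducing the 100-VM variant of the client setup described above: the per-VM wrk results collected over parallel-ssh have to be summed to get the aggregate RPS. A minimal sketch (Python; `total_rps` is a hypothetical helper name, assuming only wrk's standard "Requests/sec:" summary line):

```python
import re

def total_rps(wrk_outputs):
    """Sum the 'Requests/sec:' figure across a list of wrk output
    texts, one per client VM, and return the aggregate RPS."""
    total = 0.0
    for text in wrk_outputs:
        m = re.search(r"Requests/sec:\s*([\d.]+)", text)
        if m:
            total += float(m.group(1))
    return total

# e.g. two VMs' outputs collected via parallel-ssh:
outputs = [
    "  2000 requests in 30.00s\nRequests/sec:   66.67\n",
    "  2100 requests in 30.00s\nRequests/sec:   70.00\n",
]
print(round(total_rps(outputs), 2))
```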
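
[Editor's sketch] On item 5 (lock stats): the figure that matters for contention is the average wait per acquisition, which the dump already prints as "wait time for write/lock". A throwaway parser (Python; `rank_lock_waits` is a made-up name, matched to the dump format shown in the thread) to rank the locks by that figure:

```python
import re

def rank_lock_waits(dump: str):
    """Parse 'Stats about Lock X:' blocks from haproxy's lock-stats
    dump and rank locks by average write-wait per acquisition (nsec),
    highest first. Field names follow the dump quoted above and may
    differ in other haproxy versions."""
    # re.split with a capturing group yields [prefix, name, body, ...]
    parts = re.split(r"Stats about Lock (\w+):", dump)[1:]
    results = []
    for name, body in zip(parts[0::2], parts[1::2]):
        m = re.search(r"wait time for write/lock:\s*([\d.]+)\s*nsec", body)
        if m:
            results.append((name, float(m.group(1))))
    return sorted(results, key=lambda t: t[1], reverse=True)

sample = """\
Stats about Lock POOL:
  # wait time for write/lock: 4244.788 nsec
Stats about Lock TASK_WQ:
  # wait time for write/lock: 1929.786 nsec
Stats about Lock FD:
  # wait time for write/lock: 78.996 nsec
"""
print(rank_lock_waits(sample))
```

On the figures in the thread, this ordering puts POOL and TASK_WQ at the top, matching the "took a lot more time to acquire" observation.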
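
[Editor's sketch] On item 6: the `ss -tnp | awk ...` pipeline can also be run offline against captured `ss` dumps, which makes it easier to diff the "during test" and "after test" snapshots. A sketch (Python; `count_states` is an assumed helper name):

```python
from collections import Counter

def count_states(ss_output: str) -> Counter:
    """Replicate `ss -tn | awk '{print $1}' | sort | uniq -c` on a
    captured ss dump: count TCP sockets per state, skipping the
    header line."""
    lines = ss_output.strip().splitlines()[1:]
    return Counter(line.split()[0] for line in lines if line.strip())

sample = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.1:80         10.0.0.2:51000
CLOSE-WAIT 0      0      10.0.0.1:80         10.0.0.3:51001
ESTAB      0      0      10.0.0.1:34000      10.0.1.1:80
"""
print(count_states(sample))
```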