Thanks, will take a look!

On Thu, Oct 4, 2018 at 12:58 PM Илья Шипицин <chipits...@gmail.com> wrote:

> What I am going to try (when I have some spare time) is sampling with
> Google perftools:
>
> https://github.com/gperftools/gperftools
>
> They are great at CPU profiling.
> You can try them yourself if you have the time and inclination :)
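>
> For reference, a minimal sketch of how haproxy could be run under the
> gperftools CPU profiler (the libprofiler path, config path and output
> location are assumptions and will differ by distribution; pprof may be
> installed as google-pprof):
>
>     # Run haproxy in the foreground with the sampling profiler preloaded.
>     LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so \
>         CPUPROFILE=/tmp/haproxy.prof haproxy -db -f /etc/haproxy/haproxy.cfg
>     # The profile is written when haproxy exits; render a text report:
>     pprof --text $(which haproxy) /tmp/haproxy.prof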
>
>
> Thu, 4 Oct 2018 at 11:53, Krishna Kumar (Engineering) <krishna...@flipkart.com>:
>
>> 1. haproxy config: same as given above (both the process and thread
>> configurations were included in that mail).
>> 2. nginx: default, no changes.
>> 3. sysctls: nothing set. All other changes are as described earlier
>> (irqbalance killed, IRQs pinned, etc.; a sketch follows below).
>> 4. nf_conntrack: disabled.
>> 5. dmesg: no messages.
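>>
>> IRQ pinning of the sort mentioned in (3) can be done roughly along
>> these lines (a sketch; the interface name "eth2" and the core
>> assignment are assumptions, not the exact script used):
>>
>>     # Stop irqbalance so it does not override manual affinity.
>>     systemctl stop irqbalance
>>     # Pin each NIC queue IRQ to its own core on the local NUMA node.
>>     cpu=0
>>     for irq in $(grep eth2 /proc/interrupts | awk -F: '{print $1}'); do
>>         echo $cpu > /proc/irq/$irq/smp_affinity_list
>>         cpu=$((cpu + 1))
>>     done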
>>
>> With the same system and settings, threads give 18x lower RPS than
>> processes, along with the other two issues described in my mail today.
>>
>>
>> On Thu, Oct 4, 2018 at 12:09 PM Илья Шипицин <chipits...@gmail.com>
>> wrote:
>>
>>> Please share the haproxy config, the nginx config, and any non-default
>>> sysctls.
>>>
>>> As a side note, can you have a look at the "dmesg" output? Do you have
>>> nf_conntrack enabled? What are its limits?
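>>>
>>> If conntrack is loaded, its limits can be checked with something like
>>> this (a sketch; the /proc entries only exist while the module is
>>> loaded):
>>>
>>>     # Is the conntrack module loaded at all?
>>>     lsmod | grep nf_conntrack
>>>     # Current entry count versus the configured ceiling.
>>>     cat /proc/sys/net/netfilter/nf_conntrack_count
>>>     sysctl net.netfilter.nf_conntrack_max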
>>>
>>> Thu, 4 Oct 2018 at 9:59, Krishna Kumar (Engineering) <krishna...@flipkart.com>:
>>>
>>>> Sure.
>>>>
>>>> 1. Client: one of the following two setups:
>>>>         - a single bare-metal (48-core, 40G) system, running
>>>>           "wrk -c 4800 -t 48 -d 30s http://<IP>:80/128", or
>>>>         - 100 2-core VMs, each running
>>>>           "wrk -c 16 -t 2 -d 30s http://<IP>:80/128", with the
>>>>           results summarized via a parallel-ssh setup (a sketch of
>>>>           which follows this list).
>>>>
>>>> 2. HAProxy running on a single bare-metal host with the same system
>>>>     config as the client (48 cores, 40G NIC, 4.17.13 kernel), each
>>>>     IRQ pinned to a different core of the same NUMA node, irqbalance
>>>>     killed, and the haproxy configuration file as given in my first
>>>>     mail. Around 60 backend servers are configured in haproxy.
>>>>
>>>> 3. Backend servers are 2-core VMs running nginx and serving a file
>>>>     called "/128", which is 128 bytes in size.
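>>>>
>>>> For the 100-VM case, a summarization along these lines would work
>>>> (a sketch; the hosts file, pssh invocation and awk aggregation are
>>>> illustrative assumptions, not the exact scripts used):
>>>>
>>>>     # vms.txt lists the 100 client VMs, one per line.
>>>>     parallel-ssh -h vms.txt -t 0 -i \
>>>>         'wrk -c 16 -t 2 -d 30s http://<IP>:80/128' | tee wrk.out
>>>>     # Add up the per-VM "Requests/sec" figures into one total RPS.
>>>>     grep 'Requests/sec' wrk.out | awk '{sum += $2} END {print sum}'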
>>>>
>>>> Let me know if you need more information.
>>>>
>>>> Thanks,
>>>> - Krishna
>>>>
>>>>
>>>> On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин <chipits...@gmail.com>
>>>> wrote:
>>>>
>>>>> Load testing like this is useful.
>>>>> Can you describe the overall setup? (I want to reproduce it and play
>>>>> with it.)
>>>>>
>>>>> Thu, 4 Oct 2018 at 8:16, Krishna Kumar (Engineering) <krishna...@flipkart.com>:
>>>>>
>>>>>> Re-sending in case this mail was missed. To summarise, the 3 issues
>>>>>> seen are:
>>>>>>
>>>>>> 1. At higher worker counts, nbthread gives 18x lower RPS than the
>>>>>>     same number of nbproc processes.
>>>>>> 2. CPU utilisation remains at 100% for 30 seconds after wrk finishes
>>>>>>     (on 1.9-dev3, for both nbproc and nbthread).
>>>>>> 3. Sockets on the client remain in FIN-WAIT-2, while on HAProxy they
>>>>>>     remain in either CLOSE-WAIT (towards the clients) or ESTAB
>>>>>>     (towards the backend servers) until the client/server timeout
>>>>>>     expires.
>>>>>>
>>>>>> The tests for threads and processes were done on the same systems,
>>>>>> so there is no difference in system parameters.
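>>>>>>
>>>>>> (For issue 3, the stuck states can be seen directly with ss state
>>>>>> filters; a sketch, run on the haproxy host and the client
>>>>>> respectively:)
>>>>>>
>>>>>>     # On the haproxy host: connections the client has half-closed.
>>>>>>     ss -tn state close-wait
>>>>>>     # On the client: connections still waiting for haproxy's FIN.
>>>>>>     ss -tn state fin-wait-2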
>>>>>>
>>>>>> Thanks,
>>>>>> - Krishna
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) <
>>>>>> krishna...@flipkart.com> wrote:
>>>>>>
>>>>>>> Hi Willy, and community developers,
>>>>>>>
>>>>>>> I am not sure if I am doing something wrong, but wanted to report
>>>>>>> some issues that I am seeing. Please let me know if this is a
>>>>>>> problem.
>>>>>>>
>>>>>>> 1. HAProxy system:
>>>>>>> Kernel: 4.17.13,
>>>>>>> CPU: 48 core E5-2670 v3
>>>>>>> Memory: 128GB memory
>>>>>>> NIC: Mellanox 40g with IRQ pinning
>>>>>>>
>>>>>>> 2. Client: 48 cores, similar to the server. Test command line:
>>>>>>>     wrk -c 4800 -t 48 -d 30s http://<IP>:80/128
>>>>>>>
>>>>>>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git
>>>>>>>     checkout as of Oct 2nd).
>>>>>>> # haproxy-git -vv
>>>>>>> HA-Proxy version 1.9-dev3 2018/09/29
>>>>>>> Copyright 2000-2018 Willy Tarreau <wi...@haproxy.org>
>>>>>>>
>>>>>>> Build options :
>>>>>>>   TARGET  = linux2628
>>>>>>>   CPU     = generic
>>>>>>>   CC      = gcc
>>>>>>>   CFLAGS  = -O2 -g -fno-strict-aliasing
>>>>>>> -Wdeclaration-after-statement -fwrapv -fno-strict-overflow
>>>>>>> -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
>>>>>>> -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered
>>>>>>> -Wno-missing-field-initializers -Wtype-limits
>>>>>>>   OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>>>>>>>
>>>>>>> Default settings :
>>>>>>>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents
>>>>>>> = 200
>>>>>>>
>>>>>>> Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>>>>>> Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>>>>>> OpenSSL library supports TLS extensions : yes
>>>>>>> OpenSSL library supports SNI : yes
>>>>>>> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
>>>>>>> Built with transparent proxy support using: IP_TRANSPARENT
>>>>>>> IPV6_TRANSPARENT IP_FREEBIND
>>>>>>> Encrypted password support via crypt(3): yes
>>>>>>> Built with multi-threading support.
>>>>>>> Built with PCRE version : 8.38 2015-11-23
>>>>>>> Running on PCRE version : 8.38 2015-11-23
>>>>>>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>>>>>> Built with zlib version : 1.2.8
>>>>>>> Running on zlib version : 1.2.8
>>>>>>> Compression algorithms supported : identity("identity"),
>>>>>>> deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
>>>>>>> Built with network namespace support.
>>>>>>>
>>>>>>> Available polling systems :
>>>>>>>       epoll : pref=300,  test result OK
>>>>>>>        poll : pref=200,  test result OK
>>>>>>>      select : pref=150,  test result OK
>>>>>>> Total: 3 (3 usable), will use epoll.
>>>>>>>
>>>>>>> Available multiplexer protocols :
>>>>>>> (protocols markes as <default> cannot be specified using 'proto'
>>>>>>> keyword)
>>>>>>>               h2 : mode=HTTP       side=FE
>>>>>>>        <default> : mode=TCP|HTTP   side=FE|BE
>>>>>>>
>>>>>>> Available filters :
>>>>>>> [SPOE] spoe
>>>>>>> [COMP] compression
>>>>>>> [TRACE] trace
>>>>>>>
>>>>>>> 4. HAProxy results (RPS) for #processes and #threads:
>>>>>>>      #    Threads-RPS    Procs-RPS
>>>>>>>      1          20903        19280
>>>>>>>      2          46400        51045
>>>>>>>      4          96587       142801
>>>>>>>      8         172224       254720
>>>>>>>     12         210451       437488
>>>>>>>     16         173034       437375
>>>>>>>     24          79069       519367
>>>>>>>     32          55607       586367
>>>>>>>     48          31739       596148
>>>>>>>
>>>>>>> 5. Lock stats for 1.9-dev3: some write locks on average took a lot
>>>>>>>     more time to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads,
>>>>>>>     I get:
>>>>>>> Stats about Lock FD:
>>>>>>> # write lock  : 143933900
>>>>>>> # write unlock: 143933895 (-5)
>>>>>>> # wait time for write     : 11370.245 msec
>>>>>>> # wait time for write/lock: 78.996 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock TASK_RQ:
>>>>>>> # write lock  : 2062874
>>>>>>> # write unlock: 2062875 (1)
>>>>>>> # wait time for write     : 7820.234 msec
>>>>>>> # wait time for write/lock: 3790.941 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock TASK_WQ:
>>>>>>> # write lock  : 2601227
>>>>>>> # write unlock: 2601227 (0)
>>>>>>> # wait time for write     : 5019.811 msec
>>>>>>> # wait time for write/lock: 1929.786 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock POOL:
>>>>>>> # write lock  : 2823393
>>>>>>> # write unlock: 2823393 (0)
>>>>>>> # wait time for write     : 11984.706 msec
>>>>>>> # wait time for write/lock: 4244.788 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock LISTENER:
>>>>>>> # write lock  : 184
>>>>>>> # write unlock: 184 (0)
>>>>>>> # wait time for write     : 0.011 msec
>>>>>>> # wait time for write/lock: 60.554 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock PROXY:
>>>>>>> # write lock  : 291557
>>>>>>> # write unlock: 291557 (0)
>>>>>>> # wait time for write     : 109.694 msec
>>>>>>> # wait time for write/lock: 376.235 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock SERVER:
>>>>>>> # write lock  : 1188511
>>>>>>> # write unlock: 1188511 (0)
>>>>>>> # wait time for write     : 854.171 msec
>>>>>>> # wait time for write/lock: 718.690 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock LBPRM:
>>>>>>> # write lock  : 1184709
>>>>>>> # write unlock: 1184709 (0)
>>>>>>> # wait time for write     : 778.947 msec
>>>>>>> # wait time for write/lock: 657.501 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock BUF_WQ:
>>>>>>> # write lock  : 669247
>>>>>>> # write unlock: 669247 (0)
>>>>>>> # wait time for write     : 252.265 msec
>>>>>>> # wait time for write/lock: 376.939 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock STRMS:
>>>>>>> # write lock  : 9335
>>>>>>> # write unlock: 9335 (0)
>>>>>>> # wait time for write     : 0.910 msec
>>>>>>> # wait time for write/lock: 97.492 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>> Stats about Lock VARS:
>>>>>>> # write lock  : 901947
>>>>>>> # write unlock: 901947 (0)
>>>>>>> # wait time for write     : 299.224 msec
>>>>>>> # wait time for write/lock: 331.753 nsec
>>>>>>> # read lock   : 0
>>>>>>> # read unlock : 0 (0)
>>>>>>> # wait time for read      : 0.000 msec
>>>>>>> # wait time for read/lock : 0.000 nsec
>>>>>>>
>>>>>>> 6. CPU utilization after the test, for processes and threads:
>>>>>>>     haproxy-1.9-dev3 runs at 4800% (48 CPUs) for 30 seconds after
>>>>>>>     the test is done. This behavior was not seen with 1.8.14. I ran
>>>>>>>     the following command for both versions:
>>>>>>>     "ss -tnp | awk '{print $1}' | sort | uniq -c | sort -n"
>>>>>>>     1.8.14 during test:
>>>>>>>          451 SYN-SENT
>>>>>>>         9166 ESTAB
>>>>>>>     1.8.14 after test:
>>>>>>>            2 ESTAB
>>>>>>>
>>>>>>>     1.9-dev3 during test:
>>>>>>>          109 SYN-SENT
>>>>>>>         9400 ESTAB
>>>>>>>     1.9-dev3 after test:
>>>>>>>         2185 CLOSE-WAIT
>>>>>>>         2187 ESTAB
>>>>>>>     All connections in CLOSE-WAIT were from the client, while all
>>>>>>>     connections in the ESTAB state were to the servers; this lasted
>>>>>>>     for 30 seconds. On the client system, all sockets were in the
>>>>>>>     FIN-WAIT-2 state:
>>>>>>>         2186 FIN-WAIT-2
>>>>>>>     This (2185/2186) seems to imply that the client closed the
>>>>>>>     connection but haproxy did not close its socket for 30 seconds.
>>>>>>>     It also results in high CPU utilization on haproxy for some
>>>>>>>     reason (100% for each process for 30 seconds), which is
>>>>>>>     unexpected as the remote side has already closed the socket.
>>>>>>>
>>>>>>> 7. Configuration file for process mode:
>>>>>>> global
>>>>>>>     daemon
>>>>>>>     maxconn 26000
>>>>>>>     nbproc 48
>>>>>>>     stats socket /var/run/ha-1-admin.sock mode 600 level admin process 1
>>>>>>>     # (and so on for 48 processes).
>>>>>>>
>>>>>>> defaults
>>>>>>>     option http-keep-alive
>>>>>>>     balance leastconn
>>>>>>>     retries 2
>>>>>>>     option redispatch
>>>>>>>     maxconn 25000
>>>>>>>     option splice-response
>>>>>>>     option tcp-smart-accept
>>>>>>>     option tcp-smart-connect
>>>>>>>     option splice-auto
>>>>>>>     timeout connect 5000ms
>>>>>>>     timeout client 30000ms
>>>>>>>     timeout server 30000ms
>>>>>>>     timeout client-fin 30000ms
>>>>>>>     timeout http-request 10000ms
>>>>>>>     timeout http-keep-alive 75000ms
>>>>>>>     timeout queue 10000ms
>>>>>>>     timeout tarpit 15000ms
>>>>>>>
>>>>>>> frontend fk-fe-upgrade-80
>>>>>>>     mode http
>>>>>>>     default_backend fk-be-upgrade
>>>>>>>     bind <VIP>:80 process 1
>>>>>>>     # (and so on for 48 processes).
>>>>>>>
>>>>>>> backend fk-be-upgrade
>>>>>>>     mode http
>>>>>>>     default-server maxconn 2000 slowstart
>>>>>>>     # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>>>
>>>>>>> 8. Configuration file for thread mode:
>>>>>>> global
>>>>>>>     daemon
>>>>>>>     maxconn 26000
>>>>>>>     stats socket /var/run/ha-1-admin.sock mode 600 level admin
>>>>>>>     nbproc 1
>>>>>>>     nbthread 48
>>>>>>>     # cpu-map auto:1/1-48 0-39
>>>>>>>
>>>>>>> defaults
>>>>>>>     option http-keep-alive
>>>>>>>     balance leastconn
>>>>>>>     retries 2
>>>>>>>     option redispatch
>>>>>>>     maxconn 25000
>>>>>>>     option splice-response
>>>>>>>     option tcp-smart-accept
>>>>>>>     option tcp-smart-connect
>>>>>>>     option splice-auto
>>>>>>>     timeout connect 5000ms
>>>>>>>     timeout client 30000ms
>>>>>>>     timeout server 30000ms
>>>>>>>     timeout client-fin 30000ms
>>>>>>>     timeout http-request 10000ms
>>>>>>>     timeout http-keep-alive 75000ms
>>>>>>>     timeout queue 10000ms
>>>>>>>     timeout tarpit 15000ms
>>>>>>>
>>>>>>> frontend fk-fe-upgrade-80
>>>>>>>     mode http
>>>>>>>     bind <VIP>:80 process 1/1-48
>>>>>>>     default_backend fk-be-upgrade
>>>>>>>
>>>>>>> backend fk-be-upgrade
>>>>>>>     mode http
>>>>>>>     default-server maxconn 2000 slowstart
>>>>>>>     # 58 server lines follow, e.g.: "server <name> <ip:80>"
>>>>>>>
>>>>>>> I have also captured 'perf' output for the system, for threads vs.
>>>>>>> processes, and can send it later if required.
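>>>>>>>
>>>>>>> (For anyone who wants to reproduce the capture, a minimal sketch;
>>>>>>> the 30-second window and output path are assumptions, not the exact
>>>>>>> command line used:)
>>>>>>>
>>>>>>>     # System-wide sampling with call graphs while wrk is running.
>>>>>>>     perf record -a -g -o /tmp/haproxy.perf.data -- sleep 30
>>>>>>>     # Summarize the hottest functions afterwards.
>>>>>>>     perf report -i /tmp/haproxy.perf.data --stdio | head -50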
>>>>>>>
>>>>>>> Thanks,
>>>>>>> - Krishna
>>>>>>>
>>>>>>
