Hi Willy,
On 11-6-2019 at 11:37, Willy Tarreau wrote:
On Tue, Jun 11, 2019 at 09:06:46AM +0200, Willy Tarreau wrote:
I'd like you to give it a try in your environment to confirm whether or
not it does improve things. If so, I'll clean it up and merge it. I'm
also interested in any reproducer you could have, given that the made up
test case I did above doesn't even show anything alarming.
No need to waste your time anymore, I now found how to reproduce it with
this config :
global
    stats socket /tmp/sock1 mode 666 level admin
    nbthread 64

backend stopme
    timeout server 1s
    option tcp-check
    tcp-check send "debug dev exit\n"
    server cli unix@/tmp/sock1 check
Then I run it in loops bound to different CPU counts:
$ time for i in {1..20}; do
      taskset -c 0,1,2,3 ./haproxy -db -f slow-init.cfg >/dev/null 2>&1
  done
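The same loop can be wrapped to compare CPU affinities side by side. A minimal standalone sketch follows; the real haproxy invocation is replaced by a placeholder run_haproxy (my addition, not from the mail) so the script runs without the binary or config:

```shell
#!/bin/sh
# Sketch: time the 20-iteration startup loop under different CPU sets.
# run_haproxy is a placeholder for the real command shown above:
#   taskset -c "$1" ./haproxy -db -f slow-init.cfg >/dev/null 2>&1
run_haproxy() { :; }

for cpus in 0 0,1,2,3; do
    echo "CPUs: $cpus"
    start=$(date +%s)
    for i in $(seq 1 20); do
        run_haproxy "$cpus"
    done
    end=$(date +%s)
    echo "elapsed: $((end - start))s"
done
```

Substituting the real taskset/haproxy command back in reproduces the per-CPU-count timings discussed below.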
With a single CPU, it can take up to 10 seconds to run the loop on
commits e186161 and e4d7c9d, while it takes 0.18 seconds with the patch.
With 4 CPUs as above, it takes 1.5s with e186161, 2.3s with e4d7c9d,
and 0.16 seconds with the patch.
The tests I had run consisted of starting hundreds of thousands of
listeners to amplify the impact of the start time, but in the end
that just diluted the extra time in an already very long run. Running
it in loops like above is quite close to what the regtests do, and explains
why I couldn't spot the difference (e.g. a few hundred ms at worst
among tens of seconds).
Thus I'm merging the patch now (cleaned up already and tested as well
without threads).
Let's hope it's the last time :-)
Thanks,
Willy
Seems I kept you busy for another day... But the result is there, it
looks 100% fixed to me :).
Running without nbthread, and as such using the 16 threads of the VM I'm
using, I now get this:
2.0-dev7-ca3551f 2019/06/11
** h1 0.732 WAIT4 pid=80796 status=0x0002 (user 0.055515 sys 0.039653)
** h2 0.846 WAIT4 pid=80799 status=0x0002 (user 0.039039 sys 0.039039)
# top TEST ./test/tls_health_checks-org.vtc passed (0.848)
Also, when repeating the testcase 1000 times while running 10 of them in
parallel, only 1 failed with a timeout:
**** S1 0.280 syslog|<134>Jun 11 22:24:48 haproxy[88306]:
::1:63856 [11/Jun/2019:22:24:48.074] fe1/1: Timeout during SSL handshake
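For reference, that kind of stress run can be sketched in plain shell as below; run_one is a placeholder (my addition) for the actual invocation of the .vtc regtest, and 10 workers x 100 runs gives the 1000 executions mentioned above:

```shell
#!/bin/sh
# Sketch: 10 parallel workers, each repeating the test 100 times
# (1000 executions total). run_one stands in for the real test
# command, e.g. running the tls_health_checks .vtc mentioned above.
run_one() {
    :    # replace with the actual test invocation
}

PARALLEL=10
REPEAT=100

for w in $(seq 1 "$PARALLEL"); do
    (
        for i in $(seq 1 "$REPEAT"); do
            run_one || echo "worker $w run $i FAILED"
        done
    ) &
done
wait
echo "all workers done"
```

Any failing iteration is reported individually, so a single timeout out of 1000 runs shows up as one FAILED line.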
I think that, combined with the really short timeouts in the testcase
itself, this is an excellent result.
I'm considering this one fully fixed, thanks again.
Regards,
PiBa-NL (Pieter)