Hi Willy,

On 11-06-2019 at 11:37, Willy Tarreau wrote:
On Tue, Jun 11, 2019 at 09:06:46AM +0200, Willy Tarreau wrote:
I'd like you to give it a try in your environment to confirm whether or
not it does improve things. If so, I'll clean it up and merge it. I'm
also interested in any reproducer you could have, given that the made up
test case I did above doesn't even show anything alarming.
No need to waste your time anymore, I now found how to reproduce it with
this config:

     global
        stats socket /tmp/sock1 mode 666 level admin
        nbthread 64

     backend stopme
        timeout server  1s
        option tcp-check
        tcp-check send "debug dev exit\n"
        server cli unix@/tmp/sock1 check

Then I run it in loops bound to different CPU counts:

    $ time for i in {1..20}; do
         taskset -c 0,1,2,3 ./haproxy -db -f slow-init.cfg >/dev/null 2>&1
      done
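For readers following along: the reproducer works because each haproxy instance's health check connects to its own stats socket and sends "debug dev exit", so every process terminates almost immediately and the loop's total time is dominated by startup cost. A minimal self-contained sketch of the same timing approach, using /bin/true as a stand-in for a process that exits right away (illustrative only, not from the thread; `date +%s%N` assumes GNU date):

```shell
#!/bin/sh
# Time 20 short-lived process startups; with processes that exit immediately,
# the per-start overhead dominates the total, just as haproxy's init time
# dominates in the reproducer loop above.
start=$(date +%s%N)          # nanoseconds since epoch (GNU date extension)
for i in $(seq 1 20); do
  /bin/true                  # stand-in for an haproxy instance that exits at once
done
end=$(date +%s%N)
echo "total_ms=$(( (end - start) / 1000000 ))"
```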

With a single CPU, it can take up to 10 seconds to run the loop on
commits e186161 and e4d7c9d, while it takes 0.18 seconds with the patch.

With 4 CPUs like above, it takes 1.5s with e186161, 2.3s with e4d7c9d,
and 0.16s with the patch.

The tests I had run consisted of starting hundreds of thousands of
listeners to amplify the impact of the start time, but in the end
that just diluted the extra time in an already very long total run.
Running it in loops like above is quite close to what regtests do, and
explains why I couldn't spot the difference (e.g. a few hundred ms at
worst among tens of seconds).

Thus I'm merging the patch now (cleaned up already and tested as well
without threads).

Let's hope it's the last time :-)

Thanks,
Willy

Seems I kept you busy for another day... But the result is there; it looks 100% fixed to me :).

Running without nbthread, and as such using all 16 threads of the VM I'm using, I now get this:
2.0-dev7-ca3551f 2019/06/11
**   h1    0.732 WAIT4 pid=80796 status=0x0002 (user 0.055515 sys 0.039653)
**   h2    0.846 WAIT4 pid=80799 status=0x0002 (user 0.039039 sys 0.039039)
#    top  TEST ./test/tls_health_checks-org.vtc passed (0.848)

Also, when repeating the testcase 1000 times while running 10 of them in parallel, only 1 of them failed with a timeout:

    **** S1    0.280 syslog|<134>Jun 11 22:24:48 haproxy[88306]: ::1:63856 [11/Jun/2019:22:24:48.074] fe1/1: Timeout during SSL handshake

Given the really short timeouts in the testcase itself, I think this is an excellent result.
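For anyone wanting to repeat this kind of stress run: the exact invocation isn't shown in the thread, but the pattern is batches of 10 parallel runs with failures counted at the end. A hedged sketch (run_one is a placeholder; in the real test it would invoke the test harness on tls_health_checks-org.vtc):

```shell
#!/bin/sh
# Run batches of 10 jobs in parallel and count failures.
# run_one is a stand-in for a single test invocation (assumed, not from the thread).
run_one() {
  sleep 0.01    # placeholder for one test run; exit status 0 means "passed"
}
fail=0
batches=5       # the real run was 100 batches of 10 (1000 total)
for b in $(seq 1 "$batches"); do
  pids=""
  for j in $(seq 1 10); do
    run_one & pids="$pids $!"
  done
  for p in $pids; do
    wait "$p" || fail=$((fail + 1))   # count any non-zero exit as a failure
  done
done
echo "failures=$fail out of $((batches * 10)) runs"
```

Collecting the PIDs and waiting on each one individually (rather than a bare `wait`) is what lets the loop see each job's exit status.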

I'm considering this one fully fixed, thanks again.

Regards,
PiBa-NL (Pieter)
