Hi Willy,

Op 10-6-2019 om 11:09 schreef Willy Tarreau:
Hi Pieter,

On Sat, Jun 08, 2019 at 06:07:09AM +0200, Willy Tarreau wrote:
Hi Pieter,

On Fri, Jun 07, 2019 at 11:32:18PM +0200, PiBa-NL wrote:
Hi Willy,

After the commit "6ec902a MINOR: threads: serialize threads initialization"
however i have failing / slow health checks in the tls_health_checks.vtc
test. Before that the Layer7-OK takes 5ms after this commit the healthcheck
takes up to 75ms or even more.. It causes the 20ms connect/server timeouts
to also to fail the test fairly often but not always..
This is very strange, as the modification only involves threads startup.
Hmmm actually I'm starting to think about a possibility I need to verify.
I suspect it may happen that a thread manages to finish its initialization
before others request synchronzation, thus belives it's alone and starts.
I'm going to have a deeper look at this problem with this in mind. I didn't
notice the failed check here but I'll hammer it a bit more.
Sorry for the long silence, it was harder than I thought. So I never managed
to reproduce this typical issue, even by adding random delays here and there,
but I managed to see that some threads were starting the event loop before
others were done initializing, which will obviously result in issues such as
missed events that could result in what you observed.

I initially thought I could easily add a synchronization step using the
current two bit fields (and spent my whole week-end writing parallel
algorithms and revisiting all our locking mechanism just because of this).
After numerous failed attempts, I later figured that I needed to represent
more than 4 states per thread and that 2 bits are not enough. Bah... at
least I had fun time... Thus I added a new field and a simple function to
allow the code to start in synchronous steps. We now initialize one thread
at a time, then once they are all initialized we enable the listeners, and
once they are enabled, we start the pollers in all threads. It is pretty
obvious from the traces that it now does the right thing. However since I
couldn't reproduce the health check issue you were facing, I'm interested
in knowing if it's still present with the latest master, as it could also
uncover another issue.


Things certainly look better again now regarding this issue.

Running the test repeatedly, and manually looking over the results its pretty much as good as it was before. There seems to be a 1 ms increase in the check-duration, but maybe this is because of the moved initialization which on startup delays the first test a millisecond or something?

Below some test results that are based on manual observation and some in my head filtering of the console output.. (mistakes included ;) ) repeat 10 ./vt -v ./work/haproxy-*/reg-tests/checks/tls_health_checks.vtc | grep Layer7 | grep OK | grep WARNING
Commit-ID , min-max time for +-95% check durations , comment
e4d7c9d , 6 - 9 ms ,     all tests pass  (1 tests out of +- a hundred showed 29ms , none below 6ms and almost half of them show 7ms)
6ec902a , 11 - 150 ms ,  of the 12 tests that passed
e186161 , 5 - 8 ms ,     all tests pass  (1 test used 15 ms, more than half the tests show 5ms check duration the majority of the remainder show 6ms)

I'm not sure if this deserves further investigation at the moment, i think it does not. Thanks for spending your weekend on this :) that wasn't my intention.

PiBa-NL (Pieter)

Reply via email to