Hi Willy,
On 30-9-2018 at 7:56, Willy Tarreau wrote:
> On Sun, Sep 30, 2018 at 07:46:24AM +0200, Willy Tarreau wrote:
>> Well, at least it works fine on 1.8 and not on 1.9-dev3 so I think you
>> spotted a regression that we have to analyse. However, I'd like to merge
>> the fix before merging the regtest otherwise it will kill the reg-test
>> feature until we manage to get the issue fixed!
> By the way, could you please explain in simple words the issue you've
> noticed ? I tried to reverse the vtc file but I don't understand the
> details nor what it tries to achieve. When I'm running a simple test
> on a simple config, the CummConns always matches the CumReq, and when
> running this test I'm seeing random values there in the output, but I
> also see that they are retrieved before all connections are closed

But CurrConns is 0, so connections are (supposed to be?) closed? :
**** h1 0.0 CLI recv|CurrConns: 0
**** h1 0.0 CLI recv|CumConns: 27
**** h1 0.0 CLI recv|CumReq: 27

> , so
> I'm not even sure the test is correct :-/
> Thanks,
> Willy

What I'm trying to achieve is, well, testing for regressions that are
not yet known to exist on the current stable version.
So what this test does, in short:
It makes 4 clients simultaneously send a request to a threaded haproxy,
which in turn connects 10x from backend to frontend and then sends the
request to the s1 server. The intended purpose is to have several
connections set up and torn down as fast as haproxy can process them,
with a high probability of adding/removing items from lists/counters
from different threads, thus possibly exposing problems if some
lock/sync isn't done correctly. After firing a few requests it also
verifies the expected counts and, where possible, the results.
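
To make the "verifies the expected counts" step a bit more concrete, here
is a minimal Python sketch of that kind of check. It is not how the vtc
file does it; the function names parse_show_info and mismatches are made
up for illustration, and the only expected value used is CurrConns = 0,
since the correct cumulative counts are not stated here. The sample lines
are the ones from the log above.

```python
def parse_show_info(text):
    """Turn 'Name: value' lines into a dict, keeping only integer values.

    Also accepts lines carrying a varnishtest-style prefix such as
    '**** h1 0.0 CLI recv|CurrConns: 0' by dropping everything up to the
    last '|'.
    """
    info = {}
    for line in text.splitlines():
        payload = line.rpartition("|")[2]  # whole line if there is no '|'
        key, sep, value = payload.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            info[key.strip()] = int(value)
    return info


def mismatches(info, expected):
    """Return {name: (actual, expected)} for every counter that differs."""
    return {
        name: (info.get(name), want)
        for name, want in expected.items()
        if info.get(name) != want
    }


sample = """\
**** h1 0.0 CLI recv|CurrConns: 0
**** h1 0.0 CLI recv|CumConns: 27
**** h1 0.0 CLI recv|CumReq: 27
"""

info = parse_show_info(sample)
# Only CurrConns has a known target here: all connections should be closed.
print(mismatches(info, {"CurrConns": 0}))
```

An empty result means every listed counter matched; otherwise each entry
shows the actual value next to the expected one.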
History:
I've been bitten a few times with older releases by corruption occurring
inside the POST data when uploading large (500MB+) files to a server
behind haproxy. After a few megabytes were passed correctly, the
resulting file would differ from the original when compared, though the
upload 'seemed' to succeed. (This was then solved by installing a newer
haproxy build.) Also, sometimes threads have locked up or crashed
things, or the kqueue scheduler turned out to behave differently from
the others. I've been trying to test such things manually but found I
always forget to run some test. This is why I really like the concept of
having a set of defined tests that validate haproxy is working
'properly' on the OS I run it on. Also, when some issue I ran into gets
fixed I tend to run -dev builds on my production environment for a
while, and, well, it's nice to know that other functionality still works
as it used to.
When writing this test I initially started with the idea of
automatically testing a large file transfer through haproxy, but then
wondered where / how to store such a file, so I thought that
transferring a 'large' header with increasing size 'might' trigger a
similar condition. Though in hindsight that might not actually test the
same code paths.
The test I created, with 1 byte of growth in the header together with
4000 connections, didn't quite achieve that initial big-file simulation,
but I still thought it ended up being a nice test, so I submitted it a
while back ;). Anyhow, haproxy wasn't capable of doing much when dev2
was tagged, so I wasn't too worried that the test failed at that time.
And you announced dev2 as such as well, so that was okay. And perhaps
the issue found then would solve itself when further fixes on top of
dev2 were added ;).
Anyhow, with dev3 I hoped all regressions would be fixed, and found this
one still failed on 1.9-dev3. So I tuned the numbers in the previously
submitted regtest down a little to avoid conntrack/sysctl default
limits, while still failing the test 'reliably'. I'm not sure what
exactly is going on, or how bad it is that these numbers don't match up
anymore. Maybe it's only the counter that's not updated in a thread-safe
way; perhaps there is a bigger issue lurking with sync points and
whatnot? Either way the test should pass as I understand it: the 4
defined varnish clients got their answer back and CurrConns = 0. Also,
adding a 3 second delay between waiting for the clients and checking the
stats does not fix it. And as you've checked, with 1.8 it does pass.
Though that too could perhaps be a coincidence; maybe things are
processed even faster now, but in a different order, so the test fails
for the wrong reason?
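
As an illustration of the "counter not updated in a thread-safe way"
guess only, here is a small Python sketch of how a non-atomic increment
can lose an update. It is not haproxy code and says nothing about what
actually happens in 1.9-dev3; the function name and the two worker names
are hypothetical, and the interleaving is replayed from a fixed schedule
so the outcome is deterministic.

```python
def replay_increments(schedule):
    """Replay 'load'/'store' steps of workers that each do counter += 1.

    Each increment is split into a load of the shared counter into a
    worker-local value, followed by a store of that value plus one.
    """
    counter = 0
    loaded = {}
    for worker, step in schedule:
        if step == "load":
            loaded[worker] = counter
        elif step == "store":
            counter = loaded[worker] + 1
        else:
            raise ValueError(f"unknown step: {step}")
    return counter


# Each worker finishes its increment before the other one starts.
serialized = [("A", "load"), ("A", "store"), ("B", "load"), ("B", "store")]

# Both workers load the old value before either one stores.
overlapping = [("A", "load"), ("B", "load"), ("A", "store"), ("B", "store")]

print(replay_increments(serialized))   # 2
print(replay_increments(overlapping))  # 1, one increment is lost
```

If something like this were the cause, the cumulative counters would come
out lower than the number of connections really handled, and by a varying
amount from run to run.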
Hope that makes my thought process somewhat clear :).
Regards,
PiBa-NL (Pieter)