Re: [PATCH] DOC: clarify force-private-cache is an option

2018-09-30 Thread Willy Tarreau
On Mon, Oct 01, 2018 at 02:00:16AM +0200, Lukas Tribus wrote:
> "boolean" may confuse users into thinking they need to provide
> additional arguments, like false or true. This is a simple option
> like many others, so let's not confuse the users with internals.
> 
> Also fixes an additional typo.
> 
> Should be backported to 1.8 and 1.7.

Applied, thank you Lukas.

Willy



[PATCH] DOC: clarify force-private-cache is an option

2018-09-30 Thread Lukas Tribus
"boolean" may confuse users into thinking they need to provide
additional arguments, like false or true. This is a simple option
like many others, so let's not confuse the users with internals.

Also fixes an additional typo.

Should be backported to 1.8 and 1.7.
---
 doc/configuration.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index 336ef1f..d890b0b 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -1660,7 +1660,7 @@ tune.ssl.cachesize <number>
   this value to 0 disables the SSL session cache.
 
 tune.ssl.force-private-cache
-  This boolean disables SSL session cache sharing between all processes. It
+  This option disables SSL session cache sharing between all processes. It
   should normally not be used since it will force many renegotiations due to
   clients hitting a random process. But it may be required on some operating
   systems where none of the SSL cache synchronization method may be used. In
@@ -6592,7 +6592,7 @@ option smtpchk <hello> <domain>
                                 yes   |    no    |   yes  |   yes
   Arguments :
     <hello>   is an optional argument. It is the "hello" command to use. It can
-              be either "HELO" (for SMTP) or "EHLO" (for ESTMP). All other
+              be either "HELO" (for SMTP) or "EHLO" (for ESMTP). All other
               values will be turned into the default command ("HELO").

     <domain>  is the domain name to present to the server. It may only be
-- 
2.7.4



Re: [PATCH] REGTEST/MINOR: loadtest: add a test for connection counters

2018-09-30 Thread PiBa-NL

Hi Willy,
On 30-9-2018 at 20:38, Willy Tarreau wrote:

On Sun, Sep 30, 2018 at 08:22:23PM +0200, Willy Tarreau wrote:

On Sun, Sep 30, 2018 at 07:59:34PM +0200, PiBa-NL wrote:

Indeed it works with 1.8, so in that regard I 'think' the test itself is
correct.. Also when disabling threads, or running only 1 client, it still
works.. Then both CumConns and CumReq show 11 for the first stats result.

Hmmm for me it fails even without threads. That was the first thing I
tried when meeting the error in fact. But I need to dig deeper.

So I'm seeing that in fact the count is correct if the server connection
closes first, and wrong otherwise. In fact it fails similarly both for
1.6, 1.7, 1.8 and 1.9 with and without threads. I'm seeing that the
connection count is exactly 10 times the incoming connections while the
request count is exactly 20 times this count. I suspect that what happens
is that the request count is increased on each connection when preparing
to receive a new request. This even slightly reminds me of something but
I don't know where I noticed something like this, I think I saw this
when reviewing the changes needed to be made to HTTP for the native
internal representation.

So I think it's a minor bug, but not a regression.

Thanks,
Willy


Not sure; the only difference between 100x FAILED and 100x OK here is the 
version. Command executed and results below.


Perhaps that's just because of the OS / scheduler used though; I assume 
you're using some Linux distro to test with, so perhaps that explains part of 
the differences between your results and mine.. In the end it doesn't 
matter much whether it's a bug or a regression, it still needs a fix ;). And well, 
I don't know if it's just the counter that's wrong, or whether there might be 
bigger consequences somewhere. If it's just the counter then I guess it 
wouldn't hurt much to postpone a fix to a next (dev?) version.


Regards,

PiBa-NL (Pieter)

root@freebsd11:/usr/ports/net/haproxy-devel # varnishtest -q -n 100 -j 16 -k ./haproxy_test_OK_20180831/loadtest/b0-loadtest.vtc

...
#    top  TEST ./haproxy_test_OK_20180831/loadtest/b0-loadtest.vtc FAILED (0.128) exit=2
#    top  TEST ./haproxy_test_OK_20180831/loadtest/b0-loadtest.vtc FAILED (0.135) exit=2

100 tests failed, 0 tests skipped, 0 tests passed
root@freebsd11:/usr/ports/net/haproxy-devel # haproxy -v
HA-Proxy version 1.9-dev3-27010f0 2018/09/29
Copyright 2000-2018 Willy Tarreau 

root@freebsd11:/usr/ports/net/haproxy-devel # pkg add -f haproxy-1.8.14-selfbuild-reg-tests-OK.txz

Installing haproxy-1.8...
package haproxy is already installed, forced install
Extracting haproxy-1.8: 100%
root@freebsd11:/usr/ports/net/haproxy-devel # varnishtest -q -n 100 -j 16 -k ./haproxy_test_OK_20180831/loadtest/b0-loadtest.vtc

0 tests failed, 0 tests skipped, 100 tests passed
root@freebsd11:/usr/ports/net/haproxy-devel # haproxy -v
HA-Proxy version 1.8.14-52e4d43 2018/09/20
Copyright 2000-2018 Willy Tarreau 






Re: [PATCH] REGTEST/MINOR: loadtest: add a test for connection counters

2018-09-30 Thread Willy Tarreau
On Sun, Sep 30, 2018 at 08:22:23PM +0200, Willy Tarreau wrote:
> On Sun, Sep 30, 2018 at 07:59:34PM +0200, PiBa-NL wrote:
> > Indeed it works with 1.8, so in that regard I 'think' the test itself is
> > correct.. Also when disabling threads, or running only 1 client, it still
> > works.. Then both CumConns and CumReq show 11 for the first stats result.
> 
> Hmmm for me it fails even without threads. That was the first thing I
> tried when meeting the error in fact. But I need to dig deeper.

So I'm seeing that in fact the count is correct if the server connection
closes first, and wrong otherwise. In fact it fails similarly both for
1.6, 1.7, 1.8 and 1.9 with and without threads. I'm seeing that the
connection count is exactly 10 times the incoming connections while the
request count is exactly 20 times this count. I suspect that what happens
is that the request count is increased on each connection when preparing
to receive a new request. This even slightly reminds me of something but
I don't know where I noticed something like this, I think I saw this
when reviewing the changes needed to be made to HTTP for the native
internal representation.

So I think it's a minor bug, but not a regression.

Thanks,
Willy



Re: [PATCH] REGTEST/MINOR: loadtest: add a test for connection counters

2018-09-30 Thread Willy Tarreau
On Sun, Sep 30, 2018 at 07:59:34PM +0200, PiBa-NL wrote:
> Indeed it works with 1.8, so in that regard I 'think' the test itself is
> correct.. Also when disabling threads, or running only 1 client, it still
> works.. Then both CumConns and CumReq show 11 for the first stats result.

Hmmm for me it fails even without threads. That was the first thing I
tried when meeting the error in fact. But I need to dig deeper.

> > However, I'd like to merge
> > the fix before merging the regtest otherwise it will kill the reg-test
> > feature until we manage to get the issue fixed!
> I'm not fully sure I agree on that.. While I understand that failing
> reg-tests can be a pita while developing (if you run them regularly), the fact
> is that currently existing tests can already start to fail after
> some major redesign of the code. A few mails back (different mailthread) I
> tested like 10 commits in a row and they all suffered from different failing
> tests; that would imho not be a reason to remove those tests, and they didn't
> stop development.

The reason is that for now we have no way to let the tests fail gracefully
and report what is OK and what is not. So any error that's in the way will
lead to an absolutely certain behaviour from everyone : nobody will run the
tests anymore since the result will be known.

Don't get me wrong, I'm willing to get as many tests as we can, but 1) we
have to be sure these tests only fail for regressions and not for other
reasons, and 2) we must be sure that these tests do not prevent other ones
from being run nor make it impossible to observe the progress on other
ones. We're still at the beginning with reg tests, and as you can see we
have not even yet sorted out the requirements for some of them like threads
or Lua or whatever else.

I'm just asking that we don't create tests faster than we can sort them
out, that's all. This probably means that we really have to work on these
two main areas which are test prerequisites and synthetic reports of what
worked and what failed.

Ideas and proposals on this are welcome, but to be honest I can't spend
as much time as I'd want on this for now given how late we are on all
what remains to be done, so I really welcome discussions and help on the
subject between the various actors.

Thanks,
Willy



Re: [PATCH] REGTEST/MINOR: loadtest: add a test for connection counters

2018-09-30 Thread Willy Tarreau
On Sun, Sep 30, 2018 at 07:15:59PM +0200, PiBa-NL wrote:
> > on a simple config, the CumConns always matches the CumReq, and when
> > running this test I'm seeing random values there in the output, but I
> > also see that they are retrieved before all connections are closed
> But CurrConns is 0, so connections are (supposed to be?) closed? :
> 
>  h1    0.0 CLI recv|CurrConns: 0
>  h1    0.0 CLI recv|CumConns: 27
>  h1    0.0 CLI recv|CumReq: 27

You're totally right, I think I confused CumConns and CurrConns when
looking at the output. With that said I have no idea what's going on,
I'll have another look.

Thanks,
Willy



Re: [PATCH] REGTEST/MINOR: loadtest: add a test for connection counters

2018-09-30 Thread PiBa-NL

Hi Willy,

On 30-9-2018 at 7:46, Willy Tarreau wrote:

Hi Pieter,

On Sun, Sep 30, 2018 at 12:05:14AM +0200, PiBa-NL wrote:

Hi Willy,

I thought let's give those reg-tests another try :) as it's easy to run and
dev3 just came out.
All tests pass on my FreeBSD system, except this one, new reg-test attached.

Pretty much the same test as previously sent, but now with only 4 x 10
connections, which should be fine for conntrack and sysctls (I hope..). It
seems those stats numbers are 'off', or is my expected value not as fixed as
I thought it would be?

Well, at least it works fine on 1.8 and not on 1.9-dev3 so I think you
spotted a regression that we have to analyse.
Indeed it works with 1.8, so in that regard I 'think' the test itself is 
correct.. Also when disabling threads, or running only 1 client, it 
still works.. Then both CumConns and CumReq show 11 for the first stats result.

However, I'd like to merge
the fix before merging the regtest otherwise it will kill the reg-test
feature until we manage to get the issue fixed!
I'm not fully sure I agree on that.. While I understand that failing 
reg-tests can be a pita while developing (if you run them regularly), the 
fact is that currently existing tests can already start to fail 
after some major redesign of the code. A few mails back (different 
mailthread) I tested like 10 commits in a row and they all suffered from 
different failing tests; that would imho not be a reason to remove those 
tests, and they didn't stop development.

I'm also seeing that you rely on threads, I think I noticed another test
involving threads. Probably that we should have a specific directory for
these ones that we can disable completely when threads are not enabled,
otherwise this will also destroy tests (and make them extremely slow due
to varnishtest waiting for the timeout if haproxy refuses to parse the
config).
A specific directory will imho not work. How should it be called? 
/threaded_lua_with_ssl_using_kqueue_scheduler_on_freebsd_without_absn_for_haproxy_1.9_and_higher/ 
?
Having varnishtest fail while waiting for a feature that was not 
compiled is indeed undesirable as well. So some 'smart' way of defining 
'requirements' for a test will be needed so they can gracefully skip if 
not applicable.. I'm not sure myself how that should look though.. 
On one side I think the .vtc itself might be the place to define what 
requirements it has; on the other, a separate list/script 
including the logic of which tests to run could be nice.. But then who is 
going to maintain that one?
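
For example (purely a sketch of the idea, nothing like this exists today), a
tiny wrapper could read a hypothetical "#REQUIRE_OPTIONS:" comment from the
.vtc and compare it against the local build before running it:

  #!/bin/sh
  # Hypothetical convention: a .vtc declares "#REQUIRE_OPTIONS: OPENSSL LUA"
  # and we skip it when the local haproxy was not built with those options
  # (checked against the OPTIONS line of "haproxy -vv").
  vtc="$1"
  required=$(sed -n 's/^#REQUIRE_OPTIONS: *//p' "$vtc")
  for opt in $required; do
      haproxy -vv | grep -q "USE_${opt}=1" || {
          echo "SKIP $vtc (haproxy built without USE_$opt)"; exit 0; }
  done
  exec varnishtest -k "$vtc"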

I think that we should think a bit forward based on these tests. We must
not let varnishtest stop on the first error but rather just log it.

varnishtest can continue on error with -k.
I'm using this little mytest.sh script at the moment; it runs all tests 
and only the failed tests produce a lot of logging:

  haproxy -v
  varnishtest -j 16 -k -t 20 ./work/haproxy-*/reg-tests/*/*.vtc > ./mytest-result.log 2>&1
  varnishtest -j 16 -k -t 20 ./haproxy_test_OK_20180831/*/*.vtc >> ./mytest-result.log 2>&1

  cat ./mytest-result.log
  echo "" >> ./mytest-result.log
  haproxy -vv  >> ./mytest-result.log

There is also the -q parameter, but then it no longer logs which 
tests passed; only the failed tests will produce 1 log line.. 
(I do like to log which tests were executed though..)

  Then
at the end we could produce a report of successes and failures that would
be easy to diff from the previous (or expected) one. That will be
particularly useful when running the tests on older releases. As an
example, I had to run your test manually on 1.8 because for I-don't-know-
what-reason, the one about the proxy protocol now fails while it used to
work fine last week for the 1.8.14 release. That's a shame that we can't
complete tests just because one randomly fails.
You can continue the tests ( -k ). But better write it out to a logfile 
then, or perhaps combine with -l which leaves the /tmp/.vtc folder..
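
Something along these lines (only a rough sketch, and assuming varnishtest's
usual "#  top  TEST ... passed/FAILED" result lines) could already turn the -k
output into a summary that is easy to diff against a previous run:

  varnishtest -j 16 -k -t 20 ./reg-tests/*/*.vtc 2>&1 \
      | tee ./mytest-result.log \
      | grep -E '^#[[:space:]]+top[[:space:]]+TEST .* (passed|FAILED)' \
      | sed -E 's/ \([0-9.]+\).*$//' | sort > ./mytest-summary.log
  # compare with the summary kept from the previous (or expected) run
  diff -u ./mytest-summary.prev ./mytest-summary.log

The sed strips the timings and the sort makes the -j ordering deterministic,
so only real pass/fail changes show up in the diff.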

Thanks,
Willy


Regards,
PiBa-NL (Pieter)




Re: [PATCH] REGTEST/MINOR: loadtest: add a test for connection counters

2018-09-30 Thread PiBa-NL

Hi Willy,

On 30-9-2018 at 7:56, Willy Tarreau wrote:

On Sun, Sep 30, 2018 at 07:46:24AM +0200, Willy Tarreau wrote:

Well, at least it works fine on 1.8 and not on 1.9-dev3 so I think you
spotted a regression that we have to analyse. However, I'd like to merge
the fix before merging the regtest otherwise it will kill the reg-test
feature until we manage to get the issue fixed!

By the way, could you please explain in simple words the issue you've
noticed ? I tried to reverse the vtc file but I don't understand the
details nor what it tries to achieve. When I'm running a simple test
on a simple config, the CumConns always matches the CumReq, and when
running this test I'm seeing random values there in the output, but I
also see that they are retrieved before all connections are closed

But CurrConns is 0, so connections are (supposed to be?) closed? :

 h1    0.0 CLI recv|CurrConns: 0
 h1    0.0 CLI recv|CumConns: 27
 h1    0.0 CLI recv|CumReq: 27


, so
I'm not even sure the test is correct :-/

Thanks,
Willy


What I'm trying to achieve is, well.. testing for regressions that are 
not yet known to exist on the current stable version.


So what this test does in short:
It makes 4 clients simultaneously send a request to a threaded haproxy, 
which in turn connects 10x backend to frontend and then sends the 
request to the s1 server. This is done with the intent of having 
several connections started and broken up as fast as haproxy can process 
them, while trying to have a high probability of adding/removing items 
from lists/counters from different threads, thus possibly creating 
problems if some lock/sync isn't done correctly. After firing a few 
requests it also verifies the expected counts and results where possible..


History:
I've been bitten a few times with older releases by corruption occurring 
inside the POST data when uploading large (500MB+) files to a server 
behind haproxy. After a few megabytes were passed correctly, the resulting 
file would contain differences from the original when compared; the 
upload 'seemed' to succeed though. (This was then solved by installing a 
newer haproxy build..) Also, sometimes threads have locked up or 
crashed things, or the kqueue scheduler turned out to behave differently 
than others.. I've been trying to test such things manually but found I 
always forget to run some test. This is why I really like the concept of 
having a set of defined tests that validate haproxy is working 
'properly' on the OS I run it on.. Also, when some issue I ran into gets 
fixed I tend to run -dev builds on my production environment for a 
while, and well, it's nice to know that other functionality still works as 
it used to..


When writing this test I initially started with the idea of 
automatically testing a large file transfer through haproxy, but then 
wondered where / how to keep such a file, so I thought transferring a 
'large' header with increasing size 'might' trigger a similar 
condition.. Though in hindsight that might not actually test the same 
code paths..


The test I created, with 1 byte growth in the header together with 4000 
connections, didn't quite achieve that initial big-file simulation, but 
still I thought it ended up being a nice test, so I submitted it a while 
back ;) .. Anyhow, haproxy wasn't capable of doing much when dev2 was 
tagged, so I wasn't too worried the test failed at that time.. And you 
announced dev2 as such as well, so that was okay. And perhaps the issue 
found then would solve itself when further fixes on top of dev2 were 
added ;).


Anyhow, with dev3 I hoped all regressions would be fixed, but found this 
one still failed on 1.9-dev3. So I tuned the numbers in the previously 
submitted regtest down a little to avoid conntrack/sysctl default 
limits, while still failing the test 'reliably'.. I'm not sure what 
exactly is going on, or how bad it is that these numbers don't match up 
anymore.. Maybe it's only the counter that's not updated in a thread-safe 
way, perhaps there is a bigger issue lurking with sync points and 
whatnot..? Either way the test should pass as I understand it: the 4 
defined varnish clients got their answer back and CurrConns = 0, and 
adding a 3 second delay between waiting for the clients and checking the 
stats does not fix it... And as you've checked, with 1.8 it does pass. 
Though that too could perhaps be a coincidence; maybe things are 
processed even faster now but in a different order, so the test fails for 
the wrong reason?


Hope my thought process makes some sense :).

Regards,

PiBa-NL (Pieter)




Re: Allow configuration of pcre-config path

2018-09-30 Thread Willy Tarreau
On Sun, Sep 30, 2018 at 03:54:14PM +0200, Fabrice Fontaine wrote:
> OK, thanks for your quick review, see attached patch, I made two variables
> PCRE_CONFIG and PCRE2_CONFIG.

Thank you, now applied.

Willy



Re: Allow configuration of pcre-config path

2018-09-30 Thread Fabrice Fontaine
Dear Willy,

On Sun, 30 Sept 2018 at 14:38, Willy Tarreau  wrote:

> Hello Fabrice,
>
> On Sun, Sep 30, 2018 at 12:20:55PM +0200, Fabrice Fontaine wrote:
> > Dear all,
> >
> > I added haproxy to buildroot and to do so, I added a way of configuring
> > the
> > path of pcre-config and pcre2-config.
>
> This looks OK however I think from a users' perspective that it would be
> better to let the user specify the path to the pcre-config command instead
> of only the directory containing it. This gives more flexibility, for
> example allowing to have a different name than "pcre-config". Maybe you
> could simply call that variable "PCRE_CONFIG" in this case.
>
OK, thanks for your quick review, see attached patch, I made two variables
PCRE_CONFIG and PCRE2_CONFIG.
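
For example, a buildroot-style cross build could then do something like this
(target and paths purely illustrative):

  make TARGET=linux2628 USE_PCRE=1 \
       PCRE_CONFIG=/path/to/buildroot/host/usr/bin/pcre-config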

>
> > So, please find attached a patch. As
> > this is my first contribution to haproxy, please excuse me if I made any
> > mistakes.
>
> It's mostly OK. Please prefix the subject line with "BUILD:" so that we
> know it affects the build system (just run "git log Makefile" to see
> what we do), but that's just a cosmetic detail.
>
OK done.

>
> Thanks,
> Willy
>
Best Regards,

Fabrice
From 658cd370c3fa90484bfee1c493b7dd9c0248ac57 Mon Sep 17 00:00:00 2001
From: Fabrice Fontaine 
Date: Fri, 28 Sep 2018 19:21:26 +0200
Subject: [PATCH] BUILD: Allow configuration of pcre-config path

Add PCRE_CONFIG and PCRE2_CONFIG variables to allow the user to
configure the path of pcre-config or pcre2-config instead of using the
one in his path.
This is particularly useful when cross-compiling.

Signed-off-by: Fabrice Fontaine 
---
 Makefile | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 382f944f..074e0169 100644
--- a/Makefile
+++ b/Makefile
@@ -78,9 +78,13 @@
 # Other variables :
 #   DLMALLOC_SRC   : build with dlmalloc, indicate the location of dlmalloc.c.
 #   DLMALLOC_THRES : should match PAGE_SIZE on every platform (default: 4096).
+#   PCRE_CONFIG: force the binary path to get pcre config (by default
+#  pcre-config)
 #   PCREDIR: force the path to libpcre.
 #   PCRE_LIB   : force the lib path to libpcre (defaults to $PCREDIR/lib).
 #   PCRE_INC   : force the include path to libpcre ($PCREDIR/inc)
+#   PCRE2_CONFIG   : force the binary path to get pcre2 config (by default
+#   pcre2-config)
 #   SSL_LIB: force the lib path to libssl/libcrypto
 #   SSL_INC: force the include path to libssl/libcrypto
 #   LUA_LIB: force the lib path to lua
@@ -734,7 +738,8 @@ endif
 # Forcing PCREDIR to an empty string will let the compiler use the default
 # locations.
 
-PCREDIR	:= $(shell pcre-config --prefix 2>/dev/null || echo /usr/local)
+PCRE_CONFIG	:= pcre-config
+PCREDIR	:= $(shell $(PCRE_CONFIG) --prefix 2>/dev/null || echo /usr/local)
 ifneq ($(PCREDIR),)
 PCRE_INC:= $(PCREDIR)/include
 PCRE_LIB:= $(PCREDIR)/lib
@@ -759,7 +764,8 @@ endif
 endif
 
 ifneq ($(USE_PCRE2)$(USE_STATIC_PCRE2)$(USE_PCRE2_JIT),)
-PCRE2DIR	:= $(shell pcre2-config --prefix 2>/dev/null || echo /usr/local)
+PCRE2_CONFIG 	:= pcre2-config
+PCRE2DIR	:= $(shell $(PCRE2_CONFIG) --prefix 2>/dev/null || echo /usr/local)
 ifneq ($(PCRE2DIR),)
 PCRE2_INC   := $(PCRE2DIR)/include
 PCRE2_LIB   := $(PCRE2DIR)/lib
@@ -777,7 +783,7 @@ endif
 endif
 
 
-PCRE2_LDFLAGS	:= $(shell pcre2-config --libs$(PCRE2_WIDTH) 2>/dev/null || echo -L/usr/local/lib -lpcre2-$(PCRE2_WIDTH))
+PCRE2_LDFLAGS	:= $(shell $(PCRE2_CONFIG) --libs$(PCRE2_WIDTH) 2>/dev/null || echo -L/usr/local/lib -lpcre2-$(PCRE2_WIDTH))
 
 ifeq ($(PCRE2_LDFLAGS),)
 $(error libpcre2-$(PCRE2_WIDTH) not found)
-- 
2.17.1



Re: Allow configuration of pcre-config path

2018-09-30 Thread Willy Tarreau
Hello Fabrice,

On Sun, Sep 30, 2018 at 12:20:55PM +0200, Fabrice Fontaine wrote:
> Dear all,
> 
> I added haproxy to buildroot and to do so, I added a way of configuring the
> path of pcre-config and pcre2-config.

This looks OK however I think from a users' perspective that it would be
better to let the user specify the path to the pcre-config command instead
of only the directory containing it. This gives more flexibility, for
example allowing to have a different name than "pcre-config". Maybe you
could simply call that variable "PCRE_CONFIG" in this case.

> So, please find attached a patch. As
> this is my first contribution to haproxy, please excuse me if I made any
> mistakes.

It's mostly OK. Please prefix the subject line with "BUILD:" so that we
know it affects the build system (just run "git log Makefile" to see
what we do), but that's just a cosmetic detail.

Thanks,
Willy



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Willy Tarreau
On Sun, Sep 30, 2018 at 02:35:24PM +0300, Ciprian Dorin Craciun wrote:
> One question about this:  if the client gradually reads from the
> (server side) buffer but doesn't completely clear it, would this
> connection be considered "live" with `TCP_USER_TIMEOUT` configured?

yes, that's it.

> More specifically, say there is 4MB in the server buffer and the
> client "consumes" (i.e. acknowledges) only small parts of it, would
> the timeout apply as:
> (A) until the entire buffer is cleared, or
> (B) until at least "some" amount of data is read;

The timeout is an inactivity period. So let's say you set 10s in tcp-ut,
it would only kill the connection if the client acks nothing in 10s, even
if it takes 3 minutes to dump the whole buffer. It's mostly used in
environments with very long connections where clients may disappear
without warning, such as websocket connections or webmails.
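
For example (values purely illustrative), the combination would look like:

  frontend fe_https
      # give slow-but-alive clients plenty of time to drain the buffers,
      # but drop clients that stop acknowledging anything for 10 seconds
      bind :443 ssl crt /etc/haproxy/site.pem tcp-ut 10s
      timeout client 2m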

Willy



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 2:22 PM Willy Tarreau  wrote:
> > As seen the timeout which I believe is the culprit is the `timeout
> > client 30s` which I guess is quite enough.
>
> I tend to consider that if the response starts to be sent,
> then the most expensive part was done and it'd better be completed
> otherwise the client will try again and inflict the same cost to the
> server again.


I prefer shorter timeout values because on the server side I have
uWSGI with Python, and with its default model (one process / request
at one time), having long outstanding connections could degrade the
user experience.


> You should probably increase this enough so that you
> don't see unexpected timeouts anymore, and rely on tcp-ut to cut early
> if a client doesn't read the data.


One question about this:  if the client gradually reads from the
(server side) buffer but doesn't completely clear it, would this
connection be considered "live" with `TCP_USER_TIMEOUT` configured?
More specifically, say there is 4MB in the server buffer and the
client "consumes" (i.e. acknowledges) only small parts of it, would
the timeout apply as:
(A) until the entire buffer is cleared, or
(B) until at least "some" amount of data is read;

Thanks,
Ciprian.



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Willy Tarreau
On Sun, Sep 30, 2018 at 12:23:20PM +0300, Ciprian Dorin Craciun wrote:
> On Sun, Sep 30, 2018 at 12:12 PM Willy Tarreau  wrote:
> > > Anyway, why am I trying to configure the sending buffer size:  if I
> > > have large downloads and I have (some) slow clients, and as a
> > > consequence HAProxy times out waiting for the kernel buffer to clear.
> >
> > Thus you might have very short timeouts! Usually it's not supposed to
> > be an issue.
> 
> I wouldn't say they are "small":
> 
> timeout server 60s
> timeout server-fin 6s
> timeout client 30s
> timeout client-fin 6s
> timeout tunnel 180s
> timeout connect 6s
> timeout queue 30s
> timeout check 6s
> timeout tarpit 30s
> 
> 
> As seen the timeout which I believe is the culprit is the `timeout
> client 30s` which I guess is quite enough.

It's enough for a 2 Mbps bandwidth on the client, not for less. I don't
see the point in setting too short timeouts on the client side for data
transfers, I tend to consider that if the response starts to be sent,
then the most expensive part was done and it'd better be completed
otherwise the client will try again and inflict the same cost to the
server again. You should probably increase this enough so that you
don't see unexpected timeouts anymore, and rely on tcp-ut to cut early
if a client doesn't read the data.

Willy



Allow configuration of pcre-config path

2018-09-30 Thread Fabrice Fontaine
Dear all,

I added haproxy to buildroot and to do so, I added a way of configuring the
path of pcre-config and pcre2-config. So, please find attached a patch. As
this is my first contribution to haproxy, please excuse me if I made any
mistakes.

Best Regards,

Fabrice
From f3dcdf6c9ffea4d9b89dca9706a48c44bd76c470 Mon Sep 17 00:00:00 2001
From: Fabrice Fontaine 
Date: Fri, 28 Sep 2018 19:21:26 +0200
Subject: [PATCH] Allow configuration of pcre-config path

Add a PCRE_CONFIGDIR variable to allow the user to configure the path of
pcre-config or pcre2-config instead of using the one in his path.
This is particularly useful when cross-compiling.

Signed-off-by: Fabrice Fontaine 
---
 Makefile | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 382f944f..7c31f1ba 100644
--- a/Makefile
+++ b/Makefile
@@ -78,6 +78,7 @@
 # Other variables :
 #   DLMALLOC_SRC   : build with dlmalloc, indicate the location of dlmalloc.c.
 #   DLMALLOC_THRES : should match PAGE_SIZE on every platform (default: 4096).
+#   PCRE_CONFIGDIR : force the path to pcre-config or pcre-config2
 #   PCREDIR: force the path to libpcre.
 #   PCRE_LIB   : force the lib path to libpcre (defaults to $PCREDIR/lib).
 #   PCRE_INC   : force the include path to libpcre ($PCREDIR/inc)
@@ -734,7 +735,7 @@ endif
 # Forcing PCREDIR to an empty string will let the compiler use the default
 # locations.
 
-PCREDIR	:= $(shell pcre-config --prefix 2>/dev/null || echo /usr/local)
+PCREDIR	:= $(shell $(PCRE_CONFIGDIR)pcre-config --prefix 2>/dev/null || echo /usr/local)
 ifneq ($(PCREDIR),)
 PCRE_INC:= $(PCREDIR)/include
 PCRE_LIB:= $(PCREDIR)/lib
@@ -759,7 +760,7 @@ endif
 endif
 
 ifneq ($(USE_PCRE2)$(USE_STATIC_PCRE2)$(USE_PCRE2_JIT),)
-PCRE2DIR	:= $(shell pcre2-config --prefix 2>/dev/null || echo /usr/local)
+PCRE2DIR	:= $(shell $(PCRE_CONFIGDIR)pcre2-config --prefix 2>/dev/null || echo /usr/local)
 ifneq ($(PCRE2DIR),)
 PCRE2_INC   := $(PCRE2DIR)/include
 PCRE2_LIB   := $(PCRE2DIR)/lib
@@ -777,7 +778,7 @@ endif
 endif
 
 
-PCRE2_LDFLAGS	:= $(shell pcre2-config --libs$(PCRE2_WIDTH) 2>/dev/null || echo -L/usr/local/lib -lpcre2-$(PCRE2_WIDTH))
+PCRE2_LDFLAGS	:= $(shell $(PCRE_CONFIGDIR)pcre2-config --libs$(PCRE2_WIDTH) 2>/dev/null || echo -L/usr/local/lib -lpcre2-$(PCRE2_WIDTH))
 
 ifeq ($(PCRE2_LDFLAGS),)
 $(error libpcre2-$(PCRE2_WIDTH) not found)
-- 
2.17.1



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 12:12 PM Willy Tarreau  wrote:
> > If so then by not setting it the kernel should choose the default
> > value, which according to:
> > 
> > > sysctl net.ipv4.tcp_wmem
> > net.ipv4.tcp_wmem = 4096  16384   4194304
> > 
> > , should be 16384.
>
> No, it *starts* at 16384 then grows up to the configured limit depending
> on the ability to do so without losses and the available memory.


OK.  The Linux man-page elides this part...  Good to know.  :)




> > Anyway, why am I trying to configure the sending buffer size:  if I
> > have large downloads and I have (some) slow clients, and as a
> > consequence HAProxy times out waiting for the kernel buffer to clear.
>
> Thus you might have very short timeouts! Usually it's not supposed to
> be an issue.

I wouldn't say they are "small":

timeout server 60s
timeout server-fin 6s
timeout client 30s
timeout client-fin 6s
timeout tunnel 180s
timeout connect 6s
timeout queue 30s
timeout check 6s
timeout tarpit 30s


As seen the timeout which I believe is the culprit is the `timeout
client 30s` which I guess is quite enough.


> > However if I configure the buffer size small enough it seems HAProxy
> > is "kept busy" and nothing breaks.
>
> I see but then maybe you should simply lower the tcp_wmem max value a
> little bit, or increase your timeout ?

I'll try to experiment with `tcp_wmem max` as you've suggested.
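
Probably something like the following, if I understood the suggestion
correctly (keeping the min and default and only lowering the max; the exact
value is just a guess):

  sysctl -w net.ipv4.tcp_wmem="4096 16384 1048576"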


> > Thus, is there a way to have both OK bandwidth for normal clients, and
> > not timeout for slow clients?
>
> That's exactly the role of the TCP stack. It measures RTT and losses and
> adjusts the send window accordingly. You must definitely let the TCP
> stack play its role there, you'll have much less problems. Even if you
> keep 4 MB as the max send window, for a 1 Mbps client that's roughly 40
> seconds of transfer. You can deal with this using much larger timeouts
> (1 or 2 minutes), and configure the tcp-ut value on the bind line to
> get rid of clients which do not ACK the data they're being sent at the
> TCP level.

I initially let the TCP "do its thing", but it got me into trouble
with poor wireless clients...

I'll also give `tcp-ut` a try as suggested.

Thanks,
Ciprian.



Re: [ANNOUNCE] haproxy-1.9-dev3

2018-09-30 Thread Aleksandar Lazic
Hi Willy.

On 30.09.2018 at 11:05, Willy Tarreau wrote:
> Hi Aleks,
> 
> On Sun, Sep 30, 2018 at 10:38:20AM +0200, Aleksandar Lazic wrote:
>> Do you have any release date for 1.9, as I plan to launch some new site and
>> thought to use 1.9 from the beginning because it sounds like 1.9 will be 
>> able
>> to handle h2 with the backend.
> 
> It's initially planned for end of October/early November, but I think we'll
> stretch the months a little bit. The extremely difficult part is the rework
> of the HTTP engine to migrate to the native internal representation which
> is needed to transport H2 semantics from end to end. While a huge amount
> of work has been done on this, it also uncovered some very old design
> heritage that needed to be replaced and that takes time to address, such
> as the changes to logging and error snapshots to make them work out of
> streams, or the change of connection orientation which we initially expected
> to postpone after 1.9 but that we discovered late is mandatory to finish the
> work, and the change of the idle connections that's needed to maintain
> keep-alive on the backend side.
> 
> These changes have a huge impact on the code and the architecture, so as
> per the technical vs functional release cycle, I'd really want to have
> this in 1.9 so that we have all the basis for much cleaner and calmer
> development for 2.0. But I'm sure we will face yet more surprises.
> 
> Thus if we see that it's definitely not workable to complete these changes
> by ~November, we'll possibly release without them but will put all of them
> in a -next branch that we'll merge soon after the release. However if we
> manage to have something almost done, I'm willing to push the deadline a
> little bit further to let this be finished. Christopher suggested that we
> might have a 3rd option which is to have the two implementations side by
> side and that we decide by configuration which one to use depending on
> the desired features. That's indeed an option (a temporary one) but I
> don't like it much due to the risk of increased complexity with bug
> reports. That's still definitely something to keep in mind anyway.

I agree here with you.

> I sincerely hope it's the last time we engage in such complex changes in a
> single version! I got caught several years ago during the 1.5 development
> and this time it's even more complex than what we had to redesign by then!

Well, when I think back to 2003, haproxy is now completely different; cool 
evolution ;-)

> Hoping this clarifies the situation a bit,

Yes definitely.
I will start with 1.8 just to be on the safe side.

Thank you for your always detailed answer. ;-)

> Willy

Best regards
Aleks




Re: [ANNOUNCE] haproxy-1.9-dev3

2018-09-30 Thread Willy Tarreau
Hi Aleks,

On Sun, Sep 30, 2018 at 10:38:20AM +0200, Aleksandar Lazic wrote:
> Do you have any release date for 1.9, as I plan to launch some new site and
> thought to use 1.9 from the beginning because it sounds like 1.9 will be able
> to handle h2 with the backend.

It's initially planned for end of October/early November, but I think we'll
stretch the months a little bit. The extremely difficult part is the rework
of the HTTP engine to migrate to the native internal representation which
is needed to transport H2 semantics from end to end. While a huge amount
of work has been done on this, it also uncovered some very old design
heritage that needed to be replaced and that takes time to address, such
as the changes to logging and error snapshots to make them work out of
streams, or the change of connection orientation which we initially expected
to postpone after 1.9 but that we discovered late is mandatory to finish the
work, and the change of the idle connections that's needed to maintain
keep-alive on the backend side.

These changes have a huge impact on the code and the architecture, so as
per the technical vs functional release cycle, I'd really want to have
this in 1.9 so that we have all the basis for much cleaner and calmer
development for 2.0. But I'm sure we will face yet more surprises.

Thus if we see that it's definitely not workable to complete these changes
by ~November, we'll possibly release without them but will put all of them
in a -next branch that we'll merge soon after the release. However if we
manage to have something almost done, I'm willing to push the deadline a
little bit further to let this be finished. Christopher suggested that we
might have a 3rd option which is to have the two implementations side by
side and that we decide by configuration which one to use depending on
the desired features. That's indeed an option (a temporary one) but I
don't like it much due to the risk of increased complexity with bug
reports. That's still definitely something to keep in mind anyway.

I sincerely hope it's the last time we engage in such complex changes in a
single version! I got caught several years ago during the 1.5 development
and this time it's even more complex than what we had to redesign by then!

Hoping this clarifies the situation a bit,
Willy



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 11:41 AM Ciprian Dorin Craciun
 wrote:
> > - tune.sndbuf.client 16384 allows you to have 16384 bytes "on-the-fly",
> > meaning unacknowledged. 16384 / 0.16 sec = roughly 100 KB/s
> > - do the math with your value of 131072 and you will get your ~800
> > KB/s.


However, something bothers me...  `tune.sndbuf.client` is only
used to call `setsockopt(SO_SNDBUF)`, right?  It is not used by
HAProxy for any internal buffer size?

If so then by not setting it the kernel should choose the default
value, which according to:

> sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096  16384   4194304

, should be 16384.

Looking with `netstat` at the `Recv-Q` column, it seems that with no
`tune` setting the value even goes up to 5 MB.
However setting the `tune` parameter it always goes up to around 20 KB.




Anyway, why am I trying to configure the sending buffer size:  if I
have large downloads and I have (some) slow clients, and as a
consequence HAProxy times out waiting for the kernel buffer to clear.
However if I configure the buffer size small enough it seems HAProxy
is "kept busy" and nothing breaks.

Thus, is there a way to have both OK bandwidth for normal clients, and
not timeout for slow clients?

Thanks,
Ciprian.



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 11:33 AM Mathias Weiersmüller
 wrote:
> Sorry for the extremely brief answer:
> - you mentioned you have 160 ms latency.

Yes, I have mentioned this because I've read somewhere (not
remembering now where), that the `SO_SNDBUF` socket option also
impacts the TCP window size.


> - tune.sndbuf.client 16384 allows you to have 16384 bytes "on-the-fly",
> meaning unacknowledged. 16384 / 0.16 sec = roughly 100 KB/s
> - do the math with your value of 131072 and you will get your ~800 KB/s.
> - no hidden voodoo happening here: read about BDP (Bandwidth Delay Product)

Please don't get me wrong:  I didn't imply any "voodoo".  :)

When I asked if there is some "hidden" consequence I didn't meant it
as "magic", but as a question for what other (unknown to me)
consequences there are.

And it seems that the `tune.sndbuf.client` also limits the TCP window size.

So my question is: how can I (if at all possible) configure the buffer
size without "breaking" the TCP window size?

Thanks,
Ciprian.



Re: [ANNOUNCE] haproxy-1.9-dev3

2018-09-30 Thread Aleksandar Lazic
On 29.09.2018 at 20:41, Willy Tarreau wrote:
> Subject: [ANNOUNCE] haproxy-1.9-dev3
> To: haproxy@formilux.org
> 
> Hi,
> 
> Now that Kernel Recipes is over (it was another awesome edition), I'm back
> to my haproxy activities. Well, I was pleased to see that my coworkers
> reserved me a nice surprise by fixing the pending bugs that were plaguing
> dev2. I should go to conferences more often, maybe it's a message from
> them to make me understand I'm disturbing them when I'm at the office ;-)

;-)

> So I thought that it was a good opportunity to issue dev3 now and make it
> what dev2 should have been, and forget that miserable one, even though I
> was told that I'll soon get another batch of patches to merge, but then
> we'll simply emit dev4 so there's no need to further delay pending fixes.
> 
> HAProxy 1.9-dev3 was released on 2018/09/29. It added 35 new commits
> after version 1.9-dev2.
> 
> There's nothing fancy here. The connection issues are supposedly addressed
> (please expect a bit more in this area soon). The HTTP/1 generic parser is
> getting smarter since we're reimplementing the features that were in the
> old HTTP code (content-length and transfer-encoding now handled). Lua now
> can access stick-tables. I haven't checked precisely how but I saw that
> Adis updated the doc so all info should be there.
> 
> Ah, a small change is that we now build with -Wextra after having addressed
> all warnings reported up to gcc 7.3 and filtered a few useless ones. If you
> get some build warnings, please report them along with your gcc version and
> your build options. I personally build with -Werror in addition to this one,
> and would like to keep this principle to catch certain bugs or new compiler
> jokes earlier in the future.
> 
> As usual, this is an early development version. It's fine if you want to
> test the changes, but avoid putting this into production if it can cost
> you your job!

Do you have any release date for 1.9, as I plan to launch some new site and
thought to use 1.9 from the beginning because it sounds like 1.9 will be able
to handle h2 with the backend.

> Please find the usual URLs below :
>Site index   : http://www.haproxy.org/
>Discourse: http://discourse.haproxy.org/
>Sources  : http://www.haproxy.org/download/1.9/src/
>Git repository   : http://git.haproxy.org/git/haproxy.git/
>Git Web browsing : http://git.haproxy.org/?p=haproxy.git
>Changelog: http://www.haproxy.org/download/1.9/src/CHANGELOG
>Cyril's HTML doc : http://cbonte.github.io/haproxy-dconv/

Docker Image is updated.
https://hub.docker.com/r/me2digital/haproxy19/

> Willy

Regards
Aleks

> ---
> Complete changelog :
> Adis Nezirovic (1):
>   MEDIUM: lua: Add stick table support for Lua.
> 
> Bertrand Jacquin (1):
>   DOC: Fix typos in lua documentation
> 
> Christopher Faulet (3):
>   MINOR: h1: Add H1_MF_XFER_LEN flag
>   BUG/MEDIUM: h1: Really skip all updates when incomplete messages are 
> parsed
>   BUG/MEDIUM: http: Don't parse chunked body if there is no input data
> 
> Dragan Dosen (1):
>   BUG/MEDIUM: patterns: fix possible double free when reloading a pattern 
> list
> 
> Moemen MHEDHBI (1):
>   DOC: Update configuration doc about the maximum number of stick 
> counters.
> 
> Olivier Houchard (4):
>   BUG/MEDIUM: process_stream: Don't use si_cs_io_cb() in process_stream().
>   MINOR: h2/stream_interface: Reintroduce te wake() method.
>   BUG/MEDIUM: h2: Wake the task instead of calling h2_recv()/h2_process().
>   BUG/MEDIUM: process_stream(): Don't wake the task if no new data was 
> received.
> 
> Willy Tarreau (24):
>   BUG/MINOR: h1: don't consider the status for each header
>   MINOR: h1: report in the h1m struct if the HTTP version is 1.1 or above
>   MINOR: h1: parse the Connection header field
>   MINOR: http: add http_hdr_del() to remove a header from a list
>   MINOR: h1: add headers to the list after controls, not before
>   MEDIUM: h1: better handle transfer-encoding vs content-length
>   MEDIUM: h1: deduplicate the content-length header
>   CLEANUP/CONTRIB: hpack: remove some h1 build warnings
>   BUG/MINOR: tools: fix set_net_port() / set_host_port() on IPv4
>   BUG/MINOR: cli: make sure the "getsock" command is only called on 
> connections
>   MINOR: stktable: provide an unchecked version of stktable_data_ptr()
>   MINOR: stream-int: make si_appctx() never fail
>   BUILD: ssl_sock: remove build warnings on potential null-derefs
>   BUILD: stats: remove build warnings on potential null-derefs
>   BUILD: stream: address null-deref build warnings at -Wextra
>   BUILD: http: address a couple of null-deref warnings at -Wextra
>   BUILD: log: silent build warnings due to unchecked 
> __objt_{server,applet}
>   BUILD: dns: fix null-deref build warning at -Wextra
>   BUILD: checks: silence a null-deref build warning at 

Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Mathias Weiersmüller
However the bandwidth behaviour is exactly the same:
* no `tune.sndbuf.client`, bandwidth goes up to 11 MB/s for a large download;
* with `tune.sndbuf.client 16384` it goes up to ~110 KB/s;
* with `tune.sndbuf.client 131072` it goes up to ~800 KB/s;
* with `tune.sndbuf.client 262144` it goes up to ~1400 KB/s; (These are 
bandwidths obtained after the TCP window has "settled".)

It seems there is a linear correlation between that tune parameter and the 
bandwidth.


However due to the fact that I get the same behaviour both with and without 
offloading, I wonder if there isn't somehow a "hidden"
consequence of setting this `tune.sndbuf.client` parameter?

==

Sorry for the extremely brief answer:
- you mentioned you have 160 ms latency.
- tune.sndbuf.client 16384 allows you to have 16384 bytes "on-the-fly", meaning 
unacknowledged. 16384 / 0.16 sec = roughly 100 KB/s 
- do the math with your value of 131072 and you will get your ~800 KB/s.
- no hidden voodoo happening here: read about BDP (Bandwidth Delay Product)
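
Spelling the same arithmetic out for the other values (rounded, and ignoring
protocol overhead, which is presumably why the last measurement comes out a
bit lower):

   16384 B / 0.16 s =  102400 B/s ~  100 KB/s  (observed ~110 KB/s)
  131072 B / 0.16 s =  819200 B/s ~  800 KB/s  (observed ~800 KB/s)
  262144 B / 0.16 s = 1638400 B/s ~ 1600 KB/s  (observed ~1400 KB/s)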

Cheers

Matti



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 10:35 AM Willy Tarreau  wrote:
> Note that these are not fragments but segments. And as Matti suggested,
> it's indeed due to GSO, you're seeing two TCP frames sent at once through
> the stack, and they will be segmented by the NIC.

I have disabled all offloading features:

tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off


Now I see "as expected" Ethernet frames with `tcpdump` / `Wireshark`.
(There is indeed however a bump in kernel CPU usage.)


However the bandwidth behaviour is exactly the same:
* no `tune.sndbuf.client`, bandwidth goes up to 11 MB/s for a large download;
* with `tune.sndbuf.client 16384` it goes up to ~110 KB/s;
* with `tune.sndbuf.client 131072` it goes up to ~800 KB/s;
* with `tune.sndbuf.client 262144` it goes up to ~1400 KB/s;
(These are bandwidths obtained after the TCP window has "settled".)

It seems there is a linear correlation between that tune parameter and
the bandwidth.


However due to the fact that I get the same behaviour both with and
without offloading, I wonder if there isn't somehow a "hidden"
consequence of setting this `tune.sndbuf.client` parameter?

Thanks,
Ciprian.



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 10:35 AM Willy Tarreau  wrote:
> On Sun, Sep 30, 2018 at 10:20:06AM +0300, Ciprian Dorin Craciun wrote:
> > I was just trying to replicate the issue I've seen yesterday, and for
> > a moment (in initial tests) I was able to.  However on repeated tests
> > it seems that the `tune.rcvbuf.*` (and related) have no impact, as I
> > constantly see TCP fragments (around 2842 bytes Ethernet frames).
>
> Note that these are not fragments but segments. And as Matti suggested,
> it's indeed due to GSO, you're seeing two TCP frames sent at once through
> the stack, and they will be segmented by the NIC.


[Just as info.]

So it seems I was able to reproduce the bandwidth issue by only toying
with `tune.sndbuf.client`:
* with no value, downloading an 8 MB file I get decent bandwidth
around 4 MB/s;  (for larger files I even get up to 10 MB/s);  (a
typical Ethernet frame length as reported by `tcpdump` is around 59 KB
towards the end of the transfer;)
* with that tune parameter set to 128 KB, I get around 1 MB/s;  (a
typical Ethernet frame length is around 4 KB;)
* with that tune parameter set to 16 KB, I get around 100 KB/s;  (a
typical Ethernet frame length is around 2 KB;)


By "typical Ethernet frame length" I mean that a packet as reported by
`tcpdump` and viewed in Wireshark looks like this (for the first one):

Frame 1078: 59750 bytes on wire (478000 bits), 59750 bytes captured
(478000 bits)
Encapsulation type: Ethernet (1)
Arrival Time: Sep 30, 2018 10:26:58.667739000 EEST
[Time shift for this packet: 0.0 seconds]
Epoch Time: 1538292418.667739000 seconds
[Time delta from previous captured frame: 0.18000 seconds]
[Time delta from previous displayed frame: 0.18000 seconds]
[Time since reference or first frame: 1.901135000 seconds]
Frame Number: 1078
Frame Length: 59750 bytes (478000 bits)
Capture Length: 59750 bytes (478000 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ethertype:ip:tcp:ssl:ssl]
[Coloring Rule Name: TCP]
[Coloring Rule String: tcp]
Ethernet II, Src: f2:3c:91:9f:51:b8 (f2:3c:91:9f:51:b8), Dst:
Cisco_9f:f0:0a (00:00:0c:9f:f0:0a)
Destination: Cisco_9f:f0:0a (00:00:0c:9f:f0:0a)
Source: f2:3c:91:9f:51:b8 (f2:3c:91:9f:51:b8)
Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: XXX.XXX.XXX.XXX, Dst: XXX.XXX.XXX.XXX
0100  = Version: 4
 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
 00.. = Differentiated Services Codepoint: Default (0)
 ..00 = Explicit Congestion Notification: Not ECN-Capable
Transport (0)
Total Length: 59736
Identification: 0x8d7a (36218)
Flags: 0x4000, Don't fragment
0...    = Reserved bit: Not set
.1..    = Don't fragment: Set
..0.    = More fragments: Not set
...0    = Fragment offset: 0
Time to live: 64
Protocol: TCP (6)
Header checksum: 0x3054 [validation disabled]
[Header checksum status: Unverified]
Source: XXX.XXX.XXX.XXX
Destination: XXX.XXX.XXX.XXX
Transmission Control Protocol, Src Port: 443, Dst Port: 38150, Seq:
8271805, Ack: 471, Len: 59684
Source Port: 443
Destination Port: 38150
[Stream index: 0]
[TCP Segment Len: 59684]
Sequence number: 8271805(relative sequence number)
[Next sequence number: 8331489(relative sequence number)]
Acknowledgment number: 471(relative ack number)
1000  = Header Length: 32 bytes (8)
Flags: 0x010 (ACK)
Window size value: 234
[Calculated window size: 29952]
[Window size scaling factor: 128]
Checksum: 0x7d1c [unverified]
[Checksum Status: Unverified]
Urgent pointer: 0
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
[SEQ/ACK analysis]
[Timestamps]
TCP payload (59684 bytes)
TCP segment data (16095 bytes)
TCP segment data (10779 bytes)
[2 Reassembled TCP Segments (16405 bytes): #1073(310), #1078(16095)]
[Frame: 1073, payload: 0-309 (310 bytes)]
[Frame: 1078, payload: 310-16404 (16095 bytes)]
[Segment count: 2]
[Reassembled TCP length: 16405]
Secure Sockets Layer
Secure Sockets Layer


I'll try to disable offloading and see what happens.

I forgot to say that this is a paravirtualized VM running on Linode in
their Dallas datacenter.

Ciprian.



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Willy Tarreau
On Sun, Sep 30, 2018 at 10:20:06AM +0300, Ciprian Dorin Craciun wrote:
> On Sun, Sep 30, 2018 at 10:06 AM Mathias Weiersmüller
>  wrote:
> > I am pretty sure you have TCP segmentation offload enabled. The TCP/IP
> > stack therefore sends bigger-than-allowed TCP segments towards the NIC, which
> > in turn takes care of the proper segmentation.
> 
> I was just trying to replicate the issue I've seen yesterday, and for
> a moment (in initial tests) I was able to.  However on repeated tests
> it seems that the `tune.rcvbuf.*` (and related) have no impact, as I
> constantly see TCP fragments (around 2842 bytes Ethernet frames).

Note that these are not fragments but segments. And as Matti suggested,
it's indeed due to GSO, you're seeing two TCP frames sent at once through
the stack, and they will be segmented by the NIC.

> > You want to check the output of "ethtool -k eth0" and the values of:
> > tcp-segmentation-offload
> > generic-segmentation-offload
> 
> The output of `ethtool -k eth0` is below:
> 
> tcp-segmentation-offload: on
> tx-tcp-segmentation: on
> tx-tcp-ecn-segmentation: on
> tx-tcp-mangleid-segmentation: off
> tx-tcp6-segmentation: on
> generic-segmentation-offload: on
> 

Indeed.

Willy



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Willy Tarreau
On Sun, Sep 30, 2018 at 07:06:29AM +, Mathias Weiersmüller wrote:
> I am pretty sure you have TCP segmentation offload enabled. The TCP/IP stack
> therefore sends bigger-than-allowed TCP segments towards the NIC, which in turn
> takes care of the proper segmentation.
> 
> You want to check the output of "ethtool -k eth0" and the values of:
> tcp-segmentation-offload
> generic-segmentation-offload

Yep totally agreed, as soon as you have either GSO or TSO, you will see
large frames. Ciprian, in this case it's better to capture from another
machine in the path to get a reliable capture. You can also disable
TSO/GSO using ethtool -K, but be prepared to see a significant bump in
CPU usage. Don't do this if you are already running above 20% CPU usage
on average.
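
Something like this (interface name to be adapted, of course):

  ethtool -K eth0 tso off gso off gro off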

Willy



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 10:06 AM Mathias Weiersmüller
 wrote:
> I am pretty sure you have TCP segmentation offload enabled. The TCP/IP stack 
> therefore sends bigger-than-allowed TCP segments towards the NIC, which in turn 
> takes care of the proper segmentation.

I was just trying to replicate the issue I've seen yesterday, and for
a moment (in initial tests) I was able to.  However on repeated tests
it seems that the `tune.rcvbuf.*` (and related) have no impact, as I
constantly see TCP fragments (around 2842 bytes Ethernet frames).


> You want to check the output of "ethtool -k eth0" and the values of:
> tcp-segmentation-offload
> generic-segmentation-offload

The output of `ethtool -k eth0` is below:

tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
generic-segmentation-offload: on


Thanks,
Ciprian.



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Mathias Weiersmüller
I am pretty sure you have TCP segmentation offload enabled. The TCP/IP stack 
therefore sends bigger-than-allowed TCP segments towards the NIC, which in turn 
takes care of the proper segmentation.

You want to check the output of "ethtool -k eth0" and the values of:
tcp-segmentation-offload
generic-segmentation-offload

Cheers

Mathias


-Original Message-
From: Ciprian Dorin Craciun  
Sent: Sunday, 30 September 2018 08:30
To: w...@1wt.eu
Cc: haproxy@formilux.org
Subject: Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their 
`tune.*.client` equivalents) lead to TCP fragmentation?

On Sun, Sep 30, 2018 at 9:08 AM Willy Tarreau  wrote:
> > I've played with `tune.rcvbuf.server`, `tune.sndbuf.server`, 
> > `tune.rcvbuf.client`, and `tune.sndbuf.client` and explicitly set 
> > them to various values ranging from 4k to 256k.  Unfortunately in 
> > all cases it seems that this generates too large TCP packets (larger 
> > than the advertised and agreed MSS in both direction), which in turn 
> > leads to TCP fragmentation and reassembly.  (Both client and server 
> > are Linux
> > >4.10.  The protocol used was HTTP 1.1 over TLS 1.2.)
>
> No no no, I'm sorry but this is not possible at all. You will never 
> find a single TCP stack doing this! I'm pretty sure there is an issue 
> somewhere in your capture or analysis.
>
> [...]
>
> However, if the problem you're experiencing is only with the listening 
> side, there's an "mss" parameter that you can set on your "bind" lines 
> to enforce a lower MSS, it may be a workaround in your case. I'm 
> personally using it at home to reduce the latency over ADSL ;-)


I am also extremely skeptical that this is HAProxy's fault, however the only 
change needed to eliminate this issue was commenting out these tune arguments.  
I have also explicitly set the `mss` parameter to `1400`.

The capture was taken directly on the server on the public interface.

I'll try to make a fresh capture to see if I can replicate this.


> > The resulting bandwidth was around 10 MB.
>
> Please use correct units when reporting issues, in order to reduce the 
> confusion. "10 MB" is not a bandwidth but a size (10 megabytes). Most 
> likely you want to mean 10 megabytes per second (10 MB/s). But maybe 
> you even mean 10 megabits per second (10 Mb/s or 10 Mbps), which 
> equals
> 1.25 MB/s.

:)  Sorry for that.  (That's the outcome of writing emails at 3 AM after 4 hours 
of poking into a production system.)  I completely agree with you about the 
MB/Mb consistency, and I always hate that some providers still use MB to mean 
mega-bits, like it's 2000.  :)

Yes, I meant 10 mega-bytes / second.  Sorry again.

Ciprian.



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 9:08 AM Willy Tarreau  wrote:
> > I've played with `tune.rcvbuf.server`, `tune.sndbuf.server`,
> > `tune.rcvbuf.client`, and `tune.sndbuf.client` and explicitly set them
> > to various values ranging from 4k to 256k.  Unfortunately in all cases
> > it seems that this generates too large TCP packets (larger than the
> > advertised and agreed MSS in both direction), which in turn leads to
> > TCP fragmentation and reassembly.  (Both client and server are Linux
> > >4.10.  The protocol used was HTTP 1.1 over TLS 1.2.)
>
> No no no, I'm sorry but this is not possible at all. You will never find
> a single TCP stack doing this! I'm pretty sure there is an issue somewhere
> in your capture or analysis.
>
> [...]
>
> However, if the problem you're experiencing is only with the listening
> side, there's an "mss" parameter that you can set on your "bind" lines
> to enforce a lower MSS, it may be a workaround in your case. I'm
> personally using it at home to reduce the latency over ADSL ;-)


I am also extremely skeptical that this is HAProxy's fault, however
the only change needed to eliminate this issue was commenting out
these tune arguments.  I have also explicitly set the `mss` parameter
to `1400`.

The capture was taken directly on the server on the public interface.

I'll try to make a fresh capture to see if I can replicate this.


> > The resulting bandwidth was around 10 MB.
>
> Please use correct units when reporting issues, in order to reduce the
> confusion. "10 MB" is not a bandwidth but a size (10 megabytes). Most
> likely you want to mean 10 megabytes per second (10 MB/s). But maybe
> you even mean 10 megabits per second (10 Mb/s or 10 Mbps), which equals
> 1.25 MB/s.

:)  Sorry for that.  (That's the outcome of writing emails at 3 AM
after 4 hours of poking into a production system.)  I completely
agree with you about the MB/Mb consistency, and I always hate that
some providers still use MB to mean mega-bits, like it's 2000.  :)

Yes, I meant 10 mega-bytes / second.  Sorry again.

Ciprian.



Re: Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?

2018-09-30 Thread Willy Tarreau
Hi Ciprian,

On Sat, Sep 29, 2018 at 09:57:20PM +0300, Ciprian Dorin Craciun wrote:
> Hello all!
> 
> I've played with `tune.rcvbuf.server`, `tune.sndbuf.server`,
> `tune.rcvbuf.client`, and `tune.sndbuf.client` and explicitly set them
> to various values ranging from 4k to 256k.  Unfortunately in all cases
> it seems that this generates too large TCP packets (larger than the
> advertised and agreed MSS in both direction), which in turn leads to
> TCP fragmentation and reassembly.  (Both client and server are Linux
> >4.10.  The protocol used was HTTP 1.1 over TLS 1.2.)

No no no, I'm sorry but this is not possible at all. You will never find
a single TCP stack doing this! I'm pretty sure there is an issue somewhere
in your capture or analysis.

MSS is the maximum segment size and corresponds to the maximum *payload*
transported over TCP. It doesn't include the IP nor TCP headers. Usually
over Ethernet it's 1460, resulting in 1500 bytes packets. If you're seeing
fragments, it very likely is due to an intermediary router or firewall
which has a shorter MTU at some point, such as an IPSEC VPN, IP tunnel
or ADSL link, and which must fragment to deliver the data. Some such
equipment is capable of interfering with the MSS negotiation to reduce
it to fit the MTU reduction; you need to check on the affected equipment.

Also, regarding your initial question, tune.rcvbuf/sndbuf will have no
effect on all this since they only specify the extra buffer size in the
system.

However, if the problem you're experiencing is only with the listening
side, there's an "mss" parameter that you can set on your "bind" lines
to enforce a lower MSS, it may be a workaround in your case. I'm
personally using it at home to reduce the latency over ADSL ;-)
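
E.g. something like this on the listener (value purely illustrative):

  frontend fe_main
      bind :443 mss 1400   # clamp the advertised MSS on this listener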

> The resulting bandwidth was around 10 MB.

Please use correct units when reporting issues, in order to reduce the
confusion. "10 MB" is not a bandwidth but a size (10 megabytes). Most
likely you want to mean 10 megabytes per second (10 MB/s). But maybe
you even mean 10 megabits per second (10 Mb/s or 10 Mbps), which equals
1.25 MB/s.

Regards,
Willy