Re: Upgrade from 1.7 to 2.0 = increased CPU usage

2019-07-24 Thread Willy Tarreau
On Thu, Jul 25, 2019 at 02:36:49AM +0200, Elias Abacioglu wrote:
> Hi Willy,
> 
> This would explain the 503s
> ```
>   # change a 503 response into a 204 (a friendly decline).
>   errorfile 503 /etc/haproxy/errors/204.http
> 
>   acl is_disable path_beg /getuid/rogue-ad-exchange
>   # http-request deny defaults to 403, change it to a 503,
>   # which is a masked 204 since haproxy doesn't have a 204 errorfile.
>   http-request deny deny_status 503 if is_disable
> ```
> also
> ```
> backend robotstxt
>   errorfile 503 /etc/haproxy/errors/200.robots.http
> backend crossdomainxml
>   errorfile 503 /etc/haproxy/errors/200.crossdomain.http
> backend emptygif
>   errorfile 503 /etc/haproxy/errors/200.emptygif.http
> ```
> Basically I use 503 if I want to block a sender in a friendly way (i.e.
> making them believe we just declined the transaction) and to host three tiny
> files: robots.txt, crossdomain.xml and empty.gif.

But I'm pretty sure I've seen 503s *received* by haproxy, indicating
that the next component sent them, so these cannot be the ones you
produce with your configuration.

> It felt excessive to set up redundant web servers for a total of 703 bytes of
> files, and it also felt wasteful to serve them from the Java backend. So I
> cheated with haproxy's errorfiles.

Oh, don't worry, you're not the only one to do that :-)  I've even seen
an auto-generated config using one backend per file and an errorfile
matching the contents of each file of a directory, to replace a web
server!

> So I don't think the 503s cause retries for our clients; it's just me
> abusing haproxy.

I'm really speaking about 503s being received by haproxy and delivered
as 503s to the clients, not about 503s in the logs that were in fact
rewritten differently. Look here:

10:51:13.776098 recvfrom(44797, "HTTP/1.1 503 Service Unavailable"..., 16320, 0, NULL, NULL) = 55
10:51:13.776184 recvfrom(19524, "HTTP/1.1 503 Service Unavailable"..., 16320, 0, NULL, NULL) = 55
10:51:13.776272 recvfrom(57869, "HTTP/1.1 503 Service Unavailable"..., 16320, 0, NULL, NULL) = 55
10:51:13.776391 recvfrom(35693, "HTTP/1.1 503 Service Unavailable"..., 16320, 0, NULL, NULL) = 55
10:51:13.776613 recvfrom(8041, "HTTP/1.1 503 Service Unavailable"..., 16320, 0, NULL, NULL) = 55

then:

10:51:13.844586 sendto(61292, "HTTP/1.1 503 Service Unavailable"..., 112, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 112
10:51:13.844617 sendto(62213, "HTTP/1.1 503 Service Unavailable"..., 112, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 112
10:51:13.844646 sendto(62685, "HTTP/1.1 503 Service Unavailable"..., 112, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 112
10:51:13.844672 sendto(65490, "HTTP/1.1 503 Service Unavailable"..., 112, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 112

So this is why I was asking.

> We receive transactional requests, ad exchanges sending us requests.

OK so such services generally do not retry.

> Also real browsers connecting to us when cookie syncing.

OK.

> So for the transactional traffic we want keep-alive, so the clients send
> multiple HTTP requests per connection.

Of course.

> And for the browser clients we want to close the connection after each
> request+response, so the browser clients' backend has "option forceclose",
> which would explain the short connections.

OK, makes sense.

> Currently we have "http-reuse safe" in the defaults section and "http-reuse
> never" in a tcp mode listener that forwards all :443 traffic to another set
> of haproxies that have more cores and do TLS termination. And this is to
> not mess up the X-Forward-For headers.

There is no http-reuse in TCP mode, you probably even get a warning.

> I will try "http-reuse always" in the defaults, but not in the tcp mode
> listener as we rely on X-Forward-For.

It should have no effect other than emitting a warning for your TCP-mode
listener. Additionally, reuse is per request, so your XFF header will remain
valid since each request will emit its own XFF header. Reuse is only
about reusing a kept-alive server-side connection.
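
To illustrate (a minimal sketch; the backend name and server address are just
examples):

```
backend bk_app
    # the server-side connection may be shared across requests, but
    # every forwarded request still gets its own X-Forwarded-For header
    option forwardfor
    http-reuse always
    server s1 192.168.0.10:8080
```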

> Even if I get better performance, it still wouldn't answer why the HAProxy
> CPU usage would increase with the same config in v2.0 compared to v1.7.

That's why I was asking whether or not the 503s can induce client
retries.

> Assuming "http-reuse always" might help performance in 2.0, it's not fair
> to compare a better-tuned v2.0 against a less-tuned v1.7.

That's not my goal. I want to make sure we're not accumulating lots of
unused server-side connections in the server pools, which could in turn
make the servers sick and deliver 503s. With reuse safe this can definitely
happen; with reuse always it will not. In fact I'm really interested in
knowing whether you still receive lots of 503s like this, and whether you
have that many concurrent connections. In your trace I'm seeing file
descriptors approximately as high as 84000, and if for any reason this is not normal
it could explain a difference. We could even imagine that there are
connect retries 

Send http 413 response

2019-07-24 Thread Joao Morais


Hello list. I'm trying to send an HTTP 413 to the user based on the 
hdr(Content-Length) value. What I've tried so far:

1. Create an http413 backend with only `errorfile 400` + `http-request deny
deny_status 400`. In the frontend, configure a `use_backend http413 if ` rule.
This is my current approach, but it wastes some time in the frontend for every
single request of every single backend - we have about 1000 backends and only
about 10% need to check Content-Length - each with a distinct limit, btw.
(A minimal sketch of this approach follows the list.)

2. Use the `errorfile 400` approach in the same backend that does the load
balancing. This doesn't sound good because I'm overwriting an internal response
code and its payload. Besides, what if I need another 2, 3 or 10 HTTP response
codes?

3. Use some creativity, e.g. `errorfile 413` + `deny_status 413`, or a
`use_backend` inside another backend; what else? The latter doesn't make sense,
and it's a pity the former isn't supported.
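
Here is a minimal sketch of approach 1 as I use it today (the ACL name, the
size limit and the errorfile path are examples, not my real values):

```
backend http413
    mode http
    # the file actually contains a raw 413 response
    errorfile 400 /etc/haproxy/errors/413.http
    http-request deny deny_status 400

frontend fe_main
    mode http
    acl body_too_large req.hdr_val(content-length) gt 1048576
    use_backend http413 if body_too_large
```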

Is there another way to deny an HTTP request with a custom status and HTML
content that I'm missing? Thanks!

~jm




Re: Upgrade from 1.7 to 2.0 = increased CPU usage

2019-07-24 Thread Elias Abacioglu
Hi Willy,

This would explain the 503s
```
  # change a 503 response into a 204 (a friendly decline).
  errorfile 503 /etc/haproxy/errors/204.http

  acl is_disable path_beg /getuid/rogue-ad-exchange
  # http-request deny defaults to 403, change it to a 503,
  # which is a masked 204 since haproxy doesn't have a 204 errorfile.
  http-request deny deny_status 503 if is_disable
```
also
```
backend robotstxt
  errorfile 503 /etc/haproxy/errors/200.robots.http
backend crossdomainxml
  errorfile 503 /etc/haproxy/errors/200.crossdomain.http
backend emptygif
  errorfile 503 /etc/haproxy/errors/200.emptygif.http
```
Basically I use 503 if I want to block a sender in a friendly way (i.e.
making them believe we just declined the transaction) and to host three tiny
files: robots.txt, crossdomain.xml and empty.gif.
It felt excessive to set up redundant web servers for a total of 703 bytes of
files, and it also felt wasteful to serve them from the Java backend. So I
cheated with haproxy's errorfiles.
So I don't think the 503s cause retries for our clients; it's just me
abusing haproxy.
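
For reference, each of those errorfiles is just a complete raw HTTP response
stored in a file; the robots one looks roughly like this (headers and body here
are only an example):

```
HTTP/1.0 200 OK
Cache-Control: no-cache
Connection: close
Content-Type: text/plain

User-agent: *
Disallow: /
```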

We receive transactional requests, ad exchanges sending us requests.
Also real browsers connecting to us when cookie syncing.
So for the transactional traffic we want keep-alive, so the clients send
multiple HTTP requests per connection.
And for the browser clients we want to close the connection after each
request+response, so the browser clients' backend has "option forceclose",
which would explain the short connections.
Currently we have "http-reuse safe" in the defaults section and "http-reuse
never" in a tcp mode listener that forwards all :443 traffic to another set
of haproxies that has more cores and does TLS termination. And this is to
not mess upp the X-Forward-For headers.

I will try "http-reuse always" in the defaults, but not in the tcp mode
listener as we rely on X-Forward-For.
Even if I get better performance, it still wouldn't answer why the HAProxy
CPU usage would increase with the same config in v2.0 compared to v1.7.
Assuming "http-reuse always" might help performance in 2.0, it's not fair to
compare a better-tuned v2.0 against a less-tuned v1.7.

Thanks
Elias


On Wed, Jul 24, 2019 at 8:07 PM Willy Tarreau  wrote:

> Hi Elias,
>
> On Wed, Jul 24, 2019 at 11:01:22AM +0200, Elias Abacioglu wrote:
> > Hi Lukas,
> >
> > 2.0.3 still has the same issue, after 1-3 minutes it goes to using 100% of
> > its available cores.
> > I've created a new strace file. Will send it to you and Willy.
>
> Thanks for testing. I've looked at your trace. I'm not seeing any abnormal
> behaviour there. However I'm seeing lots of 503 responses returned by the
> server. Could it be that your client retries on 503, leading to an increase
> of the load? It could also possibly explain why this happens after some
> time (i.e. if the servers start to fail after some time).
>
> Also I'm seeing that you have a lot of short connections. Maybe you're
> accumulating a large number of idle connections to the backend servers.
> Could you please try to add "http-reuse always" to your backend(s) to see
> if that improves the situation?
>
> Thanks,
> Willy
>


2.0.3 High CPU Usage

2019-07-24 Thread ngaugler
Hello,





I am currently running Haproxy 1.6.14-1ppa1~xenial-66af4a1 2018/01/06. There
are many features implemented in 1.8, 1.9 and 2.0 that would benefit my
deployments.  I tested 2.0.3-1ppa1~xenial last night but unfortunately found
it to be using excessive amounts of CPU and had to revert.  For this
implementation, I have two separate use cases in haproxy: the first being
external HTTP/HTTPS load balancing to a cluster from external clients, the
second being internal HTTP load balancing between two different applications
(for simplicity's sake we can call them front and back).  The excessive CPU was
noticed on the second implementation, HTTP between the front and back
applications.  I previously leveraged nbproc and cpu-map to isolate the use
cases, but in 2.0 moved to nbthread (default) and cpu-map (auto) to isolate
them (see the sketch below).  The CPU usage was so excessive that I had to move
the second implementation to two cores to avoid using 100% of the processor,
and still I was getting timeouts.  It took some time to rewrite the config
files from 1.6 to 2.0, but I was able to get them all configured properly and
leveraged top and mpstat to ensure threads and use cases were on the proper
cores.

Because of the problems with use case #2 I did not even get a chance to
evaluate use case #1, but again, I use cpu-map and 'process' to isolate these
use cases as much as possible.  Upon reverting back to 1.6 (install and
configs), everything worked as expected.
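
For clarity, the isolation change was essentially the following (a simplified
sketch; the 1.6 process/CPU numbers are illustrative, not my exact values):

```
# 1.6 setup: one process per use case, pinned with cpu-map
global
    nbproc 2
    cpu-map 1 0-19
    cpu-map 2 20-39

# 2.0 setup: a single process; nbthread left at its default
# (the number of available CPUs), threads pinned with cpu-map auto
global
    cpu-map auto:1/1-40 0-39
```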







Here is the CPU usage on 1.6 from mpstat -P ALL 5:

08:33:02 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:33:07 PM    0    7.48    0.00   16.63    0.00    0.00    0.00    0.00    0.00    0.00   75.88

Here is the CPU usage on 2.0.3 when using one thread:

08:29:35 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:29:40 PM   39   35.28    0.00   55.24    0.00    0.00    0.00    0.00    0.00    0.00    9.48

Here is the CPU usage on 2.0.3 when using two threads (the front application
still experienced timeouts to the back application even without 100% cpu
utilization on the cores):

08:30:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:30:53 PM    0   22.93    0.00   19.75    0.00    0.00    0.00    0.00    0.00    0.00   57.32
08:30:53 PM   39   21.60    0.00   25.10    0.00    0.00    0.00    0.00    0.00    0.00   53.29
Also, note, our front generally keeps connections open to our back for an
extended period of time as it pools them internally, so many requests are sent
over each connection via HTTP/1.1 keep-alive.  I think we had roughly ~1000
connections established during these tests.

Some configurations that might be relevant to your analysis (there are more but
they are pretty much standard, such as user, group, stats, log, chroot, etc):
global
    cpu-map auto:1/1-40 0-39
    maxconn 50
    spread-checks 2
    server-state-file global
    server-state-base /var/lib/haproxy/

defaults
    option  dontlognull
    option  dontlog-normal
    option  redispatch
    option  tcp-smart-accept
    option  tcp-smart-connect
    timeout connect 2s
    timeout client  50s
    timeout server  50s
    timeout client-fin 1s
    timeout server-fin 1s

This part has been sanitized and I reduced the number of servers from 14 to 2.

listen back
    bind    10.0.0.251:8080    defer-accept  process 1/40
    bind    10.0.0.252:8080    defer-accept  process 1/40
    bind    10.0.0.253:8080    defer-accept  process 1/40
    bind    10.0.0.254:8080    defer-accept  process 1/40
    mode    http
    maxconn 65000
    fullconn 65000
    balance leastconn
    http-reuse safe
    source 10.0.1.100
    option httpchk GET /ping HTTP/1.0
    http-check expect string OK
    server  s1     10.0.2.1:8080   check agent-check agent-port 8009 agent-inter 250ms inter 500ms fastinter 250ms downinter 1000ms weight 100 source 10.0.1.100
    server  s2     10.0.2.2:8080   check agent-check agent-port 8009 agent-inter 250ms inter 500ms fastinter 250ms downinter 1000ms weight 100 source 10.0.1.101

To configure multiple cores, I changed the bind line to add 'process 1/1'. I
also removed 'process 1/1' from the other use case.

The OS is Ubuntu 16.04.3 LTS, the procs are 2x E5-2630, with 64GB of RAM.  The
output from haproxy -vv looked very typical between both versions: epoll,
openssl 1.0.2g (not used in this case), etc.

Please let me know if there is any additional information I can provide to
assist in isolating the cause of this issue.

Thank you!

Nick

Re: Upgrade from 1.7 to 2.0 = increased CPU usage

2019-07-24 Thread Willy Tarreau
Hi Elias,

On Wed, Jul 24, 2019 at 11:01:22AM +0200, Elias Abacioglu wrote:
> Hi Lukas,
> 
> 2.0.3 still has the same issue, after 1-3 minutes it goes to using 100% of
> its available cores.
> I've created a new strace file. Will send it to you and Willy.

Thanks for testing. I've looked at your trace. I'm not seeing any abnormal
behaviour there. However I'm seeing lots of 503 responses returned by the
server. Could it be that your client retries on 503, leading to an increase
of the load? It could also possibly explain why this happens after some
time (i.e. if the servers start to fail after some time).

Also I'm seeing that you have a lot of short connections. Maybe you're
accumulating a large number of idle connections to the backend servers.
Could you please try to add "http-reuse always" to your backend(s) to see
if that improves the situation?

Thanks,
Willy



Partnership - guest posts or sponsored content

2019-07-24 Thread Dennis P.

Hello,

I already tried to get in touch but didn't get any response... not
sure if you received this? If not interested, please just let me know.

My name is Dennis and I'm a Blog Outreach specialist. I saw your blog
on haproxy.com and decided to get in touch. I think that a lot of my
clients will be interested in regular article placements, either paid
guest posts (we provide an article) or sponsored content (you/your team
write it for us).

This could potentially bring some good regular income for you or your
business.

Would you be interested?

Kind Regards,

Dennis P., Outreach Specialist
email: den...@blogoutreach.net
BlogOutreach.net
https://blogoutreach.net/


Cannot enable a config "disabled" frontend via socket command

2019-07-24 Thread Martin van Es
Exactly this problem:
https://www.mail-archive.com/haproxy@formilux.org/msg19356.html

is still true for frontends, so I can't start a frontend in disabled mode and
later enable it via the socket.

Tested version: 1.8.19 in Debian buster.
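
A minimal way to reproduce it (the frontend name, backend and socket path are
just examples):

```
# haproxy.cfg excerpt: the frontend is declared disabled in the config
frontend test_fe
    bind :8081
    disabled
    default_backend test_be

# later, at runtime:
#   echo "enable frontend test_fe" | socat unix-connect:/var/run/haproxy.sock stdio
# the command is refused instead of enabling the frontend
```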

Best regards,
Martin





load-server-state-from-file "automatic" transfer?

2019-07-24 Thread Daniel Schneller
Hi!

I have been looking into load-server-state-from-file to prevent 500 errors being
reported after a service reload. Currently we are seeing these because the new
instance comes up and first wants to see the minimum configured number of health
checks for a backend server succeed before it hands requests to it.

From what I can tell, the state file needs to be saved manually before a service
reload so that the new process coming up can read it back. I can do that, of
course, but I was wondering what the reasoning was for not having this data
transferred to a new process in a similar fashion as file handles or
stick-tables (via peers)?
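
For reference, this is the kind of manual flow I mean (paths are examples):

```
# haproxy.cfg
global
    server-state-file /var/lib/haproxy/server-state

defaults
    load-server-state-from-file global

# before each reload, dump the state from the running process:
#   echo "show servers state" | socat unix-connect:/var/run/haproxy.sock stdio \
#       > /var/lib/haproxy/server-state
#   systemctl reload haproxy
```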

Thanks a lot!

Daniel



--
Daniel Schneller
Principal Cloud Engineer
GPG key at https://keybase.io/dschneller

CenterDevice GmbH
Rheinwerkallee 3
53227 Bonn
www.centerdevice.com
__
Geschäftsführung: Dr. Patrick Peschlow, Dr. Lukas Pustina, Michael Rosbach, 
Handelsregister-Nr.: HRB 18655, HR-Gericht: Bonn, USt-IdNr.: DE-815299431

Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche 
und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige 
Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie 
bitte sofort den Absender und löschen Sie diese E-Mail und evtl. beigefügte 
Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen evtl. beigefügter 
Dateien sowie die unbefugte Weitergabe dieser E-Mail ist nicht gestattet.

Pflichtinformationen gemäß Artikel 13 DSGVO
Im Falle des Erstkontakts sind wir gemäß Art. 12, 13 DSGVO verpflichtet, Ihnen 
folgende datenschutzrechtliche Pflichtinformationen zur Verfügung zu stellen: 
Wenn Sie uns per E-Mail kontaktieren, verarbeiten wir Ihre personenbezogenen 
Daten nur, soweit an der Verarbeitung ein berechtigtes Interesse besteht (Art. 
6 Abs. 1 lit. f DSGVO), Sie in die Datenverarbeitung eingewilligt haben (Art. 6 
Abs. 1 lit. a DSGVO), die Verarbeitung für die Anbahnung, Begründung, 
inhaltliche Ausgestaltung oder Änderung eines Rechtsverhältnisses zwischen 
Ihnen und uns erforderlich ist (Art. 6 Abs. 1 lit. b DSGVO) oder eine sonstige 
Rechtsnorm die Verarbeitung gestattet. Ihre personenbezogenen Daten verbleiben 
bei uns, bis Sie uns zur Löschung auffordern, Ihre Einwilligung zur Speicherung 
widerrufen oder der Zweck für die Datenspeicherung entfällt (z. B. nach 
abgeschlossener Bearbeitung Ihres Anliegens). Zwingende gesetzliche 
Bestimmungen – insbesondere steuer- und handelsrechtliche Aufbewahrungsfristen 
– bleiben unberührt. Sie haben jederzeit das Recht, unentgeltlich Auskunft über 
Herkunft, Empfänger und Zweck Ihrer gespeicherten personenbezogenen Daten zu 
erhalten. Ihnen steht außerdem ein Recht auf Widerspruch, auf 
Datenübertragbarkeit und ein Beschwerderecht bei der zuständigen 
Aufsichtsbehörde zu. Ferner können Sie die Berichtigung, die Löschung und unter 
bestimmten Umständen die Einschränkung der Verarbeitung Ihrer personenbezogenen 
Daten verlangen. Details entnehmen Sie unserer Datenschutzerklärung 
(https://www.centerdevice.de/datenschutz/). Unseren Datenschutzbeauftragten 
erreichen Sie per E-Mail unter erdm...@sicdata.de.




signature.asc
Description: Message signed with OpenPGP


Subscribe

2019-07-24 Thread haproxy.laurent.petit
Subscribe



Re: Upgrade from 1.7 to 2.0 = increased CPU usage

2019-07-24 Thread Elias Abacioglu
Hi Lukas,

2.0.3 still has the same issue, after 1-3 minutes it goes to using 100% of
its available cores.
I've created a new strace file. Will send it to you and Willy.

Thanks,
Elias

On Tue, Jul 23, 2019 at 8:31 PM Lukas Tribus  wrote:

> Hello Elias,
>
>
> could you try 2.0.3 please?
>
>
> It was just released today and fixes a CPU hogging issue.
>
>
> cheers,
> lukas
>


Re: FreeBSD CI builds fail

2019-07-24 Thread Willy Tarreau
On Wed, Jul 24, 2019 at 10:01:33AM +0200, Tim Düsterhus wrote:
> On 24.07.19 at 05:55, Willy Tarreau wrote:
> > I also noticed the build failure but couldn't find any link to the build
> > history to figure when it started to fail. How did you figure that the
> > commit above was the first one ?
> 
> While I did it as Ilya did by scrolling through GitHub's commit list,

That was the least natural way for me to do it. Thanks Ilya for the
screenshot by the way. I clicked on the red cross, then the FreeBSD
link reporting the failure, and searched the history there but couldn't
find it.

> there is also:
> 
> Travis: https://travis-ci.com/haproxy/haproxy/builds
> Cirrus: https://cirrus-ci.com/github/haproxy/haproxy

Ah yes, this one is more useful, that's what I was looking for. I just
cannot figure out how to reach it when I'm on the build status page :-/

> Keep in mind for both that only the current head after a push is being
> built, so larger pushes might hide issues to CI.

Of course! But the goal is not to build every single commit either, but
to detect early that something went wrong instead of discovering it after
a version is released, as we used to in the past.

> In this specific case
> the offending patch was pushed together with 7764a57d3292b6b4f1e488b
> ("BUG/MEDIUM: threads: cpu-map designating a single") and only the
> latter was tested.

Yep!

> > Ideally we'd need a level of failure in CI builds. Some should be just of
> > level "info" and not cause a build error because we'd know they are likely
> > to fail but are still interested in the logs. But I don't think we can do
> > this.
> > 
> 
> I'm not sure this is possible either, but I also don't think it's a good
> idea, because then you get used to this kind of issue and ignore it. For
> example this one would probably have been written off as "ah, it's just
> flaky" instead of actually investigating what's wrong:
> https://github.com/haproxy/haproxy/issues/118

It's true. But what is also true is that the tests are not meant to be
run in the CI build environment but on developers' machines first. Being
able to run in the CI env is a bonus. As a side effect of some technical
constraints imposed by such environments (slow VMs with flaky timings,
hosts enforcing at least a little bit of security, etc) we do expect that
some tests will randomly fail. These ones could be tagged as such and
just report a counter of failures among the more or less expected ones.
When you're used to seeing 4 to 6 tests usually fail and you suddenly
find 13 that have failed, you can be interested in having a look there,
even if it's possibly just to start it again to confirm. And these ones
should not fail at all in more controlled environments.

There's nothing really problematic here in the end, this just constantly
reminds us that not all tests can be automated.

By the way maybe we could have some form of exclusion for tags instead
of deciding that a test only belongs to one type. Because the reality
is that we do *not* want to run certain tests. The most common ones we
don't want to run locally are "slow" and "bug", which are already
exclusive to each other. But by tagging tests with multiple labels we
could then decide to exclude some labels during the build. And in this
case we could tag some tests as "flaky-on-cirrus", "flaky-on-travis",
"flaky-in-vm", "flaky-in-container", "flaky-firewall" etc and ignore
them in such environments.
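
Just to illustrate the idea (purely hypothetical syntax, nothing like this
exists today):

```
# in a .vtc file, several labels instead of a single type:
#REGTEST_TYPE=slow,flaky-in-vm,flaky-on-cirrus

# and on the CI side, a hypothetical exclusion list:
#   make reg-tests REGTESTS_EXCLUDE_TYPES=flaky-in-vm,flaky-on-cirrus
```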

Cheers,
Willy



Re: FreeBSD CI builds fail

2019-07-24 Thread Tim Düsterhus
Willy,

On 24.07.19 at 05:55, Willy Tarreau wrote:
> I also noticed the build failure but couldn't find any link to the build
> history to figure when it started to fail. How did you figure that the
> commit above was the first one ?

While I did it as Ilya did by scrolling through GitHub's commit list,
there is also:

Travis: https://travis-ci.com/haproxy/haproxy/builds
Cirrus: https://cirrus-ci.com/github/haproxy/haproxy

Keep in mind for both that only the current head after a push is being
built, so larger pushes might hide issues from CI. In this specific case
the offending patch was pushed together with 7764a57d3292b6b4f1e488b
("BUG/MEDIUM: threads: cpu-map designating a single") and only the
latter was tested.

>> This one fails because there's a L4 timeout, I can probably update the regex to
>> take that into account, the interesting part is the failure and the step at
>> which it fails, but for now we expect a connection failure and not a timeout.
> 
> There's always the possibility (especially in CI environments) that some
> rules are in place on the system to prevent connections to unexpected ports.
> 
> Ideally we'd need a level of failure in CI builds. Some should be just of
> level "info" and not cause a build error because we'd know they are likely
> to fail but are still interested in the logs. But I don't think we can do
> this.
> 

I'm not sure this is possible either, but I also don't think it's a good
idea, because then you get used to this kind of issue and ignore it. For
example this one would probably have been written off as "ah, it's just
flaky" instead of actually investigating what's wrong:
https://github.com/haproxy/haproxy/issues/118

Best regards
Tim Düsterhus



haproxy=2.0.3: ereq counter grow in tcp-mode since haproxy=2.0

2019-07-24 Thread Максим Куприянов
Hi!

I've noticed that since moving from 1.9.8 to the 2.0 branch of haproxy, the ereq
counter of frontend tcp-mode sections began to grow. I had zeroes in that
counter before haproxy 2.0; now the number of "error requests" is much
higher.
Example:
listen sample.service:1234
  bind ipv6@xxx:yyy
  mode tcp
  balance leastconn
  timeout server 1h
  timeout client 1h
  option  tcp-check
  default-server weight 1 inter 2s rise 3
  server server1 server1:1234 weight 100 check
  server server2 server2:1234 weight 100 check

"show errors" shows nothing:
Total events captured on [24/Jul/2019:09:56:12.544] : 0

And there are no errors in my log file either. But look at the error counters
from the output of 'show stat':
$ echo "show stat sample.service:1234 7 -1 typed" | sudo socat unix-connect:/var/run/haproxy.sock stdio | egrep -v :0$
F.194.0.0.pxname.1:KNS:str:sample.service:1234
F.194.0.1.svname.1:KNS:str:FRONTEND
F.194.0.5.smax.1:MMP:u32:2
F.194.0.6.slim.1:CLP:u32:4096
F.194.0.7.stot.1:MCP:u64:37
F.194.0.12.ereq.1:MCP:u64:37
F.194.0.17.status.1:SGP:str:OPEN
F.194.0.26.pid.1:KGP:u32:1
F.194.0.27.iid.1:KGS:u32:194
F.194.0.35.rate_max.1:MMP:u32:1
F.194.0.75.mode.1:CGS:str:tcp
F.194.0.78.conn_rate_max.1:MMP:u32:1
F.194.0.79.conn_tot.1:MCP:u64:37
S.194.1.0.pxname.1:KNS:str:sample.service:1234
S.194.1.1.svname.1:KNS:str:server1:1234
S.194.1.17.status.1:SGP:str:UP
S.194.1.18.weight.1:MaP:u32:100
S.194.1.19.act.1:SGP:u32:1
S.194.1.23.lastchg.1:MAP:u32:184
S.194.1.26.pid.1:KGP:u32:1
S.194.1.27.iid.1:KGS:u32:194
S.194.1.28.sid.1:KGS:u32:1
S.194.1.32.type.1:CGS:u32:2
S.194.1.36.check_status.1:MOP:str:L4OK
S.194.1.55.lastsess.1:MAP:s32:-1
S.194.1.56.last_chk.1:MOP:str:
S.194.1.65.check_desc.1:MOP:str:Layer4 check passed
S.194.1.67.check_rise.1:CGS:u32:3
S.194.1.68.check_fall.1:CGS:u32:3
S.194.1.69.check_health.1:CGS:u32:5
S.194.1.73.addr.1:CGS:str:[zzz]:yyy
S.194.1.75.mode.1:CGS:str:tcp
S.194.2.0.pxname.1:KNS:str:sample.service:1234
S.194.2.1.svname.1:KNS:str:server2:1234
S.194.2.17.status.1:SGP:str:UP
S.194.2.18.weight.1:MaP:u32:1
S.194.2.19.act.1:SGP:u32:1
S.194.2.23.lastchg.1:MAP:u32:184
S.194.2.26.pid.1:KGP:u32:1
S.194.2.27.iid.1:KGS:u32:194
S.194.2.28.sid.1:KGS:u32:2
S.194.2.32.type.1:CGS:u32:2
S.194.2.36.check_status.1:MOP:str:L4OK
S.194.2.38.check_duration.1:MDP:u64:3
S.194.2.55.lastsess.1:MAP:s32:-1
S.194.2.56.last_chk.1:MOP:str:
S.194.2.65.check_desc.1:MOP:str:Layer4 check passed
S.194.2.67.check_rise.1:CGS:u32:3
S.194.2.68.check_fall.1:CGS:u32:3
S.194.2.69.check_health.1:CGS:u32:5
S.194.2.73.addr.1:CGS:str:[xxx]:yyy
S.194.2.75.mode.1:CGS:str:tcp
B.194.0.0.pxname.1:KNS:str:sample.service:1234
B.194.0.1.svname.1:KNS:str:BACKEND
B.194.0.5.smax.1:MMP:u32:1
B.194.0.6.slim.1:CLP:u32:410
B.194.0.7.stot.1:MCP:u64:37
B.194.0.17.status.1:SGP:str:UP
B.194.0.18.weight.1:MaP:u32:101
B.194.0.19.act.1:MGP:u32:2
B.194.0.23.lastchg.1:MAP:u32:184
B.194.0.26.pid.1:KGP:u32:1
B.194.0.27.iid.1:KGS:u32:194
B.194.0.32.type.1:CGS:u32:1
B.194.0.35.rate_max.1:MGP:u32:1
B.194.0.49.cli_abrt.1:MCP:u64:37
B.194.0.55.lastsess.1:MAP:s32:-1
B.194.0.75.mode.1:CGS:str:tcp
B.194.0.76.algo.1:CGS:str:leastconn

How can I find out the real reason for these errors?
Could it happen if a client uses RST instead of a regular FIN/ACK sequence
to close the session?
Example of such behavior (tcp-connect probe from keepalived to haproxy):
10:42:48.086780 IP6 client.43930 > haproxy.12346: Flags [S], seq 4136298350, win 28800, options [mss 1440,nop,nop,sackOK,nop,wscale 7], length 0
10:42:48.086837 IP6 haproxy.12346 > client.43930: Flags [S.], seq 2234519198, ack 4136298351, win 26520, options [mss 8840,nop,nop,sackOK,nop,wscale 11], length 0
10:42:48.087177 IP6 client.43930 > haproxy.12346: Flags [.], ack 1, win 225, length 0
10:42:48.087181 IP6 client.43930 > haproxy.12346: Flags [R.], seq 1, ack 1, win 225, length 0

$ haproxy -vvv
HA-Proxy version 2.0.3-1 2019/07/24 - https://haproxy.org/
Build options :
  TARGET  = linux-glibc
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -O2 -fPIE -fstack-protector-strong -Wformat
-Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
-fno-strict-aliasing -Wdeclaration-after-statement -fwrapv
-Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
-Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered
-Wno-missing-field-initializers -Wtype-limits
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_REGPARM=1 USE_GETADDRINFO=1
USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_TFO=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER -PCRE
-PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED
+REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE
+LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4
-MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS
-51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support 

Haproxy reload and maps

2019-07-24 Thread Sachin Shetty
Hi,

We are using maps extensively in our architecture to map host headers to
backends. The maps are seeded dynamically by a Lua handler that calls an
external service as requests arrive; there are no pre-seeded values in the map,
and the physical map file is empty.
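
The lookup itself is the usual host-to-backend mapping, roughly like this (the
map path and the fallback backend name are placeholders):

```
frontend fe_main
    mode http
    # route on the Host header through the (initially empty) map file
    use_backend %[req.hdr(host),lower,map(/etc/haproxy/maps/hosts.map,be_default)]
```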

On an haproxy reload at peak traffic the maps are emptied, which I guess is
expected. But this causes a stampede to the external service, which causes
some failures.

Is there a way to prevent emptying of the map when we do an haproxy reload?

Thanks
Sachin


Re: FreeBSD CI builds fail

2019-07-24 Thread Илья Шипицин
On Wed, 24 Jul 2019 at 08:55, Willy Tarreau :

> Hi guys,
>
> On Tue, Jul 23, 2019 at 08:37:37PM +0200, Jerome Magnin wrote:
> > On Tue, Jul 23, 2019 at 07:09:57PM +0200, Tim Düsterhus wrote:
> > > Jérôme,
> > > Ilya,
> > >
> > > I noticed that FreeBSD CI fails since
> > > https://github.com/haproxy/haproxy/commit/885f64fb6da0a349dd3182d21d337b528225c517 .
> > >
> > >
> > > One example is here: https://github.com/haproxy/haproxy/runs/169980019
>
> I also noticed the build failure but couldn't find any link to the build
> history to figure when it started to fail. How did you figure that the
> commit above was the first one ?
>


[image: Screenshot from 2019-07-24 11-43-30.png]


>
> > This one fails because there's a L4 timeout, I can probably update the regex to
> > take that into account, the interesting part is the failure and the step at
> > which it fails, but for now we expect a connection failure and not a timeout.
>
> There's always the possibility (especially in CI environments) that some
> rules are in place on the system to prevent connections to unexpected
> ports.
>
> Ideally we'd need a level of failure in CI builds. Some should be just of
> level "info" and not cause a build error because we'd know they are likely
> to fail but are still interested in the logs. But I don't think we can do
> this.
>
> Willy
>