Re: 1.9 external health checks fail suddenly

2019-07-09 Thread Willy Tarreau
Hi Lukas,

On Tue, Jul 09, 2019 at 03:59:04PM +0200, Lukas Tribus wrote:
> Hello Veiko,
> 
> 
> On Tue, 9 Jul 2019 at 15:40, Veiko Kukk wrote:
> >
> > On 2019-07-08 16:06, Lukas Tribus wrote:
> > > The bug you may be affected by is:
> > > https://github.com/haproxy/haproxy/issues/141
> > >
> > > Can you check what happens with:
> > > nbthread 1
> >
> > I'm afraid I can't, because those are production systems that won't be
> > able to serve their traffic with a single thread; they have a relatively
> > high SSL termination load.
> 
> You could probably raise nbproc at that point, if you can get away
> with some stats issues ...
> 
> How are you currently working around this issue? Did you disable
> external checks? I'd assume failing checks also have a negative impact on
> production systems.
> 
> 
> Willy, in issue #141 it sounds like you already have an idea how this
> could be fixed, is there a patch that we can ask Veiko to try for
> this?

I didn't have a patch but just did it. It was only compile-tested,
please verify that it works as expected on a non-sensitive machine
first!

Cheers,
Willy
From 32205189f881b98cb0bbe6ed32178f2929e9a627 Mon Sep 17 00:00:00 2001
From: Willy Tarreau 
Date: Tue, 9 Jul 2019 16:27:39 +0200
Subject: WIP/BUG: checks: make sure we isolate the thread doing the fork

---
 src/checks.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/checks.c b/src/checks.c
index d3920ce8d..46f93e58f 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -1977,8 +1977,10 @@ static int connect_proc_chk(struct task *t)
 
 	block_sigchld();
 
+	thread_isolate();
 	pid = fork();
 	if (pid < 0) {
+		thread_release();
 		ha_alert("Failed to fork process for external health check: %s. Aborting.\n",
 		         strerror(errno));
 		set_server_check_status(check, HCHK_STATUS_SOCKERR, strerror(errno));
@@ -2015,6 +2017,7 @@ static int connect_proc_chk(struct task *t)
 	}
 
 	/* Parent */
+	thread_release();
 	if (check->result == CHK_RES_UNKNOWN) {
 		if (pid_list_add(pid, t) != NULL) {
 			t->expire = tick_add(now_ms, MS_TO_TICKS(check->inter));
-- 
2.20.1



Re: 1.9 external health checks fail suddenly

2019-07-09 Thread Lukas Tribus
Hello Veiko,


On Tue, 9 Jul 2019 at 15:40, Veiko Kukk wrote:
>
> On 2019-07-08 16:06, Lukas Tribus wrote:
> > The bug you may be affected by is:
> > https://github.com/haproxy/haproxy/issues/141
> >
> > Can you check what happens with:
> > nbthread 1
>
> I'm afraid I can't, because those are production systems that won't be
> able to serve their traffic with a single thread; they have a relatively
> high SSL termination load.

You could probably raise nbproc at that point, if you can get away
with some stats issues ...

How are you currently working around this issue? Did you disable
external checks? I'd assume failing checks also have a negative impact on
production systems.


Willy, in issue #141 it sounds like you already have an idea how this
could be fixed, is there a patch that we can ask Veiko to try for
this?

cheers,
lukas



Re: 1.9 external health checks fail suddenly

2019-07-09 Thread Veiko Kukk

On 2019-07-08 16:06, Lukas Tribus wrote:
> The bug you may be affected by is:
> https://github.com/haproxy/haproxy/issues/141
>
> Can you check what happens with:
> nbthread 1


I'm afraid I can't, because those are production systems that won't be
able to serve their traffic with a single thread; they have a relatively
high SSL termination load.


Veiko



Re[2]: The case for changing the documentation syntax

2019-07-09 Thread Nick Ramirez
It sounds like reStructuredText and AsciiDoc are the top choices. They 
both look capable:


http://hyperpolyglot.org/lightweight-markup

I can, as a next step, post this as an Issue on the Github project and 
it can be triaged and tracked.


For something like this, it might even make sense to create a new branch 
so that multiple people can work on it. In that case, splitting the 
documentation into multiple files would be helpful too. If approved, an 
empty file for each section of the documentation could be created in 
order to have the skeleton of the project. Having the documentation 
split into multiple files may make maintaining the documentation easier 
in the future too (i.e. someone could change one section without 
conflicting with a person making a change in another section).


How have collaborative efforts like this been done in the past? How 
would multiple people be able to commit changes to this branch?


Other thoughts?


-- Original Message --
From: "Pavlos Parissis" 
To: "Nick Ramirez" 
Sent: 7/3/2019 10:44:11 AM
Subject: Re: The case for changing the documentation syntax


On Monday, 1 July 2019 5:01:33 PM CEST Nick Ramirez wrote:

 Hello all,




[...snip...]


 The solution I am proposing:

 Rather than using a home-grown, difficult-to-parse,
 inconsistently used grammar, we should use a standard. We should use
 reStructuredText: http://docutils.sourceforge.net/rst.html
 

 The reStructuredText syntax gives us the following benefits:

 * It is well documented
 * Tools exist to parse this and convert it to other formats (such as
 HTML)
 * Tools exist that will "error check" the document to ensure that the
 correct syntax is used throughout configuration.txt (which would become
 configuration.rst)
 * Tools such as Jekyll can easily parse reStructuredText and build
 sophisticated, beautiful webpages that feature search functionality,
 table-of-contents, images, graphs, links, etc. We could really start to
 make the documentation shine!
 * We won't have to worry about updating special tools because
 reStructuredText syntax will allow us to reliably parse it forever
 * reStructuredText is still easily human-readable using a terminal,
 plain-text editor, etc.

 I and others are fully willing to make the conversion to
 reStructuredText, too. What do you all think?




+1 from me. Asciidoctor is something you should have a look at and consider as
well. I know that people don't like Markdown, but it is very simple to use, and
that is sometimes more important than standards, etc.

My two cents,
Pavlos

Re: DOC: Suggest to replace the netstat commands

2019-07-09 Thread Alain Belkadi

On 2019-07-09 10:12, Willy Tarreau wrote:
> On Tue, Jul 09, 2019 at 10:09:36AM +0200, Klaus Foerster wrote:
> > It might be a good idea to show the netstat and the ss command.
> >
> > netstat is, for example, no longer installed by default on Ubuntu systems,
> > whereas ss is.
> > Of course netstat can be installed without issues, but it's not there by
> > default.
>
> That's sad, though it's understandable given that Ubuntu is not exactly
> made to be primarily used from the command line for most of its users,
> so they possibly don't care about end users' experience in production
> environments where people like to use the same commands on all of
> their systems.
>
> But indeed, indicating what command to run instead of netstat on Linux
> (at least as a recommended lower cost solution) would be nice.


Hello,

I agree with Willy. As a Linux user (Debian), I only considered this from 
the point of view of my own Linux experience.


If the 'ss' tool is not present on *BSD, for example, we have to keep the 
doc as compatible as possible.


So forget my suggestion, and sorry for taking your time.

Regards,

--
[Alain Belkadi / LinuxBeach]



Re: prometheus service kills ssl handshake

2019-07-09 Thread Aleksandar Lazic
Am 08.07.2019 um 12:37 schrieb Aleksandar Lazic:
> Hi Christopher.
> 
> Am 08.07.2019 um 10:30 schrieb Christopher Faulet:
>> Le 06/07/2019 à 23:02, Aleksandar Lazic a écrit :
>>> Hi.
>>>
>>> I use HAP 2.0.1 with haproxy service with my image.
>>> After some time (~several hours) the SSL handshake stops working for the
>>> https frontend which offers the prom service.
>>>
> 
> [snip]
> 
>>
>> Hi Aleks,
>>
>> Could you check with the latest 2.0 snapshot ? An issue about Prometheus was
>> fixed (#151 on GitHub). And some others about connections.
>>
> 
> Okay I created the image with ss-20190706 .
> 
> https://gitlab.com/aleks001/haproxy20-centos/commit/212ed6f4099dd92c72b426726afdf04022065798

After ~20 hours of running the Prometheus scraper with ss-20190706, the SSL
handshake errors are gone. From my point of view, the Prometheus exporter works now.

Regards
Aleks



Re: DOC: Suggest to replace the netstat commands

2019-07-09 Thread Willy Tarreau
On Tue, Jul 09, 2019 at 10:09:36AM +0200, Klaus Foerster wrote:
> It might be a good idea to show the netstat and the ss command.
> 
> netstat is, for example, no longer installed by default on Ubuntu systems,
> whereas ss is.
> Of course netstat can be installed without issues, but it's not there by
> default.

That's sad, though it's understandable given that Ubuntu is not exactly
made to be primarily used from the command line for most of its users,
so they possibly don't care about end users' experience in production
environments where people like to use the same commands on all of
their systems.

But indeed, indicating what command to run instead of netstat on Linux
(at least as a recommended lower cost solution) would be nice.

Willy



Re: DOC: Suggest to replace the netstat commands

2019-07-09 Thread Klaus Foerster

It might be a good idea to show the netstat and the ss command.

netstat is, for example, no longer installed by default on Ubuntu systems,
whereas ss is.
Of course netstat can be installed without issues, but it's not there by 
default.




On 7/9/19 8:20 AM, Willy Tarreau wrote:
> On Mon, Jul 08, 2019 at 04:51:24PM +0200, Alain Belkadi wrote:
> > Hello,
> >
> > As the "netstat" command has been deprecated for a long time (1), I suggest
> > replacing it with other commands like ss and ip.
>
> I disagree with this. netstat is not deprecated at all, it's superseded
> *on linux* because there we have netlink which provides a much faster and
> more complete interface than the one used by netstat. But netstat is the
> only command you'll find on just about all systems and its output format is
> pretty consistent.
>
> However it might make sense to add a few lines close to the locations where
> netstat is mentioned to indicate that on Linux ss is preferred since it's
> much less resource intensive than netstat.
>
> Thanks,
> Willy





Re: CPU Spikes

2019-07-09 Thread Sander Klein

Hey Willy,


On 2019-07-09 08:09, Willy Tarreau wrote:
> What's your CPU like between the peaks ? 1%, 10%, 50% ? Just to get a rough
> estimate of whether it's something reaching a critical point or if it's
> something doing its mess alone in its corner.


In between the spikes it's about 7% System, 11% User, 6% Softirq, 76% 
Idle. Bandwidth is then about 500Mbit/s, mostly outbound.


What I didn't notice before, but just saw while staring at my graphs, is 
that I get more incoming traffic during the CPU spikes. So, I'm doing about 
500Mbit/s, then the incoming traffic rises to about 100Mbit/s (probably 
an HTTP POST), the CPU spikes, total traffic drops to about 200Mbit/s, and 
everything starts getting slow.


I had HAProxy running on physical hardware with an E5-2407 and 1Gbit 
NIC. Now it is running as a VM on an E5-2650 with 10Gbit NIC. With the 
same issues.



> Are you using threads ? I'm asking because I'm currently working on an
> issue which I found could cause exactly this behaviour. I'm fairly certain
> we've met it in the past without being able to attribute it to exactly
> this.


Yes, I'm using threads.


> If you're using threads, attaching gdb to the process and issuing "info
> threads" will tell us where they are. If many of them are in
> fd_update_events() or fd_may_recv(), you're likely on the one I've been
> working on.
>
> Other possibilities (due to the regularity of your observation) are :
>   - timeouts (check in your conf if a 10s timeout appears somewhere,
>     maybe it triggers and is improperly caught)


I have the following timeouts in defaults:
timeout client          60s
timeout connect         10s
timeout http-keep-alive 4s
timeout http-request    15s
timeout queue           30s
timeout server          60s
timeout tarpit          120s

Looking at the spikes again, it's more like 20 seconds up, 20 seconds 
down. But that probably has more to do with the POST taking that long.



>   - health checks (maybe you have 10s checks, or 2s checks with 4
>     retries or I don't know what, which causes a special event to
>     occur after 10s)


Checks run every 2s with a rise of 3 and a fall of 3.


> In any case you're clearly facing a bug, but it's always difficult to
> tell.
>
> It could be useful to issue "show activity" twice 1 second apart when
> this happens, and maybe even "show fd" and "show sess all" if you don't
> have too many connections.


Right, I will do the above steps. But, since this only happens on 
Mondays we have to wait a bit ;-)


Regards,

Sander




Re: DOC: Suggest to replace the netstat commands

2019-07-09 Thread Willy Tarreau
On Mon, Jul 08, 2019 at 04:51:24PM +0200, Alain Belkadi wrote:
> 
> Hello,
> 
> As the "netstat" command has been deprecated for a long time (1), I suggest
> replacing it with other commands like ss and ip.

I disagree with this. netstat is not deprecated at all, it's superseded
*on linux* because there we have netlink which provides a much faster and
more complete interface than the one used by netstat. But netstat is the
only command you'll find on just about all systems and its output format is
pretty consistent.

However it might make sense to add a few lines close to the locations where
netstat is mentioned to indicate that on Linux ss is preferred since it's
much less resource intensive than netstat.

Thanks,
Willy
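
As a rough illustration of the "prefer ss on Linux, fall back to netstat
elsewhere" idea discussed above, here is a minimal Python sketch. The helper
name listing_command and the exact flags are assumptions chosen for the
example ("ss -ltn" for listening TCP sockets on Linux, "netstat -an" as a
widely available fallback); they are not taken from the HAProxy documentation.

```python
import shutil


def listing_command(which=shutil.which):
    """Return an argv list for listing sockets, or None if no tool is found.

    `which` is injectable so the choice can be checked without touching
    the real system (hypothetical helper, for illustration only).
    """
    if which("ss"):
        # ss talks to the kernel over netlink, so it is cheaper on Linux.
        return ["ss", "-ltn"]
    if which("netstat"):
        # netstat is available on most other systems.
        return ["netstat", "-an"]
    return None


if __name__ == "__main__":
    print(listing_command())
```

A doc note built on this idea would keep netstat as the portable default and
only mention ss as the cheaper option where it exists.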



Re: DOC: Fix typo in management.txt

2019-07-09 Thread Willy Tarreau
Hello Alain,

On Mon, Jul 08, 2019 at 02:57:45PM +0200, Alain Belkadi wrote:
> 
> Hello,
> 
> Another patch for a typo in management.txt

Please try to group your typo changes for the same file in a single
patch as much as possible. A good hint to keep in mind is that we
should try to avoid having two commits with the same subject so
that it's possible to uniquely designate a commit by its subject.
Proceeding like this helps figure out what is specific to one patch
and not the others, and decide whether they should be merged or not.

Thanks,
Willy



Re: Replace deprecated reqrep

2019-07-09 Thread Willy Tarreau
On Mon, Jul 08, 2019 at 02:36:13PM +0200, Tim Düsterhus wrote:
> Artur.
> 
> Am 08.07.19 um 14:25 schrieb Artur:
> > Hello,
> > 
> > Could you please suggest how to rewrite the following rule, written with
> > 'reqrep', using 'http-request replace-uri':
> > 
> > frontend www
> >  reqrep ^([^\ ]*)\ /p3/js/(.*) \1\ /p3/js-min/\2
> > 
> > The idea is to rewrite something similar to "GET /p3/js/file.js
> > HTTP/1.1" with  "GET /p3/js-min/file.js HTTP/1.1".
> > 
> 
> You basically can copy over the "second half" of your rule and adjust
> the numbers referencing the capturing groups:
> 
> http-request replace-uri /p3/js/(.*) /p3/js-min/\1
> 
> Consider either adding a `^` at the start to only match URLs starting
> with /p3/ or adding an `if` that ensures only URLs starting with /p3/ are
> modified.

Maybe we should add such examples in the reqrep doc ? Or maybe we
should wait for 1 or 2 other ones to complement this one ?

Willy
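
To see why the `^` anchor matters, the two patterns from the suggestion above
can be tried out in Python. This is only a sketch: Python's re module is not
the regex engine HAProxy uses, the helper names are made up for the example,
and it only shows which URIs each pattern matches. What HAProxy does with the
unmatched part of a URI is deliberately not modelled here.

```python
import re

# Patterns from the thread; the anchored variant adds the suggested `^`.
UNANCHORED = re.compile(r"/p3/js/(.*)")
ANCHORED = re.compile(r"^/p3/js/(.*)")
REPLACEMENT = r"/p3/js-min/\1"


def matches(pattern, uri):
    """True if the pattern matches anywhere in the URI."""
    return pattern.search(uri) is not None


def rewrite(uri):
    """Rewrite the URI with the anchored pattern; leave it alone otherwise."""
    m = ANCHORED.match(uri)
    return m.expand(REPLACEMENT) if m else uri


if __name__ == "__main__":
    for uri in ("/p3/js/file.js", "/other/p3/js/file.js"):
        print(uri, matches(UNANCHORED, uri), matches(ANCHORED, uri), rewrite(uri))
```

The unanchored pattern also matches "/other/p3/js/file.js", which is probably
not what the original reqrep rule intended, while the anchored one leaves that
URI untouched.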



Re: CPU Spikes

2019-07-09 Thread Willy Tarreau
Hi Sander,

On Mon, Jul 08, 2019 at 02:44:44PM +0200, Sander Klein wrote:
> Hi,
> 
> I'm having an issue with HAProxy causing CPU spikes with certain traffic.

We've actually fixed quite a number of issues causing this over the last
few years, though most of them are already addressed by the versions you're
running.

> We have a client who is downloading lots of URLs during the night. When the
> download starts there is not much other traffic going on and there doesn't
> seem to be any problem. But, when the morning comes, 'normal' traffic starts
> hitting HAProxy and every 10 seconds or so, HAProxy starts eating 100% of
> CPU while network traffic drops. When HAProxy stops eating CPU after 10
> seconds, network traffic rises again. When the crawler is finished
> everything returns to normal. So it looks like some kind of mix of traffic
> which causes it.

What's your CPU like between the peaks ? 1%, 10%, 50% ? Just to get a rough
estimate of whether it's something reaching a critical point or if it's
something doing its mess alone in its corner.

> I've tested it with HAProxy 1.8.20, 1.9.8 (which I am running by default)
> and 2.0.1. They all show the same behaviour. I also tried with 2 different
> kernels to see if anything happens there. With kernel 4.9 top shows HAProxy
> using 100% CPU where 50% is user and 50% is system. With kernel 4.19 I see
> 100% CPU usage with 70% user and 50% system.

In fact once something starts to loop, all calls are so short that it's very
difficult for the system to measure an accurate time spent in user/sys, so
I am not surprised that it changes with the kernel.

> I also tried with disabling H2, splicing, and some regexes I use. Even tried
> new hardware, and moved it to a VM just to see if I could find any
> difference, but none...

Are you using threads ? I'm asking because I'm currently working on an
issue which I found could cause exactly this behaviour. I'm fairly certain
we've met it in the past without being able to attribute it to exactly
this.

> Does anyone have a good idea how to troubleshoot this any further?

If you're using threads, attaching gdb to the process and issuing "info
threads" will tell us where they are. If many of them are in
fd_update_events() or fd_may_recv(), you're likely on the one I've been
working on.

Other possibilities (due to the regularity of your observation) are :
  - timeouts (check in your conf if a 10s timeout appears somewhere,
maybe it triggers and is improperly caught)
  - health checks (maybe you have 10s checks, or 2s checks with 4
retries or I don't know what, which causes a special event to
occur after 10s)

In any case you're clearly facing a bug, but it's always difficult to
tell.

It could be useful to issue "show activity" twice 1 second apart when
this happens, and maybe even "show fd" and "show sess all" if you don't
have too many connections.

Thanks,
Willy
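
For the "show activity twice, 1 second apart" step, a small script can take
both samples and print what changed in between. The following Python sketch
assumes a stats socket is configured and reachable at /var/run/haproxy.sock
(an assumption, adjust to your "stats socket" line) and that each output line
looks like "name: numbers". The exact output format differs between versions,
so the parser is deliberately generic, and the helper names are hypothetical.

```python
import re
import socket
import time

SOCKET_PATH = "/var/run/haproxy.sock"  # assumption: adjust to your setup


def query(command, path=SOCKET_PATH):
    """Send one command to the stats socket and return the full reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        s.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # the socket is closed after the reply
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")


def parse_counters(text):
    """Map each 'name: ...' line to the list of integers found after it."""
    counters = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep:
            counters[name.strip()] = [int(n) for n in re.findall(r"\d+", rest)]
    return counters


def delta(before, after):
    """Element-wise difference for counters present in both samples."""
    out = {}
    for name, new in after.items():
        old = before.get(name)
        if old is not None and len(old) == len(new):
            out[name] = [b - a for a, b in zip(old, new)]
    return out


def sample_twice(interval=1.0):
    """Take two 'show activity' samples and return what changed."""
    first = parse_counters(query("show activity"))
    time.sleep(interval)
    second = parse_counters(query("show activity"))
    return delta(first, second)
```

Calling sample_twice() during a spike and once again during a quiet period
gives two sets of deltas that can be compared side by side; the raw output is
still what should be posted to the list.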