Re: 1.9 external health checks fail suddenly
Hi Lukas,

On Tue, Jul 09, 2019 at 03:59:04PM +0200, Lukas Tribus wrote:
> Hello Veiko,
>
> On Tue, 9 Jul 2019 at 15:40, Veiko Kukk wrote:
> >
> > On 2019-07-08 16:06, Lukas Tribus wrote:
> > > The bug you may be affected by is:
> > > https://github.com/haproxy/haproxy/issues/141
> > >
> > > Can you check what happens with:
> > > nbthread 1
> >
> > I'm afraid I can't because those are production systems that won't be
> > able to service with single thread, they have relatively high ssl
> > termination load.
>
> You could probably raise nbproc at that point, if you can get away
> with some stats issues ...
>
> How are you currently working around this issue? Did you disable
> external checks? I'd assume failing checks have negative impact on
> production systems also.
>
> Willy, in issue #141 it sounds like you already have an idea how this
> could be fixed, is there a patch that we can ask Veiko to try for
> this?

I didn't have a patch but just did it. It was only compile-tested,
please verify that it works as expected on a non-sensitive machine first!

Cheers,
Willy

From 32205189f881b98cb0bbe6ed32178f2929e9a627 Mon Sep 17 00:00:00 2001
From: Willy Tarreau
Date: Tue, 9 Jul 2019 16:27:39 +0200
Subject: WIP/BUG: checks: make sure we isolate the thread doing the fork

---
 src/checks.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/checks.c b/src/checks.c
index d3920ce8d..46f93e58f 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -1977,8 +1977,10 @@ static int connect_proc_chk(struct task *t)

 	block_sigchld();

+	thread_isolate();
 	pid = fork();
 	if (pid < 0) {
+		thread_release();
 		ha_alert("Failed to fork process for external health check: %s. Aborting.\n",
 			 strerror(errno));
 		set_server_check_status(check, HCHK_STATUS_SOCKERR, strerror(errno));
@@ -2015,6 +2017,7 @@ static int connect_proc_chk(struct task *t)
 	}

 	/* Parent */
+	thread_release();
 	if (check->result == CHK_RES_UNKNOWN) {
 		if (pid_list_add(pid, t) != NULL) {
 			t->expire = tick_add(now_ms, MS_TO_TICKS(check->inter));
-- 
2.20.1
Re: 1.9 external health checks fail suddenly
Hello Veiko,

On Tue, 9 Jul 2019 at 15:40, Veiko Kukk wrote:
>
> On 2019-07-08 16:06, Lukas Tribus wrote:
> > The bug you may be affected by is:
> > https://github.com/haproxy/haproxy/issues/141
> >
> > Can you check what happens with:
> > nbthread 1
>
> I'm afraid I can't because those are production systems that won't be
> able to service with single thread, they have relatively high ssl
> termination load.

You could probably raise nbproc at that point, if you can get away with
some stats issues ...

How are you currently working around this issue? Did you disable external
checks? I'd assume failing checks have negative impact on production
systems also.

Willy, in issue #141 it sounds like you already have an idea how this
could be fixed, is there a patch that we can ask Veiko to try for this?

cheers,
lukas
Re: 1.9 external health checks fail suddenly
On 2019-07-08 16:06, Lukas Tribus wrote:
> The bug you may be affected by is:
> https://github.com/haproxy/haproxy/issues/141
>
> Can you check what happens with:
> nbthread 1

I'm afraid I can't because those are production systems that won't be
able to service with single thread, they have relatively high ssl
termination load.

Veiko
Re[2]: The case for changing the documentation syntax
It sounds like reStructuredText and Asciidoc are the top choices. They
both look capable:

http://hyperpolyglot.org/lightweight-markup

I can, as a next step, post this as an Issue on the Github project and it
can be triaged and tracked. For something like this, it might even make
sense to create a new branch so that multiple people can work on it. In
that case, splitting the documentation into multiple files would be
helpful too. If approved, an empty file for each section of the
documentation could be created in order to have the skeleton of the
project. Having the documentation split into multiple files may make
maintaining the documentation easier in the future too (i.e. someone
could change one section without conflicting with a person making a
change in another section).

How have collaborative efforts like this been done in the past? How would
multiple people be able to commit changes to this branch? Other thoughts?

-- Original Message --
From: "Pavlos Parissis"
To: "Nick Ramirez"
Sent: 7/3/2019 10:44:11 AM
Subject: Re: The case for changing the documentation syntax

> On Monday, 1 July 2019 5:01:33 PM CEST Nick Ramirez wrote:
> > Hello all,
> >
> > [...snip...]
> >
> > The solution I am proposing:
> >
> > Rather than using a home-grown, difficult to parse,
> > not-consistently-used grammar, we should use a standard. We should
> > use reStructuredText: http://docutils.sourceforge.net/rst.html
> >
> > The reStructuredText syntax gives us the following benefits:
> >
> > * It is well documented
> > * Tools exist to parse this and convert it to other formats (such as
> >   HTML)
> > * Tools exist that will "error check" the document to ensure that the
> >   correct syntax is used throughout configuration.txt (which would
> >   become configuration.rst)
> > * Tools such as Jekyll can easily parse reStructuredText and build
> >   sophisticated, beautiful webpages that feature search functionality,
> >   table-of-contents, images, graphs, links, etc. We could really start
> >   to make the documentation shine!
> > * We won't have to worry about updating special tools because
> >   reStructuredText syntax will allow us to reliably parse it forever
> > * reStructuredText is still easily human-readable using a terminal,
> >   plain-text editor, etc.
> >
> > I and others are fully willing to make the conversion to
> > reStructuredText, too. What do you all think?
>
> +1 from me. asciidoctor is something you should have a look at and
> consider as well. I know that people don't like markdown, but it is
> very simple to use and that is, sometimes, more important than
> standards and etc.
>
> My cents,
> Pavlos
Re: DOC: Suggest to replace the netstat commands
On 2019-07-09 10:12, Willy Tarreau wrote:
> On Tue, Jul 09, 2019 at 10:09:36AM +0200, Klaus Foerster wrote:
> > It might be a good idea to show the netstat and the ss command.
> >
> > netstat is for example no more installed by default on ubuntu systems,
> > whereas ss is.
> > Of course netstat can be installed without issues, but it's not there by
> > default.
>
> That's sad, though it's understandable given that ubuntu is not exactly
> made to be primarily used from the command line for most of their users,
> so they possibly don't care about end-user's experience in production
> environments where people like to use the same commands on all of their
> systems. But indeed, indicating what command to run instead of netstat
> on Linux (at least as a recommended lower cost solution) would be nice.

Hello,

I agree with Willy. As a Linux user (Debian), I only looked at this from
the point of view of my own Linux experience. If the 'ss' tool is not
present on *BSD, for example, we have to keep the doc as compatible as
possible.

So forget my suggestion, sorry for taking your time.

Regards,
-- 
[Alain Belkadi / LinuxBeach]
Re: prometheus service kills ssl handshake
On 08.07.2019 at 12:37, Aleksandar Lazic wrote:
> Hi Christopher.
>
> On 08.07.2019 at 10:30, Christopher Faulet wrote:
>> On 06/07/2019 at 23:02, Aleksandar Lazic wrote:
>>> Hi.
>>>
>>> I use HAP 2.0.1 with haproxy service with my image.
>>> After some time (~several hours) the ssl handshake stops working for the
>>> https frontend which offers the prom service.
>>>
>
> [snip]
>
>>
>> Hi Aleks,
>>
>> Could you check with the latest 2.0 snapshot ? An issue about Prometheus was
>> fixed (#151 on GitHub). And some others about connections.
>>
>
> Okay I created the image with ss-20190706 .
>
> https://gitlab.com/aleks001/haproxy20-centos/commit/212ed6f4099dd92c72b426726afdf04022065798

After ~20 hours of running the Prometheus scraper with ss-20190706 the ssl
handshake errors are gone.

From my point of view the Prometheus exporter works now.

Regards
Aleks
Re: DOC: Suggest to replace the netstat commands
On Tue, Jul 09, 2019 at 10:09:36AM +0200, Klaus Foerster wrote:
> It might be a good idea to show the netstat and the ss command.
>
> netstat is for example no more installed by default on ubuntu systems,
> whereas ss is.
> Of course netstat can be installed without issues, but it's not there by
> default.

That's sad, though it's understandable given that ubuntu is not exactly
made to be primarily used from the command line for most of their users,
so they possibly don't care about end-user's experience in production
environments where people like to use the same commands on all of their
systems. But indeed, indicating what command to run instead of netstat
on Linux (at least as a recommended lower cost solution) would be nice.

Willy
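One way to picture the "show both commands" idea discussed above is a small Python sketch that prefers ss on Linux and falls back to netstat elsewhere. The function name and the exact flags are assumptions made for illustration: "ss -ltn" lists listening TCP sockets numerically, while "netstat -an" is the broadly portable form and lists all sockets, so the two are not strictly equivalent.

```python
import shutil
import sys


def listening_sockets_command(which=shutil.which, platform=sys.platform):
    """Return an argv list for inspecting sockets, or None if no tool is found.

    Hypothetical helper: `which` and `platform` are injectable so the choice
    can be exercised without depending on the local machine.
    """
    # On Linux, ss talks to the kernel over netlink and is cheaper than netstat.
    if platform.startswith("linux") and which("ss"):
        return ["ss", "-ltn"]
    # netstat remains the command found on nearly every system.
    if which("netstat"):
        return ["netstat", "-an"]
    return None


if __name__ == "__main__":
    print(listening_sockets_command())
```

A doc snippet would simply show both commands side by side; the fallback order here only mirrors the preference stated in the thread.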
Re: DOC: Suggest to replace the netstat commands
It might be a good idea to show both the netstat and the ss commands.

netstat is, for example, no longer installed by default on Ubuntu systems,
whereas ss is.
Of course netstat can be installed without issues, but it's not there by
default.

On 7/9/19 8:20 AM, Willy Tarreau wrote:
> On Mon, Jul 08, 2019 at 04:51:24PM +0200, Alain Belkadi wrote:
> > Hello,
> >
> > As the "netstat" command is deprecated since a long time (1), I suggest to
> > replace it with other commands like ss and ip.
>
> I disagree with this. netstat is not deprecated at all, it's superseded
> *on linux* because there we have netlink which provides a much faster and
> more complete interface than the one used by netstat. But netstat is the
> only command you'll find on about all systems and its output format is
> pretty consistent.
>
> However it might make sense to add a few lines close to the locations
> where netstat is mentioned to indicate that on Linux ss is preferred
> since it's much less resource intensive than netstat.
>
> Thanks,
> Willy
Re: CPU Spikes
Hey Willy,

On 2019-07-09 08:09, Willy Tarreau wrote:
> What's your CPU like between the peaks ? 1%, 10%, 50% ? Just to get a
> rough estimate of whether it's something reaching a critical point or if
> it's something doing its mess alone in its corner.

In between the spikes it's about 7% System, 11% User, 6% Softirq, 76%
Idle. Bandwidth is then about 500Mbit/s, mostly outbound.

What I didn't notice before, but just saw while staring at my graphs, is
that I get more incoming traffic during the CPU spikes. So, I'm doing
about 500Mbit/s, then the incoming traffic rises to about 100Mbit/s
(probably a HTTP POST), CPU spikes, total traffic drops to about
200Mbit/s, everything starts getting slow.

I had HAProxy running on physical hardware with an E5-2407 and 1Gbit NIC.
Now it is running as a VM on an E5-2650 with 10Gbit NIC, with the same
issues.

> Are you using threads ? I'm asking because I'm currently working on an
> issue which I found could cause exactly this behaviour. I'm fairly
> certain we've met it in the past without being able to attribute it to
> exactly this.

Yes, I'm using threads.

> If you're using threads, attaching gdb to the process and issuing "info
> threads" will tell us where they are. If many of them are in
> fd_update_events() or fd_may_recv(), you're likely on the one I've been
> working on.
>
> Other possibilities (due to the regularity of your observation) are :
> - timeouts (check in your conf if a 10s timeout appears somewhere,
>   maybe it triggers and is improperly caught)

I have the following timeouts in defaults:

    timeout client          60s
    timeout connect         10s
    timeout http-keep-alive 4s
    timeout http-request    15s
    timeout queue           30s
    timeout server          60s
    timeout tarpit          120s

Looking at the spikes again it's more like a 20 second up, 20 second
down. But that probably has more to do with the POST taking that long.

> - health checks (maybe you have 10s checks, or 2s checks with 4 retries
>   or I don't know what, which causes a special event to occur after 10s)

Checks are every 2s with a rise of 3 and a fall of 3.

> In any case you're clearly facing a bug, but it's always difficult to
> tell. It could be useful to issue "show activity" twice 1 second apart
> when this happens, and maybe even "show fd" and "show sess all" if you
> don't have too many connections.

Right, I will do the above steps. But, since this only happens on Mondays
we have to wait a bit ;-)

Regards,

Sander
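The comparison suggested in this exchange (does any configured timer line up with the observed period?) can be written down as a small Python sketch. The values are the ones quoted in the message; the helper names are made up for illustration, and the check delay is a simplification that ignores check timeouts and any fastinter, downinter or spread-checks settings.

```python
# Timers quoted in the message above, in seconds.
TIMEOUTS_S = {
    "client": 60,
    "connect": 10,
    "http-keep-alive": 4,
    "http-request": 15,
    "queue": 30,
    "server": 60,
    "tarpit": 120,
}
CHECK_INTER_S = 2  # checks every 2s
CHECK_FALL = 3     # consecutive failures before a server is marked down


def timers_matching(period_s, timers, tolerance_s=0):
    """Names of the timers whose value is within tolerance of the period."""
    return sorted(
        name for name, value in timers.items()
        if abs(value - period_s) <= tolerance_s
    )


def check_transition_delay(inter_s, consecutive):
    """Rough time for `consecutive` check results spaced `inter_s` apart."""
    return inter_s * consecutive


print(timers_matching(10, TIMEOUTS_S))                    # ['connect']
print(timers_matching(20, TIMEOUTS_S))                    # []
print(check_transition_delay(CHECK_INTER_S, CHECK_FALL))  # 6
```

Only "timeout connect" matches the 10 second figure, nothing matches 20 seconds, and the health checks would change state after roughly 6 seconds, so none of these timers obviously explains the pattern on its own.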
Re: DOC: Suggest to replace the netstat commands
On Mon, Jul 08, 2019 at 04:51:24PM +0200, Alain Belkadi wrote:
>
> Hello,
>
> As the "netstat" command is deprecated since a long time (1), I suggest to
> replace it with other commands like ss and ip.

I disagree with this. netstat is not deprecated at all, it's superseded
*on linux* because there we have netlink which provides a much faster and
more complete interface than the one used by netstat. But netstat is the
only command you'll find on about all systems and its output format is
pretty consistent.

However it might make sense to add a few lines close to the locations
where netstat is mentioned to indicate that on Linux ss is preferred
since it's much less resource intensive than netstat.

Thanks,
Willy
Re: DOC: Fix typo in management.txt
Hello Alain,

On Mon, Jul 08, 2019 at 02:57:45PM +0200, Alain Belkadi wrote:
>
> Hello,
>
> Another patch for a typo in management.txt

Please try to group your typo changes for the same file in a single patch
as much as possible. A good hint to keep in mind is that we should try to
avoid having two commits with the same subject so that it's possible to
uniquely designate a commit by its subject. Proceeding like this helps
figuring out what is specific to one patch and not the others, and then
whether they should be merged or not.

Thanks,
Willy
Re: Replace deprecated reqrep
On Mon, Jul 08, 2019 at 02:36:13PM +0200, Tim Düsterhus wrote:
> Artur.
>
> On 08.07.19 at 14:25, Artur wrote:
> > Hello,
> >
> > Could you please suggest how to rewrite following rules written with
> > 'reqrep' with 'http-request replace-uri' :
> >
> > frontend www
> >     reqrep ^([^\ ]*)\ /p3/js/(.*)     \1\ /p3/js-min/\2
> >
> > The idea is to rewrite something similar to "GET /p3/js/file.js
> > HTTP/1.1" with "GET /p3/js-min/file.js HTTP/1.1".
> >
>
> You basically can copy over the "second half" of your rule and adjust
> the numbers referencing the capturing groups:
>
>     http-request replace-uri /p3/js/(.*) /p3/js-min/\1
>
> Consider either adding a `^` at the start to only match URLs starting
> with /p3/ or add an `if` that ensures only URLs starting with /p3/ are
> modified.

Maybe we should add such examples in the reqrep doc ? Or maybe we should
wait for 1 or 2 other ones to complement this one ?

Willy
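To see why the `^` anchor suggested above matters, here is a rough Python model of the rewrite. It is only a sketch: `replace_uri` is a hypothetical helper, Python's `re` stands in for HAProxy's regex engine, and it assumes that when the regex matches anywhere in the URI the whole URI is replaced by the expanded replacement string. That assumption should be checked against the configuration manual for the version in use.

```python
import re


def replace_uri(uri, match_regex, replace_fmt):
    """Model of 'http-request replace-uri <match-regex> <replace-fmt>'.

    Assumption: a match anywhere in the URI causes the entire URI to be
    replaced by the expanded format; no match leaves the URI untouched.
    """
    m = re.search(match_regex, uri)
    if m is None:
        return uri
    # Expand \1, \2, ... using the groups captured by the match.
    return m.expand(replace_fmt)


REPLACEMENT = r"/p3/js-min/\1"

# The intended rewrite from the thread.
print(replace_uri("/p3/js/file.js", r"^/p3/js/(.*)", REPLACEMENT))
# Anchored: a URI that merely contains /p3/js/ is left alone.
print(replace_uri("/other/p3/js/file.js", r"^/p3/js/(.*)", REPLACEMENT))
# Unanchored: the same URI is rewritten, which is probably not wanted.
print(replace_uri("/other/p3/js/file.js", r"/p3/js/(.*)", REPLACEMENT))
```

Under this model the unanchored rule also fires on "/other/p3/js/file.js", which is the case the anchor (or an `if` condition on the path) is meant to exclude.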
Re: CPU Spikes
Hi Sander,

On Mon, Jul 08, 2019 at 02:44:44PM +0200, Sander Klein wrote:
> Hi,
>
> I'm having an issue with HAProxy causing CPU spikes with certain traffic.

We've actually fixed quite a number of issues causing this over the last
few years, though most of them are already addressed by the versions
you're running.

> We have a client who is downloading lots of URLs during the night. When the
> download starts there is not much other traffic going on and there doesn't
> seem to be any problem. But, when the morning comes, 'normal' traffic starts
> hitting HAProxy and every 10 seconds or so, HAProxy starts eating 100% of
> CPU while network traffic drops. When HAProxy stops eating CPU after 10
> seconds, network traffic rises again. When the crawler is finished
> everything returns to normal. So it looks like some kind of mix of traffic
> which causes it.

What's your CPU like between the peaks ? 1%, 10%, 50% ? Just to get a
rough estimate of whether it's something reaching a critical point or if
it's something doing its mess alone in its corner.

> I've tested it with HAProxy 1.8.20, 1.9.8 (which I am running by default)
> and 2.0.1. They all show the same behaviour. I also tried with 2 different
> kernels to see if anything happens there. With kernel 4.9 top shows HAProxy
> using 100% CPU where 50% is user and 50% is system. With kernel 4.19 I see
> 100% CPU usage with 70% user and 50% system.

In fact once something starts to loop, all calls are so short that it's
very difficult for the system to measure an accurate time spent in
user/sys, so I am not surprised that it changes with the kernel.

> I also tried with disabling H2, splicing, and some regexes I use. Even tried
> new hardware, and moved it to a VM just to see if I could find any
> difference, but none...

Are you using threads ? I'm asking because I'm currently working on an
issue which I found could cause exactly this behaviour. I'm fairly
certain we've met it in the past without being able to attribute it to
exactly this.

> Does anyone have a good idea how to troubleshoot this any further?

If you're using threads, attaching gdb to the process and issuing "info
threads" will tell us where they are. If many of them are in
fd_update_events() or fd_may_recv(), you're likely on the one I've been
working on.

Other possibilities (due to the regularity of your observation) are :
- timeouts (check in your conf if a 10s timeout appears somewhere,
  maybe it triggers and is improperly caught)
- health checks (maybe you have 10s checks, or 2s checks with 4 retries
  or I don't know what, which causes a special event to occur after 10s)

In any case you're clearly facing a bug, but it's always difficult to
tell. It could be useful to issue "show activity" twice 1 second apart
when this happens, and maybe even "show fd" and "show sess all" if you
don't have too many connections.

Thanks,
Willy
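The "issue show activity twice 1 second apart" step could be scripted roughly as follows. This is a sketch under several assumptions: the stats socket is a UNIX socket at a path you supply, it closes the connection after answering one command, and the output consists of "name: value ..." lines whose first numeric field is the one worth comparing. The function names are made up, and the real output layout may differ between HAProxy versions.

```python
import socket
import time


def run_cli_command(socket_path, command):
    """Send one command to a HAProxy stats socket and return the reply text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # peer closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")


def parse_counters(text):
    """Map each 'name: N ...' line to its first integer field (assumed layout)."""
    counters = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            counters[name.strip()] = int(fields[0])
    return counters


def counter_deltas(before, after):
    """Counters present in both samples whose value changed, with the change."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def sample_activity(socket_path, interval_s=1.0):
    """Run 'show activity' twice, interval_s apart, and return what changed."""
    first = parse_counters(run_cli_command(socket_path, "show activity"))
    time.sleep(interval_s)
    second = parse_counters(run_cli_command(socket_path, "show activity"))
    return counter_deltas(first, second)
```

Calling `sample_activity()` with the path of the local stats socket during a spike would give the per-second change of each counter; the raw output of both runs is still what should be posted to the list.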