Re: HaProxy Hang
On Wed, 7 Jun 2017, at 10:42, David King wrote:
> Just to close the loop on this: last night was the time at which we were
> expecting the next hang. All of the servers where we updated haproxy to
> the patched version did not hang. The test servers which were still
> running the older version hung as expected.
>
> Thanks so much to everyone who fixed the issue!

Same here, although as we patched everything we had no issues at all :D

Merci beaucoup!

A+
Dave
Re: HaProxy Hang
Hi David,

On Wed, Jun 07, 2017 at 09:42:58AM +0100, David King wrote:
> Just to close the loop on this: last night was the time at which we were
> expecting the next hang. All of the servers where we updated haproxy to
> the patched version did not hang. The test servers which were still
> running the older version hung as expected.
>
> Thanks so much to everyone who fixed the issue!

Feedback much appreciated, thank you! We need to issue 1.7.6 soon with this fix, but other troubling issues still under investigation have delayed it a bit.

Cheers,
Willy
Re: HaProxy Hang
Just to close the loop on this: last night was the time at which we were expecting the next hang. All of the servers where we updated haproxy to the patched version did not hang. The test servers which were still running the older version hung as expected.

Thanks so much to everyone who fixed the issue!

On 18 April 2017 at 10:45, Willy Tarreau wrote:
> Hi David,
>
> On Tue, Apr 18, 2017 at 10:33:40AM +0100, David King wrote:
> > Hi All
> >
> > Just like to confirm Willy's theory: we had the hang at exactly the
> > time specified this morning.
>
> I could recycle myself in a new church of which I would be the prophet...
> well, maybe it already exists, we have thousands of adepts after all :-)
>
> More seriously, I think it will be useful to report a bug to the FreeBSD
> project. There are quite a number of elements, possibly nothing that
> makes it obvious where the problem could be, but a number of hypotheses
> can already be ruled out, I think. It's possible that some FreeBSD devs
> will ask us to monitor a few things, capture some syscall returns, or try
> some workarounds, and this might require some dev work. So in short, the
> earlier the better if we want to be ready for the next occurrence.
>
> Cheers,
> Willy
Re: HaProxy Hang
Hi David,

On Tue, Apr 18, 2017 at 10:33:40AM +0100, David King wrote:
> Hi All
>
> Just like to confirm Willy's theory: we had the hang at exactly the time
> specified this morning.

I could recycle myself in a new church of which I would be the prophet... well, maybe it already exists, we have thousands of adepts after all :-)

More seriously, I think it will be useful to report a bug to the FreeBSD project. There are quite a number of elements, possibly nothing that makes it obvious where the problem could be, but a number of hypotheses can already be ruled out, I think. It's possible that some FreeBSD devs will ask us to monitor a few things, capture some syscall returns, or try some workarounds, and this might require some dev work. So in short, the earlier the better if we want to be ready for the next occurrence.

Cheers,
Willy
Re: HaProxy Hang
Hi All

Just like to confirm Willy's theory: we had the hang at exactly the time specified this morning. Sadly, due to a bank holiday yesterday in the UK, we didn't set up the truss and monitoring before the hang occurred.

Was the hang seen by everyone?

Thanks
Dave

On 6 April 2017 at 14:56, Mark S wrote:
> On Mon, 03 Apr 2017 12:45:57 -0400, Dave Cottlehuber wrote:
> > On Mon, 13 Mar 2017, at 13:31, David King wrote:
> > > Hi All
> > >
> > > Apologies for the delay in response, I've been out of the country
> > > for the last week.
> > >
> > > Mark, my gut feeling is that this is network-related in some way, so
> > > I thought we could compare the networking setup of our systems.
> > >
> > > You mentioned you see the hang across geo locations, so I assume
> > > there isn't layer 2 connectivity between all of the hosts? Is there
> > > any back-end connectivity between the haproxy hosts?
> >
> > Following up on this, some interesting points but nothing useful.
> >
> > - Mark & I see the hang at almost exactly the same time on the same
> >   day: 2017-02-27T14:36Z, give or take a minute either way.
> >
> > - I see the hang in 3 different regions using 2 different hosting
> >   providers, on both clustered and non-clustered services, but all on
> >   FreeBSD 11.0R amd64. There is some dependency between these systems
> >   but nothing unusual (logging backends, reverse-proxied services
> >   etc.).
> >
> > - Our servers don't have a specific workload that would allow them all
> >   to run out of some internal resource at the same time, as their
> >   reboot and patch cycles are reasonably different - typically a few
> >   days elapse between first patches and last reboots unless it's
> >   deemed high risk.
> >
> > - Our networking setup is not complex but typical FreeBSD:
> >   - LACP-bonded Gbit igb(4) NICs
> >   - CARP failover for both ipv4 & ipv6 addresses
> >   - either direct to haproxy for http & TLS traffic, or via spiped to
> >     decrypt intra-server traffic
> >   - haproxy directs traffic into jailed services
> >   - our overall load and throughput is low but consistent
> >   - pf firewall
> >   - rsyslog for logging, along with riemann and graphite for metrics
> >   - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via
> >     haproxy
> >   - haproxy 1.6.10 + libressl at the time
> >
> > As I'm not one for conspiracy theories or weird coincidences, somebody
> > port-scanning the internet with an Unexpectedly Evil Packet Combo
> > seems the most plausible explanation. I cannot find an alternative
> > that would fit the scenario of 3 different organisations with
> > geographically distributed equipment and unconnected services
> > reporting an unusual interruption on the same day at almost the same
> > time.
> >
> > Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and the latest
> > libressl and seen no recurrence, just like the last 8+ months or so
> > since first deploying haproxy on FreeBSD instead of debian & nginx.
> >
> > If the issue recurs I plan to run a small cyclic traffic capture with
> > tcpdump and wait for a repeat; see
> > https://superuser.com/questions/286062/practical-tcpdump-examples
> >
> > Let me know if I can help or clarify further.
> >
> > A+
> > Dave
>
> Hi Dave,
>
> Thanks for keeping this thread going. As for the initial report with all
> servers hanging, I too run NTP (actually OpenNTPd), and these only speak
> to in-house stratum-2 servers.
>
> As a follow-up to my initial report, I upgraded to 1.7.3 shortly
> thereafter.
>
> I've had one re-occurrence of this "hang", but this time it did not
> affect all of my servers; instead, it affected only 2 (the busier ones).
> If the theory about some timing event (leap second, counter wrapping,
> etc.) is correct, perhaps it only affects processes actually accepting
> or handling a connection in a particular state at the time.
>
> I have not yet upgraded beyond 1.7.3.
>
> Best,
> -=Mark
Re: HaProxy Hang
On Mon, 03 Apr 2017 12:45:57 -0400, Dave Cottlehuber wrote:
> On Mon, 13 Mar 2017, at 13:31, David King wrote:
> > Hi All
> >
> > Apologies for the delay in response, I've been out of the country for
> > the last week.
> >
> > Mark, my gut feeling is that this is network-related in some way, so I
> > thought we could compare the networking setup of our systems.
> >
> > You mentioned you see the hang across geo locations, so I assume there
> > isn't layer 2 connectivity between all of the hosts? Is there any
> > back-end connectivity between the haproxy hosts?
>
> Following up on this, some interesting points but nothing useful.
>
> - Mark & I see the hang at almost exactly the same time on the same day:
>   2017-02-27T14:36Z, give or take a minute either way.
>
> - I see the hang in 3 different regions using 2 different hosting
>   providers, on both clustered and non-clustered services, but all on
>   FreeBSD 11.0R amd64. There is some dependency between these systems
>   but nothing unusual (logging backends, reverse-proxied services etc.).
>
> - Our servers don't have a specific workload that would allow them all
>   to run out of some internal resource at the same time, as their reboot
>   and patch cycles are reasonably different - typically a few days
>   elapse between first patches and last reboots unless it's deemed high
>   risk.
>
> - Our networking setup is not complex but typical FreeBSD:
>   - LACP-bonded Gbit igb(4) NICs
>   - CARP failover for both ipv4 & ipv6 addresses
>   - either direct to haproxy for http & TLS traffic, or via spiped to
>     decrypt intra-server traffic
>   - haproxy directs traffic into jailed services
>   - our overall load and throughput is low but consistent
>   - pf firewall
>   - rsyslog for logging, along with riemann and graphite for metrics
>   - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via
>     haproxy
>   - haproxy 1.6.10 + libressl at the time
>
> As I'm not one for conspiracy theories or weird coincidences, somebody
> port-scanning the internet with an Unexpectedly Evil Packet Combo seems
> the most plausible explanation. I cannot find an alternative that would
> fit the scenario of 3 different organisations with geographically
> distributed equipment and unconnected services reporting an unusual
> interruption on the same day at almost the same time.
>
> Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and the latest
> libressl and seen no recurrence, just like the last 8+ months or so
> since first deploying haproxy on FreeBSD instead of debian & nginx.
>
> If the issue recurs I plan to run a small cyclic traffic capture with
> tcpdump and wait for a repeat; see
> https://superuser.com/questions/286062/practical-tcpdump-examples
>
> Let me know if I can help or clarify further.
>
> A+
> Dave

Hi Dave,

Thanks for keeping this thread going. As for the initial report with all servers hanging, I too run NTP (actually OpenNTPd), and these only speak to in-house stratum-2 servers.

As a follow-up to my initial report, I upgraded to 1.7.3 shortly thereafter.

I've had one re-occurrence of this "hang", but this time it did not affect all of my servers; instead, it affected only 2 (the busier ones). If the theory about some timing event (leap second, counter wrapping, etc.) is correct, perhaps it only affects processes actually accepting or handling a connection in a particular state at the time.

I have not yet upgraded beyond 1.7.3.

Best,
-=Mark
Re: HaProxy Hang
On Wed, Apr 05, 2017 at 10:10:49AM +0100, David King wrote:
> I'm going to keep with version 1.7.2 till then, so we should have a
> comparison

OK, as you like :-)

> If we think we may have a hang at Tue Apr 18, 9:38, is there any specific
> logging we should set up on a server at that time?

Maybe detailed truss output if it happens, to get all the arguments and a few things like that. Unfortunately, for now I don't see an easy way to reset the kqueue fd and reinitialize all events from scratch (though it's possible, it just requires quite some code and will come with some bugs).

> is it worth setting at
> least one server to have nokqueue set at that time?

Well, possibly: if you have multiple servers and all of them die at the same time, that could avoid a complete outage. And maybe nothing will happen, it was a pure guess on my part, but given that these hangs more or less match issues we had a long time ago with looping timers, I would not be surprised if it happens this way.

Willy
Re: HaProxy Hang
I'm going to keep with version 1.7.2 till then, so we should have a comparison.

If we think we may have a hang at Tue Apr 18, 9:38, is there any specific logging we should set up on a server at that time? Is it worth setting at least one server to have nokqueue set at that time?

Thanks
David

On 5 April 2017 at 07:00, Willy Tarreau wrote:
> Hi all,
>
> On Wed, Apr 05, 2017 at 01:34:20AM +0200, Lukas Tribus wrote:
> > Can we be absolutely positive that those hangs are not directly or
> > indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5,
> > for example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD
> > 11.0-p8"?
>
> I don't believe in this at all, unfortunately. The issues that were
> faced on FreeBSD in earlier versions were related to connect()
> occasionally succeeding synchronously, and haproxy did not handle this
> case cleanly (it initially used to poll then validate the connect() a
> second time, and fixing this broke the rest).
>
> > There may be multiple different symptoms of those bugs, so even if the
> > descriptions in those threads don't match your case 100%, it may still
> > be caused by the same underlying bug.
> >
> > A confirmation that those hangs are still happening in v1.7.5 would be
> > crucial.
>
> I'm pretty sure they will still happen.
>
> > The time coincidence is intriguing, but I would not spend too much
> > time with that. Collecting actual traces (like strace or its freebsd
> > equivalent) and capture dumps is more likely to achieve progress, imo.
>
> In fact I do think there's an operating system issue here (and those who
> know me also know that I'm not one who tries to hide haproxy bugs). What
> I suspect is that there's a problem when time wraps. A 1 kHz scheduler
> wraps every 49.7 days. With clocks synchronized over NTP, all of them
> wrap at exactly the same time. If the issue is there, it may happen
> again on Tue Apr 18, 9:38 (13 days from now).
>
> It could have been haproxy's time wrapping and causing the issue, so I
> modified it to add an offset and make the time wrap 5s after startup,
> and couldn't trigger the problem on a FreeBSD system, even after
> multiple attempts. And the time of the first crash reported above
> doesn't match any wrapping pattern (0x58b43950). Also, reporters
> indicated that the issue appeared after migrating to FreeBSD 11, and no
> such issue was ever reported on earlier versions.
>
> Also, Dave reported this, which is totally abnormal:
>
>   kqueue(0,0,0) = 22 (EINVAL)
>
> and the fact that the system panicked, which cannot be an haproxy issue.
>
> Another point: Dave reported a loss of network connectivity at the same
> moment when it last happened. Dave, could this be related to other nodes
> running FreeBSD as well and rebooting, or any such thing?
>
> I think that at this point we should discuss with some FreeBSD
> maintainers and see what can be done to track this problem down, even if
> it means adding some debugging code in the kqueue loop to help
> troubleshoot this, or using it differently if we're doing something
> wrong.
>
> Given that Mark indicated that reloading the process fixed the problem
> (except that he had to manually kill the previous one), one possible
> workaround might be to detect the EINVAL and try to reinitialize kqueue,
> or switch to poll() if this happens (and emit loud warnings in the
> logs).
>
> > Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish
> > y'all a good night,
>
> There's still a faint possibility of a widespread attack, but while I
> can easily imagine such devices sending a "packet of death" exploiting a
> bug in an OS, I don't believe it would make kqueue() return EINVAL in
> haproxy.
>
> Cheers,
> Willy
Re: HaProxy Hang
Hi all,

On Wed, Apr 05, 2017 at 01:34:20AM +0200, Lukas Tribus wrote:
> Can we be absolutely positive that those hangs are not directly or
> indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5,
> for example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD
> 11.0-p8"?

I don't believe in this at all, unfortunately. The issues that were faced on FreeBSD in earlier versions were related to connect() occasionally succeeding synchronously, and haproxy did not handle this case cleanly (it initially used to poll then validate the connect() a second time, and fixing this broke the rest).

> There may be multiple different symptoms of those bugs, so even if the
> descriptions in those threads don't match your case 100%, it may still
> be caused by the same underlying bug.
>
> A confirmation that those hangs are still happening in v1.7.5 would be
> crucial.

I'm pretty sure they will still happen.

> The time coincidence is intriguing, but I would not spend too much time
> with that. Collecting actual traces (like strace or its freebsd
> equivalent) and capture dumps is more likely to achieve progress, imo.

In fact I do think there's an operating system issue here (and those who know me also know that I'm not one who tries to hide haproxy bugs). What I suspect is that there's a problem when time wraps. A 1 kHz scheduler wraps every 49.7 days. With clocks synchronized over NTP, all of them wrap at exactly the same time. If the issue is there, it may happen again on Tue Apr 18, 9:38 (13 days from now).

It could have been haproxy's time wrapping and causing the issue, so I modified it to add an offset and make the time wrap 5s after startup, and couldn't trigger the problem on a FreeBSD system, even after multiple attempts. And the time of the first crash reported above doesn't match any wrapping pattern (0x58b43950). Also, reporters indicated that the issue appeared after migrating to FreeBSD 11, and no such issue was ever reported on earlier versions.

Also, Dave reported this, which is totally abnormal:

  kqueue(0,0,0) = 22 (EINVAL)

and the fact that the system panicked, which cannot be an haproxy issue.

Another point: Dave reported a loss of network connectivity at the same moment when it last happened. Dave, could this be related to other nodes running FreeBSD as well and rebooting, or any such thing?

I think that at this point we should discuss with some FreeBSD maintainers and see what can be done to track this problem down, even if it means adding some debugging code in the kqueue loop to help troubleshoot this, or using it differently if we're doing something wrong.

Given that Mark indicated that reloading the process fixed the problem (except that he had to manually kill the previous one), one possible workaround might be to detect the EINVAL and try to reinitialize kqueue, or switch to poll() if this happens (and emit loud warnings in the logs).

> Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish
> y'all a good night,

There's still a faint possibility of a widespread attack, but while I can easily imagine such devices sending a "packet of death" exploiting a bug in an OS, I don't believe it would make kqueue() return EINVAL in haproxy.

Cheers,
Willy
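As a side note, the two numbers Willy works from above are easy to reproduce. This is only a sanity check of the arithmetic quoted in the thread, not code from haproxy itself:

```python
import errno
from datetime import datetime, timezone

# A 1 kHz (1 ms) tick counter held in 32 bits wraps every 2^32 ms.
wrap_days = 2**32 / (1000 * 60 * 60 * 24)
print(f"32-bit millisecond counter wraps every {wrap_days:.1f} days")  # 49.7

# The first crash time mentioned above, given as a hex epoch timestamp.
crash = datetime.fromtimestamp(0x58B43950, tz=timezone.utc)
print(crash.strftime("%Y-%m-%dT%H:%MZ"))  # 2017-02-27T14:36Z

# The "= 22 (EINVAL)" in Dave's truss output is simply errno 22.
print(errno.EINVAL)  # 22
```

The decoded timestamp matches the 2017-02-27T14:36Z hang time reported independently by Dave and Mark earlier in the thread.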
Re: HaProxy Hang
On Wed, 5 Apr 2017, at 01:34, Lukas Tribus wrote:
> Hello,
>
> On 05.04.2017 at 00:27, David King wrote:
> > Hi Dave
> >
> > Thanks for the info. So interestingly we had the crash at exactly the
> > same time, so we are 3 for 3 on that.
> >
> > The setups sound very similar, but given we all saw the issue at the
> > same time, it really points to something more global.
> >
> > We are using NTP from our firewalls, which in turn get it from our
> > ISP, so I doubt that is the cause; it could be external port scanning
> > as you suggest, or maybe a leap second of some sort?
> >
> > Willy, any thoughts on the time coincidence?
>
> Can we be absolutely positive that those hangs are not directly or
> indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5,
> for example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD
> 11.0-p8"?
>
> There may be multiple different symptoms of those bugs, so even if the
> descriptions in those threads don't match your case 100%, it may still
> be caused by the same underlying bug.

I'll update from 1.7.3 to 1.7.5 with those goodies tomorrow and see how that goes.

A+
Dave
Re: HaProxy Hang
Hello,

On 05.04.2017 at 00:27, David King wrote:
> Hi Dave
>
> Thanks for the info. So interestingly we had the crash at exactly the
> same time, so we are 3 for 3 on that.
>
> The setups sound very similar, but given we all saw the issue at the
> same time, it really points to something more global.
>
> We are using NTP from our firewalls, which in turn get it from our ISP,
> so I doubt that is the cause; it could be external port scanning as you
> suggest, or maybe a leap second of some sort?
>
> Willy, any thoughts on the time coincidence?

Can we be absolutely positive that those hangs are not directly or indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5, for example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD 11.0-p8"?

There may be multiple different symptoms of those bugs, so even if the descriptions in those threads don't match your case 100%, it may still be caused by the same underlying bug.

A confirmation that those hangs are still happening in v1.7.5 would be crucial.

The time coincidence is intriguing, but I would not spend too much time with that. Collecting actual traces (like strace or its freebsd equivalent) and capture dumps is more likely to achieve progress, imo.

Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish y'all a good night,
lukas
Re: HaProxy Hang
Hi Dave

Thanks for the info. So interestingly we had the crash at exactly the same time, so we are 3 for 3 on that.

The setups sound very similar, but given we all saw the issue at the same time, it really points to something more global.

We are using NTP from our firewalls, which in turn get it from our ISP, so I doubt that is the cause; it could be external port scanning as you suggest, or maybe a leap second of some sort?

Willy, any thoughts on the time coincidence?

Thanks
Dave

On 3 April 2017 at 17:45, Dave Cottlehuber wrote:
> On Mon, 13 Mar 2017, at 13:31, David King wrote:
> > Hi All
> >
> > Apologies for the delay in response, I've been out of the country for
> > the last week.
> >
> > Mark, my gut feeling is that this is network-related in some way, so I
> > thought we could compare the networking setup of our systems.
> >
> > You mentioned you see the hang across geo locations, so I assume there
> > isn't layer 2 connectivity between all of the hosts? Is there any
> > back-end connectivity between the haproxy hosts?
>
> Following up on this, some interesting points but nothing useful.
>
> - Mark & I see the hang at almost exactly the same time on the same day:
>   2017-02-27T14:36Z, give or take a minute either way.
>
> - I see the hang in 3 different regions using 2 different hosting
>   providers, on both clustered and non-clustered services, but all on
>   FreeBSD 11.0R amd64. There is some dependency between these systems
>   but nothing unusual (logging backends, reverse-proxied services etc.).
>
> - Our servers don't have a specific workload that would allow them all
>   to run out of some internal resource at the same time, as their reboot
>   and patch cycles are reasonably different - typically a few days
>   elapse between first patches and last reboots unless it's deemed high
>   risk.
>
> - Our networking setup is not complex but typical FreeBSD:
>   - LACP-bonded Gbit igb(4) NICs
>   - CARP failover for both ipv4 & ipv6 addresses
>   - either direct to haproxy for http & TLS traffic, or via spiped to
>     decrypt intra-server traffic
>   - haproxy directs traffic into jailed services
>   - our overall load and throughput is low but consistent
>   - pf firewall
>   - rsyslog for logging, along with riemann and graphite for metrics
>   - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via
>     haproxy
>   - haproxy 1.6.10 + libressl at the time
>
> As I'm not one for conspiracy theories or weird coincidences, somebody
> port-scanning the internet with an Unexpectedly Evil Packet Combo seems
> the most plausible explanation. I cannot find an alternative that would
> fit the scenario of 3 different organisations with geographically
> distributed equipment and unconnected services reporting an unusual
> interruption on the same day at almost the same time.
>
> Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and the latest
> libressl and seen no recurrence, just like the last 8+ months or so
> since first deploying haproxy on FreeBSD instead of debian & nginx.
>
> If the issue recurs I plan to run a small cyclic traffic capture with
> tcpdump and wait for a repeat; see
> https://superuser.com/questions/286062/practical-tcpdump-examples
>
> Let me know if I can help or clarify further.
>
> A+
> Dave
Re: HaProxy Hang
On Mon, 13 Mar 2017, at 13:31, David King wrote:
> Hi All
>
> Apologies for the delay in response, I've been out of the country for
> the last week.
>
> Mark, my gut feeling is that this is network-related in some way, so I
> thought we could compare the networking setup of our systems.
>
> You mentioned you see the hang across geo locations, so I assume there
> isn't layer 2 connectivity between all of the hosts? Is there any
> back-end connectivity between the haproxy hosts?

Following up on this, some interesting points but nothing useful.

- Mark & I see the hang at almost exactly the same time on the same day: 2017-02-27T14:36Z, give or take a minute either way.

- I see the hang in 3 different regions using 2 different hosting providers, on both clustered and non-clustered services, but all on FreeBSD 11.0R amd64. There is some dependency between these systems but nothing unusual (logging backends, reverse-proxied services etc.).

- Our servers don't have a specific workload that would allow them all to run out of some internal resource at the same time, as their reboot and patch cycles are reasonably different - typically a few days elapse between first patches and last reboots unless it's deemed high risk.

- Our networking setup is not complex but typical FreeBSD:
  - LACP-bonded Gbit igb(4) NICs
  - CARP failover for both ipv4 & ipv6 addresses
  - either direct to haproxy for http & TLS traffic, or via spiped to decrypt intra-server traffic
  - haproxy directs traffic into jailed services
  - our overall load and throughput is low but consistent
  - pf firewall
  - rsyslog for logging, along with riemann and graphite for metrics
  - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
  - haproxy 1.6.10 + libressl at the time

As I'm not one for conspiracy theories or weird coincidences, somebody port-scanning the internet with an Unexpectedly Evil Packet Combo seems the most plausible explanation. I cannot find an alternative that would fit the scenario of 3 different organisations with geographically distributed equipment and unconnected services reporting an unusual interruption on the same day at almost the same time.

Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and the latest libressl and seen no recurrence, just like the last 8+ months or so since first deploying haproxy on FreeBSD instead of debian & nginx.

If the issue recurs I plan to run a small cyclic traffic capture with tcpdump and wait for a repeat; see https://superuser.com/questions/286062/practical-tcpdump-examples

Let me know if I can help or clarify further.

A+
Dave
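The cyclic capture Dave mentions maps onto tcpdump's ring-buffer options: -C rotates at a given file size (in millions of bytes) and -W caps the number of files kept. A small sketch that only builds the command line; the interface name and output path are placeholders:

```python
import shlex

def ring_capture_cmd(iface="igb0", mbytes=50, files=10,
                     out="/var/tmp/haproxy-ring.pcap"):
    # tcpdump ring buffer: rotate through `files` capture files of roughly
    # `mbytes` MB each, so a continuous capture never fills the disk.
    return ["tcpdump", "-i", iface, "-n",
            "-C", str(mbytes), "-W", str(files), "-w", out]

print(shlex.join(ring_capture_cmd()))
```

Run the printed command (as root) and the capture keeps only the most recent ~500 MB of traffic, which is exactly what is needed when waiting weeks for a repeat.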
Re: HaProxy Hang
Hi All

Apologies for the delay in response, I've been out of the country for the last week.

Mark, my gut feeling is that this is network-related in some way, so I thought we could compare the networking setup of our systems.

You mentioned you see the hang across geo locations, so I assume there isn't layer 2 connectivity between all of the hosts? Is there any back-end connectivity between the haproxy hosts?

Ours are all layer 2 but fairly complex. We have 6 connected NICs which are bonded into 3 LACP groups. On top of the LACP we have a number of VLAN interfaces; we also have a couple of normal IP aliases and a number of CARP IPs on top of that.

One commonality is NTP, as they all sync from our own upstream NTP services, but having looked through the logs, there isn't a recent NTP update when the hang occurs and I can't see any time jump.

Other things which are set up on the host:
- local rsyslog which sends logs to a centralised host
- crons every minute for each jail (4 jails) to monitor the health of the haproxy service
- crons every minute for each jail (4 jails) to gather stats from haproxy using the haproxy stats frontend
- pf runs on the host
- Chef runs every 30 mins, and these times are splayed

Does anything match up on these which could cause these issues?

Thanks
Dave

On 6 March 2017 at 20:28, Mark S wrote:
> On Mon, 06 Mar 2017 15:02:43 -0500, Willy Tarreau wrote:
> > OK, so that means that haproxy could have hung in a day or two; then
> > your case is much more common than one of the other reports. If your
> > front LB is fair between the 6 servers, that could be related to a
> > total number of requests or connections or something like this.
>
> Another relevant point is that these servers are tied together using
> upstream, GeoIP-based DNS load balancing. So the request rate across
> servers varies quite a bit depending on the location. This would make a
> synchronized failure based on total requests less likely.
>
> > I'm thinking about other things:
> > - if you're doing a lot of SSL we could imagine an issue with random
> >   generation using /dev/random instead of /dev/urandom. I've met this
> >   issue a long time ago on some apache servers where all the entropy
> >   was progressively consumed until it was not possible anymore to get
> >   a connection.
>
> I'll set up a script to capture the netstat and other info prior to
> reloading should this issue re-occur.
>
> As for SSL, yes, we do a fair bit of SSL (about 30% of total request
> count) and HAProxy does the TLS termination and then hands off via TCP
> proxy.
>
> Best,
> -=Mark S.
Re: HaProxy Hang
Willy, per your comment on /dev/random exhaustion: I think running haveged on servers doing crypto work is/should be best practice.

jerry

On 3/6/17 12:02 PM, Willy Tarreau wrote:
> Hi Mark,
>
> On Mon, Mar 06, 2017 at 02:49:28PM -0500, Mark S wrote:
> > As for the timing issue, I can add to the discussion with a few
> > related data points. In short, system uptime does not seem to be a
> > commonality in my situation.
>
> Thanks!
>
> > 1) I had this issue affect 6 servers, spread across 5 data centers
> > (only 2 servers are in the same facility). All servers stopped
> > processing requests at roughly the same moment, certainly within the
> > same minute. All servers running FreeBSD 11.0-RELEASE-p2 with HAProxy
> > compiled locally against OpenSSL-1.0.2k
>
> OK.
>
> > 2) System uptime was not at all similar across these servers, although
> > chances are most servers' HAProxy process start time would be similar.
> > The servers with the highest system uptime were at about 27 days at
> > the time of the incident, while the shortest were under a day or two.
>
> OK, so that means that haproxy could have hung in a day or two; then
> your case is much more common than one of the other reports. If your
> front LB is fair between the 6 servers, that could be related to a total
> number of requests or connections or something like this.
>
> > 3) HAProxy configurations are similar, but not exactly consistent
> > between servers - different IPs on the frontend, different ACLs and
> > backends.
>
> OK.
>
> > 4) The only synchronized application common to all of these servers is
> > OpenNTPd.
>
> Is there any risk that the ntpd causes time jumps into the future or the
> past for whatever reason? Maybe there's something with kqueue and time
> jumps in recent versions?
>
> > 5) I have since upgraded to HAProxy-1.7.3, same build process; the
> > full version output is below - and will of course report any observed
> > issues.
> >
> > haproxy -vv
> > HA-Proxy version 1.7.3 2017/02/28
> (...)
>
> Everything there looks pretty standard.
>
> If it dies again it could be good to try with "nokqueue" in the global
> section (or start haproxy with -dk) to disable kqueue and switch to
> poll. It will eat a bit more CPU, so don't do this on all nodes at once.
>
> I'm thinking about other things:
> - if you're doing a lot of SSL we could imagine an issue with random
>   generation using /dev/random instead of /dev/urandom. I've met this
>   issue a long time ago on some apache servers where all the entropy was
>   progressively consumed until it was not possible anymore to get a
>   connection.
> - it could be useful to run "netstat -an" on a dead node before killing
>   haproxy and archive this for later analysis. It may reveal that all
>   file descriptors were used by CLOSE_WAIT connections (indicating a
>   close bug in haproxy) or something like this. If instead you see a lot
>   of FIN_WAIT1 or FIN_WAIT2, it may indicate an issue with some external
>   firewall or pf blocking some final traffic and leading to socket-space
>   exhaustion.
>
> If you have the same issue that was reported with kevent() being called
> in loops and returning an error, you may definitely see tons of
> CLOSE_WAIT, and it will indicate an issue with this poller, though I
> have no idea which one, especially since it doesn't change often and
> *seems* to work with previous versions.
>
> Best regards,
> Willy

--
Soundhound Devops
"What could possibly go wrong?"
Re: HaProxy Hang
On Mon, 06 Mar 2017 15:02:43 -0500, Willy Tarreau wrote:

> OK, so that means that haproxy could have hung in a day or two; then your
> case is much more common than one of the other reports. If your frontend
> LB is fair between the 6 servers, that could be related to a total number
> of requests or connections or something like this.

Another relevant point is that these servers are tied together using upstream, GeoIP-based DNS load balancing, so the request rate varies quite a bit across servers depending on the location. This would make a synchronized failure based on total requests less likely.

> I'm thinking about other things :
> - if you're doing a lot of SSL we could imagine an issue with random
>   generation using /dev/random instead of /dev/urandom. I've met this
>   issue a long time ago on some apache servers where all the entropy was
>   progressively consumed until it was not possible anymore to get a
>   connection.

I'll set up a script to capture the netstat and other info prior to reloading should this issue re-occur. As for SSL, yes, we do a fair bit of it (about 30% of total request count); HAProxy does the TLS termination and then hands off via TCP proxy.

Best,
-=Mark S.
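A capture script along the lines Mark describes might look like the sketch below. The output directory and the exact tool set are assumptions (fstat is used as the FreeBSD counterpart of lsof); adjust for your environment.

```shell
#!/bin/sh
# Capture state from a hung haproxy node before restarting it.
# Sketch only: output directory and tool names are assumptions.
OUT=${OUT:-/tmp/haproxy-hang-capture}
mkdir -p "$OUT"

# Socket states: many CLOSE_WAIT entries would point at a close bug in
# haproxy; many FIN_WAIT1/FIN_WAIT2 at a firewall/pf eating final traffic.
netstat -an > "$OUT/netstat.txt" 2>&1 || true

# Process table, for CPU usage and process state at the time of the hang.
ps auxww > "$OUT/ps.txt" 2>&1 || true

# Open file descriptors of the oldest haproxy process (skipped quietly
# if fstat is unavailable on this system).
( command -v fstat >/dev/null 2>&1 && fstat -p "$(pgrep -o haproxy)" ) \
    > "$OUT/fstat.txt" 2>&1 || true

echo "diagnostics saved in $OUT"
```

Run it from the reload wrapper (or by hand) before killing the hung process, so the evidence survives the restart.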
Re: HaProxy Hang
Hi Mark,

On Mon, Mar 06, 2017 at 02:49:28PM -0500, Mark S wrote:
> As for the timing issue, I can add to the discussion with a few related
> data points. In short, system uptime does not seem to be a commonality to
> my situation.

thanks!

> 1) I had this issue affect 6 servers, spread across 5 data centers (only 2
> servers are in the same facility). All servers stopped processing requests
> at roughly the same moment, certainly within the same minute. All servers
> running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally against
> OpenSSL-1.0.2k

OK.

> 2) System uptime was not at all similar across these servers, although
> chances are most servers' HAProxy process start times would be similar.
> The servers with the highest system uptime were at about 27 days at the
> time of the incident, while the shortest were under a day or two.

OK, so that means that haproxy could have hung in a day or two; then your case is much more common than one of the other reports. If your frontend LB is fair between the 6 servers, that could be related to a total number of requests or connections or something like this.

> 3) HAProxy configurations are similar, but not exactly consistent between
> servers - different IPs on the frontend, different ACLs and backends.

OK.

> 4) The only synchronized application common to all of these servers is
> OpenNTPd.

Is there any risk that the ntpd causes time jumps in the future or in the past for whatever reason? Maybe there's something with kqueue and time jumps in recent versions?

> 5) I have since upgraded to HAProxy-1.7.3, same build process: the full
> version output is below - and will of course report any observed issues.
>
> haproxy -vv
> HA-Proxy version 1.7.3 2017/02/28
> (...)

Everything there looks pretty standard.

If it dies again it could be good to try with "nokqueue" in the global section (or start haproxy with -dk) to disable kqueue and switch to poll. It will eat a bit more CPU, so don't do this on all nodes at once.
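The workaround Willy describes is a one-line change in the configuration, for example:

```haproxy
global
    # Disable the kqueue poller; haproxy falls back to poll().
    # Expect somewhat higher CPU above ~1000 concurrent connections.
    nokqueue
```

Equivalently, for a one-off test, start haproxy with the -dk command-line flag instead of editing the configuration.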
I'm thinking about other things:

- if you're doing a lot of SSL we could imagine an issue with random generation using /dev/random instead of /dev/urandom. I've met this issue a long time ago on some apache servers where all the entropy was progressively consumed until it was not possible anymore to get a connection.

- it could be useful to run "netstat -an" on a dead node before killing haproxy and archive this for later analysis. It may reveal that all file descriptors were used by CLOSE_WAIT connections (indicating a close bug in haproxy) or something like this. If instead you see a lot of FIN_WAIT1 or FIN_WAIT2, it may indicate an issue with some external firewall or pf blocking some final traffic and leading to socket space exhaustion.

If you have the same issue that was reported with kevent() being called in loops and returning an error, you may definitely see tons of CLOSE_WAIT, and it will indicate an issue with this poller, though I have no idea which one, especially since it doesn't change often and *seems* to work with previous versions.

Best regards,
Willy
Re: HaProxy Hang
On Mon, 06 Mar 2017 01:35:19 -0500, Willy Tarreau wrote:

> On Fri, Mar 03, 2017 at 07:54:46PM +0300, Dmitry Sivachenko wrote:
> > On 03 Mar 2017, at 19:36, David King wrote:
> > Thanks for the response!
> > That's interesting, I don't suppose you have the details of the other
> > issues?
>
> First report is
> https://www.mail-archive.com/haproxy@formilux.org/msg25060.html
> Second one
> https://www.mail-archive.com/haproxy@formilux.org/msg25067.html

> Thanks for the links Dmitry.
>
> That's indeed really odd. If all hang at the same time, timing or uptime
> looks like a good candidate. There's not much which is really specific to
> FreeBSD in haproxy. However, the kqueue poller is only used there (and on
> OpenBSD), and uses timing for the timeout. Thus it sounds likely that
> there could be an issue there, either in haproxy or FreeBSD.
>
> A hang every 2-3 months makes me think about the 49.7 days it takes for a
> millisecond counter to wrap. These bugs are hard to troubleshoot. We used
> to have such an issue a long time ago in linux 2.4 when the timer was set
> to 100 Hz, it required 497 days to know whether the bug was solved or not
> (obviously it now is).
>
> I've just compared ev_epoll.c and ev_kqueue.c in case I could spot
> anything obvious but from what I'm seeing they're pretty much similar so
> I don't see what could cause this bug there. And since it apparently works
> fine on FreeBSD 10, at best one of our bugs could only trigger a system
> bug if it exists.
>
> David, if your workload permits it, you can disable kqueue and haproxy
> will automatically fall back to poll. For this you can simply put
> "nokqueue" in the global section. poll() doesn't scale as well as
> kqueue(), it's cheaper on low connection counts but it will use more CPU
> above ~1000 concurrent connections.
>
> Regards,
> Willy

Hi Willy,

As for the timing issue, I can add to the discussion with a few related data points. In short, system uptime does not seem to be a commonality to my situation.
1) I had this issue affect 6 servers, spread across 5 data centers (only 2 servers are in the same facility). All servers stopped processing requests at roughly the same moment, certainly within the same minute. All servers running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally against OpenSSL-1.0.2k

2) System uptime was not at all similar across these servers, although chances are most servers' HAProxy process start times would be similar. The servers with the highest system uptime were at about 27 days at the time of the incident, while the shortest were under a day or two.

3) HAProxy configurations are similar, but not exactly consistent between servers - different IPs on the frontend, different ACLs and backends.

4) The only synchronized application common to all of these servers is OpenNTPd.

5) I have since upgraded to HAProxy-1.7.3, same build process: the full version output is below - and will of course report any observed issues.

haproxy -vv
HA-Proxy version 1.7.3 2017/02/28
Copyright 2000-2017 Willy Tarreau

Build options :
  TARGET  = freebsd
  CPU     = generic
  CC      = clang
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built without compression support (neither USE_ZLIB nor USE_SLZ are set)
Compression algorithms supported : identity("identity")
Built with OpenSSL version : OpenSSL 1.0.2k 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY

Available polling systems :
      kqueue : pref=300,  test result OK
        poll : pref=200,  test result OK
      select : pref=150,  test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
        [SPOE] spoe
        [TRACE] trace
        [COMP] compression

Cheers,
-=Mark
Re: HaProxy Hang
On Fri, Mar 03, 2017 at 07:54:46PM +0300, Dmitry Sivachenko wrote:
> > On 03 Mar 2017, at 19:36, David King wrote:
> >
> > Thanks for the response!
> > That's interesting, I don't suppose you have the details of the other
> > issues?
>
> First report is
> https://www.mail-archive.com/haproxy@formilux.org/msg25060.html
> Second one
> https://www.mail-archive.com/haproxy@formilux.org/msg25067.html

Thanks for the links Dmitry.

That's indeed really odd. If all hang at the same time, timing or uptime looks like a good candidate. There's not much which is really specific to FreeBSD in haproxy. However, the kqueue poller is only used there (and on OpenBSD), and uses timing for the timeout. Thus it sounds likely that there could be an issue there, either in haproxy or FreeBSD.

A hang every 2-3 months makes me think about the 49.7 days it takes for a millisecond counter to wrap. These bugs are hard to troubleshoot. We used to have such an issue a long time ago in linux 2.4 when the timer was set to 100 Hz, it required 497 days to know whether the bug was solved or not (obviously it now is).

I've just compared ev_epoll.c and ev_kqueue.c in case I could spot anything obvious but from what I'm seeing they're pretty much similar, so I don't see what could cause this bug there. And since it apparently works fine on FreeBSD 10, at best one of our bugs could only trigger a system bug if it exists.

David, if your workload permits it, you can disable kqueue and haproxy will automatically fall back to poll. For this you can simply put "nokqueue" in the global section. poll() doesn't scale as well as kqueue(), it's cheaper on low connection counts but it will use more CPU above ~1000 concurrent connections.

Regards,
Willy
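The 49.7-day figure Willy mentions falls straight out of the width of a 32-bit millisecond counter:

```shell
# A 32-bit unsigned millisecond counter wraps after 2^32 ms.
awk 'BEGIN { printf "%.2f days\n", 2^32 / (1000 * 60 * 60 * 24) }'
# prints "49.71 days"
```

Likewise, a 100 Hz (10 ms) tick counter takes ten times as long to wrap, which is the 497-day Linux 2.4 case from the same paragraph.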
Re: HaProxy Hang
> On 03.03.2017 at 15:07, David King wrote:
>
> Hi All
>
> Hoping someone will be able to help, we're running a bit of an interesting
> setup.
>
> We have 3 HAProxy nodes running FreeBSD 11.0; each host runs 4 jails, each
> running haproxy, but only one of the jails is under any real load.

Do you use ZFS?

We have an internal piece of software (some sort of monitoring agent) that also hangs in jails from time to time. The guy who wrote it found out it's because of mmap (I don't know the specifics). The processes end up unkillable in "D" state and we need to reboot the hosts to fix it. As the purpose of the hosts is not to run the agent, we usually let it hang and restart when it's convenient.

The systems are FreeBSD 10.3, though (running nginx and varnish in different jails).
Re: HaProxy Hang
> On 03 Mar 2017, at 19:36, David King <king.c.da...@googlemail.com> wrote:
>
> Thanks for the response!
> That's interesting, I don't suppose you have the details of the other
> issues?

First report is https://www.mail-archive.com/haproxy@formilux.org/msg25060.html
Second one https://www.mail-archive.com/haproxy@formilux.org/msg25067.html
(in the same thread)

> Thanks
> Dave
>
> On 3 March 2017 at 14:15, Dmitry Sivachenko <trtrmi...@gmail.com> wrote:
>
> > On 03 Mar 2017, at 17:07, David King <king.c.da...@googlemail.com> wrote:
> >
> > Hi All
> >
> > Hoping someone will be able to help, we're running a bit of an
> > interesting setup.
> >
> > We have 3 HAProxy nodes running FreeBSD 11.0; each host runs 4 jails,
> > each running haproxy, but only one of the jails is under any real load.
>
> If my memory does not fail me, this is the third report of a haproxy hang
> on FreeBSD, and all these reports are about FreeBSD 11.
>
> I wonder if anyone experiences this issue with FreeBSD 10?
>
> I am running a rather heavily loaded haproxy cluster on FreeBSD 10
> (version 1.6.9 to be specific) and never experienced any hangs (knock on
> wood).
Re: HaProxy Hang
Thanks for the response!

That's interesting, I don't suppose you have the details of the other issues?

Thanks
Dave

On 3 March 2017 at 14:15, Dmitry Sivachenko <trtrmi...@gmail.com> wrote:

> > On 03 Mar 2017, at 17:07, David King <king.c.da...@googlemail.com>
> > wrote:
> >
> > Hi All
> >
> > Hoping someone will be able to help, we're running a bit of an
> > interesting setup.
> >
> > We have 3 HAProxy nodes running FreeBSD 11.0; each host runs 4 jails,
> > each running haproxy, but only one of the jails is under any real load.
>
> If my memory does not fail me, this is the third report of a haproxy hang
> on FreeBSD, and all these reports are about FreeBSD 11.
>
> I wonder if anyone experiences this issue with FreeBSD 10?
>
> I am running a rather heavily loaded haproxy cluster on FreeBSD 10
> (version 1.6.9 to be specific) and never experienced any hangs (knock on
> wood).
HaProxy Hang
Hi All

Hoping someone will be able to help, we're running a bit of an interesting setup.

We have 3 HAProxy nodes running FreeBSD 11.0; each host runs 4 jails, each running haproxy, but only one of the jails is under any real load. We use CARP to balance between the hosts and jails, which seems to be working fine.

About once every 2-3 months, all the haproxy instances hang: the process keeps running but doesn't accept any more connections, and the monitoring socket is unresponsive. It doesn't produce any errors in the logs. These hangs all happen within a couple of seconds, across all jails on all hosts, taking down our frontend network; a restart of the haproxy service fixes it.

We use chef for config management, and all the run times are splayed, so all the haproxy instances will have different uptimes.

Anyone have a good idea of what could cause this?

Thanks!!

haproxy -vv
HA-Proxy version 1.7.2 2017/01/13
Copyright 2000-2017 Willy Tarreau

Build options :
  TARGET  = freebsd
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -pipe -fstack-protector -fno-strict-aliasing -DFREEBSD_PORTS
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_CPU_AFFINITY=1 USE_OPENSSL=1 USE_LUA=1 USE_STATIC_PCRE=1 USE_PCRE_JIT=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2k-freebsd 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2j-freebsd 26 Sep 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.40 2017-01-11
Running on PCRE version : 8.40 2017-01-11
PCRE library supports JIT : yes
Built with Lua version : Lua 5.3.3
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY

Available polling systems :
      kqueue : pref=300,  test result OK
        poll : pref=200,  test result OK
      select : pref=150,  test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
        [SPOE] spoe
        [TRACE] trace
        [COMP] compression
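Since the hang leaves the process running but the stats socket unresponsive, a watchdog can tell "hung" apart from "down" by probing the socket with a timeout. This is a sketch: the socket path and the use of socat are assumptions about this setup.

```shell
#!/bin/sh
# Probe haproxy's admin socket: a healthy process answers "show info"
# almost instantly; a hung one never responds.
# The socket path and socat usage are assumptions; adjust as needed.
SOCK=${SOCK:-/var/run/haproxy.sock}

if printf 'show info\n' \
        | timeout 5 socat - "UNIX-CONNECT:$SOCK" 2>/dev/null \
        | grep -q '^Name:'; then
    echo "haproxy socket: responsive"
else
    echo "haproxy socket: unresponsive"
fi
```

Run from cron, this would have caught the hang within minutes instead of when the frontend network went dark.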
Re: HAProxy Hang during initial connection
Hi John,

On Tue, May 27, 2014 at 08:08:27PM +, JDzialo John wrote:
> Hi Willy,
>
> Here is a capture of all traffic btwn the two servers using the host
> option.

Thank you.

> Basically traffic goes from haproxy to a web farm in a round robin
> fashion. These individual web servers are accessing a single file server
> (we are in the process of splitting this file server into multiple
> servers to spread the load). This one file server is getting slammed all
> day with requests and that may be the root of the problem but have not
> found the smoking gun necessary to prove it.
>
> Should I also get a capture btwn our web farm and the file server? I have
> also cc'd our network administrator to help come down to a solution.
>
> Let me know what you think and if there is anything else I can provide.
> As always thank you so much for your help. You have been a great help to
> me in narrowing down issues.

I spent some time reading the captures and found nothing abnormal in them. Do you have any indication of a faulty session or request?

Also I noticed that you took the captures on the server itself and that the server has TSO enabled since we're seeing large frames. It would be possible that there's a bug in the network driver or NIC causing some frames to be lost for example. Maybe the same trace taken on the haproxy server at the same time would reveal some extra information.

Note, you don't need to post the whole file to the list, there are about 800 people who are probably not interested in receiving this 6MB file :-) Either you can put it on a public server, or you can simply send it privately to me.

Thanks,
Willy
RE: HAProxy Hang during initial connection
Hi Willy

Thanks, I'll send future traces to you directly. I understand the hatred of bulky email files!

So I think I found the problem but would love your take on it.

Our web applications and services in our haproxy backend are using keepalive in their connection headers. I understand in haproxy v1.4 keepalives are ok from the client side but not from the server side, correct?

So I added option http-server-close on our haproxy web service server and it appears to have stopped this random half-loaded data stream issue.

Can you explain how having keepalive coming from the server-side application connection headers could cause this issue? Could you give a brief description of what happens when haproxy receives a keepalive header but does not have the http-server-close option set?

Thanks as always for your help!

-----Original Message-----
From: Willy Tarreau [mailto:w...@1wt.eu]
Sent: Wednesday, May 28, 2014 10:15 AM
To: JDzialo John
Cc: haproxy@formilux.org; AZabrecky Allan
Subject: Re: HAProxy Hang during initial connection

Hi John,

On Tue, May 27, 2014 at 08:08:27PM +, JDzialo John wrote:
> Hi Willy,
>
> Here is a capture of all traffic btwn the two servers using the host
> option.

Thank you.

> Basically traffic goes from haproxy to a web farm in a round robin
> fashion. These individual web servers are accessing a single file server
> (we are in the process of splitting this file server into multiple
> servers to spread the load). This one file server is getting slammed all
> day with requests and that may be the root of the problem but have not
> found the smoking gun necessary to prove it.
>
> Should I also get a capture btwn our web farm and the file server? I have
> also cc'd our network administrator to help come down to a solution.
>
> Let me know what you think and if there is anything else I can provide.
> As always thank you so much for your help. You have been a great help to
> me in narrowing down issues.

I spent some time reading the captures and found nothing abnormal in them. Do you have any indication of a faulty session or request?

Also I noticed that you took the captures on the server itself and that the server has TSO enabled since we're seeing large frames. It would be possible that there's a bug in the network driver or NIC causing some frames to be lost for example. Maybe the same trace taken on the haproxy server at the same time would reveal some extra information.

Note, you don't need to post the whole file to the list, there are about 800 people who are probably not interested in receiving this 6MB file :-) Either you can put it on a public server, or you can simply send it privately to me.

Thanks,
Willy
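For reference, the change John describes is a single line in the 1.4 configuration (shown here in a defaults section; adapt to wherever your options live):

```haproxy
defaults
    mode http
    # Actively close the server-side connection after each response,
    # instead of merely adding "Connection: close" to the headers
    # (which a misbehaving server may ignore).
    option http-server-close
```
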
Re: HAProxy Hang during initial connection
Hi John,

On Wed, May 28, 2014 at 07:54:20PM +, JDzialo John wrote:
> Hi Willy
>
> Thanks, I'll send future traces to you directly. I understand the hatred
> of bulky email files!
>
> So I think I found the problem but would love your take on it.
>
> Our web applications and services in our haproxy backend are using
> keepalive in their connection headers. I understand in haproxy v1.4
> keepalives are ok from the client side but not from the server side,
> correct?
>
> So I added option http-server-close on our haproxy web service server and
> it appears to have stopped this random half-loaded data stream issue.
>
> Can you explain how having keepalive coming from the server-side
> application connection headers could cause this issue? Could you give a
> brief description of what happens when haproxy receives a keepalive
> header but does not have the http-server-close option set?

OK, so the client is not a browser, right?

When you're using "option httpclose" only, haproxy just modifies the headers to add "close" to the request and to the response, but does not perform any active close. I observed in the distant past (8 years ago) that some servers would not honor the close and would expect the client to close after they get the complete response. Then I added "option forceclose" to close the server-side connection once the server starts to respond. By now we have a much more complete message parser which knows where the end is and which actively closes the connections as soon as you enable http-server-close or http-keep-alive (the latter is not available on 1.4).

So what you found now is that your server ignores the close and expects the client to close, while the client expects the same from the server. Additionally it's possible that your client uses excess buffering and does not get the whole response until it gets the real close (possibly the one caused by haproxy's timeout).

Usually these situations are detectable in haproxy's logs because you get very large total transfer times for a small server response time, and you notice that all transfer times are very close to the client or server timeout. Then you know that the connection remained open for that amount of time.

Anyway, please use http-server-close or forceclose, it will do what you need by enforcing the close.

I hope you get a clearer picture now.

Regards,
Willy
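The log pattern Willy describes (huge total time, small server response time) can be spotted mechanically. In the httplog format the timers appear as one Tq/Tw/Tc/Tr/Tt field; the field detection, the sample line, and the 60000 ms threshold below are assumptions, so adjust them to your own log format and timeouts.

```shell
# Scan haproxy httplog output for requests whose total time (Tt) is very
# large while the server response time (Tr) is small -- the signature of
# a connection held open until a client/server timeout.
# Demo on one synthetic log line (the line layout is an assumption):
echo 'haproxy[123]: 1.2.3.4:5678 [27/May/2014:20:08:27.123] fe web/web1 0/0/1/12/600001 200 1234 - - ---- 1/1/1/1/0 0/0 "GET /index.html HTTP/1.1"' |
awk '{
    for (i = 1; i <= NF; i++) {
        # The timer field is the first field of five /-separated numbers.
        n = split($i, t, "/")
        if (n == 5 && t[1] ~ /^-?[0-9]+$/ && t[5] ~ /^[0-9]+$/) {
            # Tt over 60s with Tr under 1s: connection likely idled
            # open until a timeout fired.
            if (t[5] + 0 > 60000 && t[4] + 0 < 1000)
                print "suspect:", $0
            break
        }
    }
}'
```

Pointed at a real log file instead of the echoed sample, the same awk program flags the suspect requests for closer inspection.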
Re: HAProxy Hang during initial connection
Hi John,

On Thu, May 15, 2014 at 05:49:59PM +, JDzialo John wrote:
> Hi Guys,
>
> We have been using haproxy and have loved it and appreciate all the hard
> work you have put into this great product. I am new to the project and
> still trying to grasp its complexities, so forgive me in advance for any
> ignorance.
>
> We have been running haproxy in front of a web farm since January:
> HAProxy 1.4.24 on a Debian 7 server performing a round-robin balance
> across 6 web servers using IIS 7.5 and hosting a .NET application. All
> hosted in AWS.
>
> Recently we have started to see a strange issue arise. Every once in a
> while a browser request or a web service call will hang until our
> 10-minute client timeout is hit and the request fails.
>
> Using fiddler for testing, once in a while I see a request initiate but
> never make the connection to a back end server. The request hangs and
> eventually fiddler reports a content length mismatch, as our header
> declared a certain amount of data but the client only received a fraction
> of it.

So it sounds like your server is sending a wrong content-length.

> The issue is random but happens pretty consistently throughout the day.
> This just started a few weeks ago and there were no changes on our
> HAProxy config made since February.

Which would be consistent with a recent change on the server :-)

> Below find our configuration.
> # Global config
> global
>     log 127.0.0.1 local0
>     log 127.0.0.1 local1 notice
>     maxconn 10
>     stats socket /var/run/haproxy.sock mode 0600 level admin
>     user haproxy
>     group haproxy
>     daemon
>
> # Default config
> defaults
>     log global
>     mode http
>     option httplog
>     option dontlognull
>     option redispatch
>     option forwardfor
>     option httpclose
>     option abortonclose
>     retries 1
>     timeout connect 5000
>     timeout client 5
>     timeout server 5
>
> listen stats #disabled
>     bind *:
>     stats enable
>     stats uri /haproxy?stats
>     stats realm Strictly\ Private
>     stats auth :xx
>
> frontend unsecured *:80
>     timeout client 60
>     default_backend web
>
> backend web
>     timeout server 60
>     balance roundrobin
>     server web1 xxx.xxx.xxx.xxx:80 check
>     server web2 xxx.xxx.xxx.xxx:80 check
>     server web3 xxx.xxx.xxx.xxx:80 check
>     server web4 xxx.xxx.xxx.xxx:80 check
>     server web5 xxx.xxx.xxx.xxx:80 check
>     server web6 xxx.xxx.xxx.xxx:80 check
>
> Is there anything in our configuration that could cause this weird
> behavior? Or anything I could add?

No, haproxy will use content-length just like your client to know the response length, but will not modify it.

> How about kernel settings in sysctl? What are the optimal settings to run
> a haproxy server?

These are totally unrelated. Here you're having a problem with less data being returned than advertised. So very likely your server is sending some abnormal contents. Have you tried taking a network capture between haproxy and the server to verify this?

Note that another explanation could be totally unrelated to content length and could simply be a bug causing haproxy to actually stop receiving or emitting data, and closing the connection after the timeout. That would also explain why your client receives less data than indicated in the content-length header. But if so, you should be able to tell whether or not the contents have been truncated. However I'm not seeing any known bug looking like this in 1.4.24, so while possible, this seems very strange since most people are using 1.4 :-/

> Any help you can give would be really appreciated.
> Please let me know if there is anything else I can provide.

Please try to take a capture of the response from the server to haproxy so that we find there if the response size is correct, and/or if any special event happens. Please use something like this (assuming all your traffic flows by interface eth0):

  tcpdump -s0 -npi eth0 -w trace-server.cap src server-ip

If you have enough space on your disk, it can be even better to also capture the haproxy-to-client traffic:

  tcpdump -s0 -npi eth0 -w trace-client.cap dst client-ip

Best regards,
Willy
HAProxy Hang during initial connection
Hi Guys,

We have been using haproxy and have loved it and appreciate all the hard work you have put into this great product. I am new to the project and still trying to grasp its complexities, so forgive me in advance for any ignorance.

We have been running haproxy in front of a web farm since January: HAProxy 1.4.24 on a Debian 7 server performing a round-robin balance across 6 web servers using IIS 7.5 and hosting a .NET application. All hosted in AWS.

Recently we have started to see a strange issue arise. Every once in a while a browser request or a web service call will hang until our 10-minute client timeout is hit and the request fails.

Using fiddler for testing, once in a while I see a request initiate but never make the connection to a back end server. The request hangs and eventually fiddler reports a content length mismatch, as our header declared a certain amount of data but the client only received a fraction of it.

The issue is random but happens pretty consistently throughout the day. This just started a few weeks ago and there were no changes on our HAProxy config made since February.

Below find our configuration.
# Global config
global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    maxconn 10
    stats socket /var/run/haproxy.sock mode 0600 level admin
    user haproxy
    group haproxy
    daemon

# Default config
defaults
    log global
    mode http
    option httplog
    option dontlognull
    option redispatch
    option forwardfor
    option httpclose
    option abortonclose
    retries 1
    timeout connect 5000
    timeout client 5
    timeout server 5

listen stats #disabled
    bind *:
    stats enable
    stats uri /haproxy?stats
    stats realm Strictly\ Private
    stats auth :xx

frontend unsecured *:80
    timeout client 60
    default_backend web

backend web
    timeout server 60
    balance roundrobin
    server web1 xxx.xxx.xxx.xxx:80 check
    server web2 xxx.xxx.xxx.xxx:80 check
    server web3 xxx.xxx.xxx.xxx:80 check
    server web4 xxx.xxx.xxx.xxx:80 check
    server web5 xxx.xxx.xxx.xxx:80 check
    server web6 xxx.xxx.xxx.xxx:80 check

Is there anything in our configuration that could cause this weird behavior? Or anything I could add? How about kernel settings in sysctl? What are the optimal settings to run a haproxy server?

Any help you can give would be really appreciated. Please let me know if there is anything else I can provide.

John Dzialo | Linux System Administrator
Direct 203.783.8163 | Main 800.352.0050
Environmental Data Resources, Inc.
440 Wheelers Farms Road, Milford, CT 06461
www.edrnet.com | commonground.edrnet.com