Re: HaProxy Hang

2017-06-07 Thread Dave Cottlehuber
On Wed, 7 Jun 2017, at 10:42, David King wrote:
> Just to close the loop on this: last night was the time at which we were
> expecting the next hang. None of the servers where we had updated haproxy
> to the patched version hung. The test servers which were still running
> the older version hung as expected.
> 
> Thanks so much to everyone who fixed the issue!

Same here, although since we patched everything we had no issues at all :D
Thank you very much!

A+
Dave



Re: HaProxy Hang

2017-06-07 Thread Willy Tarreau
Hi David,

On Wed, Jun 07, 2017 at 09:42:58AM +0100, David King wrote:
> Just to close the loop on this: last night was the time at which we were
> expecting the next hang. None of the servers where we had updated haproxy
> to the patched version hung. The test servers which were still running
> the older version hung as expected.
> 
> Thanks so much to everyone who fixed the issue!

Feedback much appreciated, thank you! We need to issue 1.7.6 soon with
this fix, but other troubling issues still under investigation have
delayed it a bit.

Cheers,
Willy



Re: HaProxy Hang

2017-06-07 Thread David King
Just to close the loop on this: last night was the time at which we were
expecting the next hang. None of the servers where we had updated haproxy
to the patched version hung. The test servers which were still running
the older version hung as expected.

Thanks so much to everyone who fixed the issue!

On 18 April 2017 at 10:45, Willy Tarreau  wrote:

> Hi David,
>
> On Tue, Apr 18, 2017 at 10:33:40AM +0100, David King wrote:
> > Hi All
> >
> > Just to confirm Willy's theory: we had the hang at exactly the time
> > specified this morning.
>
> I could recycle myself in a new church of which I would be the prophet...
> well, maybe it already exists; we have thousands of followers after all :-)
>
> More seriously, I think it will be useful to report a bug to the FreeBSD
> project. There are quite a number of elements, possibly nothing that can
> make it obvious where the problem could be, but a number of hypotheses
> can be ruled out already, I think. It's possible that some FreeBSD devs
> will ask us to monitor a few things or capture some syscall returns, or try
> some workarounds, and this might require some development. So in short, the
> earlier the better if we want to be ready for the next occurrence.
>
> Cheers,
> Willy
>


Re: HaProxy Hang

2017-04-18 Thread Willy Tarreau
Hi David,

On Tue, Apr 18, 2017 at 10:33:40AM +0100, David King wrote:
> Hi All
> 
> Just to confirm Willy's theory: we had the hang at exactly the time
> specified this morning.

I could recycle myself in a new church of which I would be the prophet...
well, maybe it already exists; we have thousands of followers after all :-)

More seriously, I think it will be useful to report a bug to the FreeBSD
project. There are quite a number of elements, possibly nothing that can
make it obvious where the problem could be, but a number of hypotheses
can be ruled out already, I think. It's possible that some FreeBSD devs
will ask us to monitor a few things or capture some syscall returns, or try
some workarounds, and this might require some development. So in short, the
earlier the better if we want to be ready for the next occurrence.

Cheers,
Willy



Re: HaProxy Hang

2017-04-18 Thread David King
Hi All

Just to confirm Willy's theory: we had the hang at exactly the time
specified this morning.

Sadly, due to a bank holiday in the UK yesterday, we didn't set up the
truss and monitoring before the hang occurred.

Was the hang seen by everyone?

Thanks

Dave

On 6 April 2017 at 14:56, Mark S  wrote:

> On Mon, 03 Apr 2017 12:45:57 -0400, Dave Cottlehuber wrote:
>
>> On Mon, 13 Mar 2017, at 13:31, David King wrote:
>>
>>> Hi All
>>>
>>> Apologies for the delay in response, I've been out of the country for
>>> the last week.
>>>
>>> Mark, my gut feeling is that it is network-related in some way, so I
>>> thought we could compare the networking setup of our systems.
>>>
>>> You mentioned you see the hang across geo locations, so I assume there
>>> isn't layer 2 connectivity between all of the hosts? Is there any
>>> back-end connectivity between the haproxy hosts?
>>>
>>
>> Following up on this, some interesting points but nothing useful.
>>
>> - Mark & I see the hang at almost exactly the same time on the same day:
>> 2017-02-27T14:36Z give or take a minute either way
>>
>> - I see the hang in 3 different regions using 2 different hosting
>> providers on both clustered and non-clustered services, but all on
>> FreeBSD 11.0R amd64. There is some dependency between these systems but
>> nothing unusual (logging backends, reverse proxied services etc).
>>
>> - our servers don't have a specific workload that would allow them all
>> to run out of some internal resource at the same time, as their reboot
>> and patch cycles are reasonably different - typically a few days elapse
>> between first patches and last reboots unless it's deemed high risk
>>
>> - our networking setup is not complex but typical FreeBSD:
>> - LACP bonded Gbit igb(4) NICs
>> - CARP failover for both ipv4 & ipv6 addresses
>> - either direct to haproxy for http & TLS traffic, or via spiped to
>> decrypt intra-server traffic
>> - haproxy directs traffic into jailed services
>> - our overall load and throughput is low but consistent
>> - pf firewall
>> - rsyslog for logging, along with riemann and graphite for metrics
>> - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
>> - haproxy 1.6.10 + libressl at the time
>>
>> As I'm not one for conspiracy theories or weird coincidences, somebody
>> port scanning the internet with an Unexpectedly Evil Packet Combo seems
>> the most plausible explanation.  I cannot find an alternative that would
>> fit the scenario of 3 different organisations with geographically
>> distributed equipment and unconnected services reporting an unusual
>> interruption on the same day and almost the same time.
>>
>> Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and latest
>> libressl and seen no recurrence, just like the last 8+ months or so
>> since first deploying haproxy on FreeBSD instead of debian & nginx.
>>
>> If the issue recurs I plan to run a small cyclic traffic capture with
>> tcpdump and wait for it to recur; see
>> https://superuser.com/questions/286062/practical-tcpdump-examples
>>
>> Let me know if I can help or clarify further.
>>
>> A+
>> Dave
>>
>
> Hi Dave,
>
> Thanks for keeping this thread going.  As for the initial report with all
> servers hanging, I too run NTP (actually OpenNTPd), and these only speak to
> in-house stratum-2 servers.
>
> As a follow-up to my initial report, I upgraded to 1.7.3 shortly
> thereafter.
>
> I've had one recurrence of this "hang", but this time it did not affect
> all of my servers; instead, it affected only 2 (the busier ones).  If the
> theory about some timing event (leap second, counter wrapping, etc.) is
> correct, perhaps it only affects processes actually accepting or handling a
> connection in a particular state at the time.
>
> I have not yet upgraded beyond 1.7.3.
>
> Best,
> -=Mark
>


Re: HaProxy Hang

2017-04-06 Thread Mark S
On Mon, 03 Apr 2017 12:45:57 -0400, Dave Cottlehuber wrote:

> On Mon, 13 Mar 2017, at 13:31, David King wrote:
>
>> Hi All
>>
>> Apologies for the delay in response, I've been out of the country for
>> the last week.
>>
>> Mark, my gut feeling is that it is network-related in some way, so I
>> thought we could compare the networking setup of our systems.
>>
>> You mentioned you see the hang across geo locations, so I assume there
>> isn't layer 2 connectivity between all of the hosts? Is there any
>> back-end connectivity between the haproxy hosts?
>
> Following up on this, some interesting points but nothing useful.
>
> - Mark & I see the hang at almost exactly the same time on the same day:
> 2017-02-27T14:36Z give or take a minute either way
>
> - I see the hang in 3 different regions using 2 different hosting
> providers on both clustered and non-clustered services, but all on
> FreeBSD 11.0R amd64. There is some dependency between these systems but
> nothing unusual (logging backends, reverse proxied services etc).
>
> - our servers don't have a specific workload that would allow them all
> to run out of some internal resource at the same time, as their reboot
> and patch cycles are reasonably different - typically a few days elapse
> between first patches and last reboots unless it's deemed high risk
>
> - our networking setup is not complex but typical FreeBSD:
> - LACP bonded Gbit igb(4) NICs
> - CARP failover for both ipv4 & ipv6 addresses
> - either direct to haproxy for http & TLS traffic, or via spiped to
> decrypt intra-server traffic
> - haproxy directs traffic into jailed services
> - our overall load and throughput is low but consistent
> - pf firewall
> - rsyslog for logging, along with riemann and graphite for metrics
> - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
> - haproxy 1.6.10 + libressl at the time
>
> As I'm not one for conspiracy theories or weird coincidences, somebody
> port scanning the internet with an Unexpectedly Evil Packet Combo seems
> the most plausible explanation.  I cannot find an alternative that would
> fit the scenario of 3 different organisations with geographically
> distributed equipment and unconnected services reporting an unusual
> interruption on the same day and almost the same time.
>
> Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and latest
> libressl and seen no recurrence, just like the last 8+ months or so
> since first deploying haproxy on FreeBSD instead of debian & nginx.
>
> If the issue recurs I plan to run a small cyclic traffic capture with
> tcpdump and wait for it to recur; see
> https://superuser.com/questions/286062/practical-tcpdump-examples
>
> Let me know if I can help or clarify further.
>
> A+
> Dave

Hi Dave,

Thanks for keeping this thread going.  As for the initial report with all
servers hanging, I too run NTP (actually OpenNTPd), and these only speak
to in-house stratum-2 servers.

As a follow-up to my initial report, I upgraded to 1.7.3 shortly
thereafter.

I've had one recurrence of this "hang", but this time it did not affect
all of my servers; instead, it affected only 2 (the busier ones).  If the
theory about some timing event (leap second, counter wrapping, etc.) is
correct, perhaps it only affects processes actually accepting or handling
a connection in a particular state at the time.

I have not yet upgraded beyond 1.7.3.

Best,
-=Mark



Re: HaProxy Hang

2017-04-05 Thread Willy Tarreau
On Wed, Apr 05, 2017 at 10:10:49AM +0100, David King wrote:
> I'm going to stick with version 1.7.2 till then, so we should have a
> comparison.

OK as you like :-)

> If we think we may have a hang at Tue Apr 18, 9:38, is there any specific
> logging we should set up on a server at that time?

Maybe detailed truss output if it happens, to get all arguments and a few
other details. Unfortunately, for now I don't see an easy way to reset
the kqueue fd and reinitialize all events from scratch (though it's
possible, it just requires quite some code and will come with some bugs).
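
For example, something along these lines on the FreeBSD side (a sketch only;
the exact truss flags and the pgrep call are assumptions, adjust as needed):

  # attach to the running process, follow children, timestamp each syscall
  truss -f -d -s 256 -o /var/tmp/haproxy.truss -p $(pgrep -x haproxy)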

> Is it worth setting at
> least one server to have nokqueue set at that time?

Well, possibly if you have multiple servers and all of them die at the same
time, that could avoid a complete outage. And maybe nothing will happen; it
was a pure guess on my part, but given that these more or less match issues
we had a long time ago with looping timers, I would not be surprised if
it happens this way.

Willy



Re: HaProxy Hang

2017-04-05 Thread David King
I'm going to stick with version 1.7.2 till then, so we should have a
comparison.

If we think we may have a hang at Tue Apr 18, 9:38, is there any specific
logging we should set up on a server at that time? Is it worth setting at
least one server to have nokqueue set at that time?

Thanks

David

On 5 April 2017 at 07:00, Willy Tarreau  wrote:

> Hi all,
>
> On Wed, Apr 05, 2017 at 01:34:20AM +0200, Lukas Tribus wrote:
> > Can we be absolutely positive that those hangs are not directly or
> > indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5, for
> > example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD
> 11.0-p8"?
>
> I don't believe in this at all, unfortunately. The issues that were faced
> on FreeBSD in earlier versions were related to connect() occasionally
> succeeding synchronously and haproxy did not handle this case cleanly
> (it initially used to poll then validate the connect() a second time,
> and fixing this broke the rest).
>
> > There may be multiple and different symptoms of those bugs, so even if the
> > descriptions in those threads don't match your case 100%, it may still be
> > caused by the same underlying bug.
> >
> > A confirmation that those hangs are still happening in v1.7.5 would be
> > crucial.
>
> I'm pretty sure they will still happen.
>
> > The time coincidence is intriguing, but I would not spend too much time
> > on that. Collecting actual traces (like strace or its freebsd equivalent)
> > and capture dumps is more likely to achieve progress, imo.
>
> In fact I do think there's an operating system issue here (and those who
> know me also know that I'm not one who tries to hide haproxy bugs). What
> I suspect is that there's a problem when time wraps. A 1 kHz scheduler
> wraps every 49.7 days. With clocks synchronized over NTP, all of them
> wrap exactly at the same time. If the issue is there, it may happen
> again on Tue Apr 18, 9:38 (13 days from now).
>
> It could have been haproxy's time wrapping and causing the issue, so I
> modified it to add an offset and make the time wrap 5s after startup,
> and couldn't trigger the problem on a FreeBSD system, even after
> multiple attempts. And the time of first crash reported above doesn't
> match any wrapping pattern (0x58b43950). Also, reporters indicated
> that the issue appeared after migrating to FreeBSD 11 and no such
> issue was ever reported on earlier versions.
>
> Also Dave reported this, which is totally abnormal:
>
>   kqueue(0,0,0) = 22 (EINVAL)
>
> and the fact that the system panicked, which cannot be an haproxy issue.
>
> Another point, Dave reported a loss of network connectivity at the
> same moment when it last happened. Dave, could this be related to
> other FreeBSD nodes running FreeBSD as well and rebooting or any
> such thing ?
>
> I think that at this point we should discuss with some FreeBSD
> maintainers and see what can be done to track this problem down, even
> if it means adding some debugging code in the kqueue loop to help
> troubleshoot this, or using it differently if we're doing something
> wrong.
>
> Given that Mark indicated that reloading the process fixed the problem
> (except he had to manually kill the previous one), one possible workaround
> might be to detect the EINVAL, and try to reinitialize kqueue or switch
> to poll() if this happens (and emit loud warnings in the logs).
>
> > Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish
> y'all
> > a good night,
>
> There's still a faint possibility of a widespread attack but while I
> can easily imagine some such devices sending a "packet of death"
> exploiting a bug in an OS, I don't believe it would make kqueue()
> return EINVAL in haproxy.
>
> Cheers,
> Willy
>


Re: HaProxy Hang

2017-04-05 Thread Willy Tarreau
Hi all,

On Wed, Apr 05, 2017 at 01:34:20AM +0200, Lukas Tribus wrote:
> Can we be absolutely positive that those hangs are not directly or
> indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5, for
> example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD 11.0-p8"?

I don't believe in this at all, unfortunately. The issues that were faced
on FreeBSD in earlier versions were related to connect() occasionally
succeeding synchronously and haproxy did not handle this case cleanly
(it initially used to poll then validate the connect() a second time,
and fixing this broke the rest).

> There may be multiple and different symptoms of those bugs, so even if the
> descriptions in those threads don't match your case 100%, it may still be
> caused by the same underlying bug.
> 
> A confirmation that those hangs are still happening in v1.7.5 would be
> crucial.

I'm pretty sure they will still happen.

> The time coincidence is intriguing, but I would not spend too much time
> on that. Collecting actual traces (like strace or its freebsd equivalent)
> and capture dumps is more likely to achieve progress, imo.

In fact I do think there's an operating system issue here (and those who
know me also know that I'm not one who tries to hide haproxy bugs). What
I suspect is that there's a problem when time wraps. A 1 kHz scheduler
wraps every 49.7 days. With clocks synchronized over NTP, all of them
wrap exactly at the same time. If the issue is there, it may happen
again on Tue Apr 18, 9:38 (13 days from now).
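
As a quick sanity check of that figure (plain shell arithmetic, assuming a
32-bit millisecond counter):

  # 2^32 ms divided by the number of ms in a day
  echo 'scale=2; 2^32 / (1000 * 60 * 60 * 24)' | bc
  # -> 49.71 days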

It could have been haproxy's time wrapping and causing the issue, so I
modified it to add an offset and make the time wrap 5s after startup,
and couldn't trigger the problem on a FreeBSD system, even after
multiple attempts. And the time of first crash reported above doesn't
match any wrapping pattern (0x58b43950). Also, reporters indicated
that the issue appeared after migrating to FreeBSD 11 and no such
issue was ever reported on earlier versions.

Also Dave reported this, which is totally abnormal:

  kqueue(0,0,0) = 22 (EINVAL)

and the fact that the system panicked, which cannot be an haproxy issue.

Another point, Dave reported a loss of network connectivity at the
same moment when it last happened. Dave, could this be related to
other FreeBSD nodes running FreeBSD as well and rebooting or any
such thing ?

I think that at this point we should discuss with some FreeBSD
maintainers and see what can be done to track this problem down, even
if it means adding some debugging code in the kqueue loop to help
troubleshoot this, or using it differently if we're doing something
wrong.

Given that Mark indicated that reloading the process fixed the problem
(except he had to manually kill the previous one), one possible workaround 
might be to detect the EINVAL, and try to reinitialize kqueue or switch
to poll() if this happens (and emit loud warnings in the logs).

> Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish y'all
> a good night,

There's still a faint possibility of a widespread attack but while I
can easily imagine some such devices sending a "packet of death"
exploiting a bug in an OS, I don't believe it would make kqueue()
return EINVAL in haproxy.

Cheers,
Willy



Re: HaProxy Hang

2017-04-04 Thread Dave Cottlehuber
On Wed, 5 Apr 2017, at 01:34, Lukas Tribus wrote:
> Hello,
> 
> 
> On 05.04.2017 at 00:27, David King wrote:
> > Hi Dave
> >
> > Thanks for the info. Interestingly, we had the crash at exactly the
> > same time, so we are 3 for 3 on that.
> >
> > The setups sound very similar, but given we all saw the issue at the
> > same time, it really points to something more global.
> >
> > We are using NTP from our firewalls, which in turn get it from our
> > ISP, so I doubt that is the cause. It could be external port scanning
> > as you suggest, or maybe a leap second of some sort?
> >
> > Willy, any thoughts on the time coincidence?
> 
> Can we be absolutely positive that those hangs are not directly or 
> indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5, 
> for example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD 
> 11.0-p8"?
>
> There may be multiple and different symptoms of those bugs, so even if
> the descriptions in those threads don't match your case 100%, it may
> still be caused by the same underlying bug.

I'll update from 1.7.3 to 1.7.5 with those goodies tomorrow and see how
that goes.

A+
Dave



Re: HaProxy Hang

2017-04-04 Thread Lukas Tribus

Hello,


On 05.04.2017 at 00:27, David King wrote:

> Hi Dave
>
> Thanks for the info. Interestingly, we had the crash at exactly the
> same time, so we are 3 for 3 on that.
>
> The setups sound very similar, but given we all saw the issue at the
> same time, it really points to something more global.
>
> We are using NTP from our firewalls, which in turn get it from our
> ISP, so I doubt that is the cause. It could be external port scanning
> as you suggest, or maybe a leap second of some sort?
>
> Willy, any thoughts on the time coincidence?


Can we be absolutely positive that those hangs are not directly or
indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5,
for example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD
11.0-p8"?
There may be multiple and different symptoms of those bugs, so even if
the descriptions in those threads don't match your case 100%, it may
still be caused by the same underlying bug.


A confirmation that those hangs are still happening in v1.7.5 would be
crucial.


The time coincidence is intriguing, but I would not spend too much time
on that. Collecting actual traces (like strace or its freebsd
equivalent) and capture dumps is more likely to achieve progress, imo.



Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish 
y'all a good night,

lukas




Re: HaProxy Hang

2017-04-04 Thread David King
Hi Dave

Thanks for the info. Interestingly, we had the crash at exactly the same
time, so we are 3 for 3 on that.

The setups sound very similar, but given we all saw the issue at the same
time, it really points to something more global.

We are using NTP from our firewalls, which in turn get it from our ISP, so
I doubt that is the cause. It could be external port scanning as you
suggest, or maybe a leap second of some sort?

Willy, any thoughts on the time coincidence?

Thanks

Dave





On 3 April 2017 at 17:45, Dave Cottlehuber  wrote:

> On Mon, 13 Mar 2017, at 13:31, David King wrote:
> > Hi All
> >
> > Apologies for the delay in response, I've been out of the country for
> > the last week.
> >
> > Mark, my gut feeling is that it is network-related in some way, so I
> > thought we could compare the networking setup of our systems.
> >
> > You mentioned you see the hang across geo locations, so I assume there
> > isn't layer 2 connectivity between all of the hosts? Is there any
> > back-end connectivity between the haproxy hosts?
>
> Following up on this, some interesting points but nothing useful.
>
> - Mark & I see the hang at almost exactly the same time on the same day:
> 2017-02-27T14:36Z give or take a minute either way
>
> - I see the hang in 3 different regions using 2 different hosting
> providers on both clustered and non-clustered services, but all on
> FreeBSD 11.0R amd64. There is some dependency between these systems but
> nothing unusual (logging backends, reverse proxied services etc).
>
> - our servers don't have a specific workload that would allow them all
> to run out of some internal resource at the same time, as their reboot
> and patch cycles are reasonably different - typically a few days elapse
> between first patches and last reboots unless it's deemed high risk
>
> - our networking setup is not complex but typical FreeBSD:
> - LACP bonded Gbit igb(4) NICs
> - CARP failover for both ipv4 & ipv6 addresses
> - either direct to haproxy for http & TLS traffic, or via spiped to
> decrypt intra-server traffic
> - haproxy directs traffic into jailed services
> - our overall load and throughput is low but consistent
> - pf firewall
> - rsyslog for logging, along with riemann and graphite for metrics
> - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
> - haproxy 1.6.10 + libressl at the time
>
> As I'm not one for conspiracy theories or weird coincidences, somebody
> port scanning the internet with an Unexpectedly Evil Packet Combo seems
> the most plausible explanation.  I cannot find an alternative that would
> fit the scenario of 3 different organisations with geographically
> distributed equipment and unconnected services reporting an unusual
> interruption on the same day and almost the same time.
>
> Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and latest
> libressl and seen no recurrence, just like the last 8+ months or so
> since first deploying haproxy on FreeBSD instead of debian & nginx.
>
> If the issue recurs I plan to run a small cyclic traffic capture with
> tcpdump and wait for it to recur; see
> https://superuser.com/questions/286062/practical-tcpdump-examples
>
> Let me know if I can help or clarify further.
>
> A+
> Dave
>


Re: HaProxy Hang

2017-04-03 Thread Dave Cottlehuber
On Mon, 13 Mar 2017, at 13:31, David King wrote:
> Hi All
> 
> Apologies for the delay in response, I've been out of the country for
> the last week.
> 
> Mark, my gut feeling is that it is network-related in some way, so I
> thought we could compare the networking setup of our systems.
> 
> You mentioned you see the hang across geo locations, so I assume there
> isn't layer 2 connectivity between all of the hosts? Is there any
> back-end connectivity between the haproxy hosts?

Following up on this, some interesting points but nothing useful.

- Mark & I see the hang at almost exactly the same time on the same day:
2017-02-27T14:36Z give or take a minute either way

- I see the hang in 3 different regions using 2 different hosting
providers on both clustered and non-clustered services, but all on
FreeBSD 11.0R amd64. There is some dependency between these systems but
nothing unusual (logging backends, reverse proxied services etc).

- our servers don't have a specific workload that would allow them all
to run out of some internal resource at the same time, as their reboot
and patch cycles are reasonably different - typically a few days elapse
between first patches and last reboots unless it's deemed high risk

- our networking setup is not complex but typical FreeBSD:
- LACP bonded Gbit igb(4) NICs
- CARP failover for both ipv4 & ipv6 addresses
- either direct to haproxy for http & TLS traffic, or via spiped to
decrypt intra-server traffic 
- haproxy directs traffic into jailed services
- our overall load and throughput is low but consistent
- pf firewall
- rsyslog for logging, along with riemann and graphite for metrics
- all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
- haproxy 1.6.10 + libressl at the time

As I'm not one for conspiracy theories or weird coincidences, somebody
port scanning the internet with an Unexpectedly Evil Packet Combo seems
the most plausible explanation.  I cannot find an alternative that would
fit the scenario of 3 different organisations with geographically
distributed equipment and unconnected services reporting an unusual
interruption on the same day and almost the same time.

Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and latest
libressl and seen no recurrence, just like the last 8+ months or so
since first deploying haproxy on FreeBSD instead of debian & nginx.

If the issue recurs I plan to run a small cyclic traffic capture with
tcpdump and wait for it to recur; see
https://superuser.com/questions/286062/practical-tcpdump-examples
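
A minimal ring-buffer capture along those lines (the interface name, file
size and count here are assumptions, not from this thread):

  # keep 10 rotating files of ~100 MB each; the oldest is overwritten
  tcpdump -i igb0 -s0 -C 100 -W 10 -w /var/tmp/haproxy-ring.pcap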

Let me know if I can help or clarify further.

A+
Dave



Re: HaProxy Hang

2017-03-13 Thread David King
Hi All

Apologies for the delay in response, I've been out of the country for the
last week.

Mark, my gut feeling is that it is network-related in some way, so I
thought we could compare the networking setup of our systems.

You mentioned you see the hang across geo locations, so I assume there
isn't layer 2 connectivity between all of the hosts? Is there any back-end
connectivity between the haproxy hosts?

Ours are all layer 2 but fairly complex. We have 6 connected NICs which
are bonded into 3 LACP groups. On top of the LACP we have a number of
VLAN interfaces. We also have a couple of normal IP aliases and a number
of CARP IPs on top of that.
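
For comparison, a FreeBSD lagg/VLAN stack of that shape is typically
declared in rc.conf roughly like this (interface names, VLAN tag and
address are illustrative only, not David's actual config):

  cloned_interfaces="lagg0 vlan100"
  ifconfig_igb0="up"
  ifconfig_igb1="up"
  ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1"
  ifconfig_vlan100="inet 192.0.2.10/24 vlan 100 vlandev lagg0"

CARP VIPs and plain aliases would then be layered on top of the vlan
interfaces.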

One commonality is NTP, as they all sync from our own upstream NTP services,
but having looked through the logs, there isn't a recent NTP update when
the hang occurs and I can't see any time jump.

Other things which are set up on the host:
- local rsyslog which sends logs to a centralised host
- crons every minute for each jail (4 jails) to monitor the health of the
  haproxy service
- crons every minute for each jail (4 jails) to gather stats from haproxy
  using the haproxy stats frontend
- pf running on the host
- Chef running every 30 mins, with splayed run times

Does anything match up here which could cause these issues?

Thanks

Dave



On 6 March 2017 at 20:28, Mark S  wrote:

> On Mon, 06 Mar 2017 15:02:43 -0500, Willy Tarreau  wrote:
>
>> OK so that means that haproxy could have hung in a day or two, then your
>> case is much more common than one of the other reports. If your front LB
>> is fair between the 6 servers, that could be related to a total number of
>> requests or connections or something like this.
>>
>
> Another relevant point is that these servers are tied together using
> upstream, GeoIP-based DNS load balancing.  So the request rate across
> servers varies quite a bit depending on the location.  This would make a
> synchronized failure based on total requests less likely.
>
>> I'm thinking about other things:
>>   - if you're doing a lot of SSL we could imagine an issue with random
>> generation using /dev/random instead of /dev/urandom. I've met this
>> issue a long time ago on some apache servers where all the entropy
>> was progressively consumed until it was not possible anymore to get
>> a connection.
>>
>
> I'll set up a script to capture the netstat and other info prior to
> reloading should this issue recur.
>
> As for SSL, yes, we do a fair bit of SSL (about 30% of total request
> count) and HAProxy does the TLS termination and then hands off via TCP
> proxy.
>
> Best,
> -=Mark S.
>


Re: HaProxy Hang

2017-03-06 Thread Jerry Scharf

Willy,

Per your comment on /dev/random exhaustion: I think running haveged on
servers doing crypto work is/should be best practice.
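
On these FreeBSD hosts that might look like the following (assuming the
security/haveged port; the package and rc script names are assumptions):

  pkg install haveged
  sysrc haveged_enable=YES
  service haveged start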


jerry
On 3/6/17 12:02 PM, Willy Tarreau wrote:
> Hi Mark,
>
> On Mon, Mar 06, 2017 at 02:49:28PM -0500, Mark S wrote:
>> As for the timing issue, I can add to the discussion with a few related
>> data points.  In short, system uptime does not seem to be a commonality
>> to my situation.
>
> thanks!
>
>> 1) I had this issue affect 6 servers, spread across 5 data centers (only 2
>> servers are in the same facility.)  All servers stopped processing
>> requests at roughly the same moment, certainly within the same minute.
>> All servers running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally
>> against OpenSSL-1.0.2k
>
> OK.
>
>> 2) System uptime was not at all similar across these servers, although
>> chances are most servers' HAProxy process start time would be similar.
>> The servers with the highest system uptime were at about 27 days at the
>> time of the incident, while the shortest were under a day or two.
>
> OK so that means that haproxy could have hung in a day or two, then your
> case is much more common than one of the other reports. If your front LB
> is fair between the 6 servers, that could be related to a total number of
> requests or connections or something like this.
>
>> 3) HAProxy configurations are similar, but not exactly consistent between
>> servers - different IPs on the frontend, different ACLs and backends.
>
> OK.
>
>> 4) The only synchronized application common to all of these servers is
>> OpenNTPd.
>
> Is there any risk that the ntpd causes time jumps into the future or
> the past for whatever reason? Maybe there's something with kqueue and
> time jumps in recent versions?
>
>> 5) I have since upgraded to HAProxy-1.7.3, same build process; the full
>> version output is below - and will of course report any observed issues.
>>
>> haproxy -vv
>> HA-Proxy version 1.7.3 2017/02/28
>
> (...)
>
> Everything there looks pretty standard. If it dies again it could be good
> to try with "nokqueue" in the global section (or start haproxy with -dk)
> to disable kqueue and switch to poll. It will eat a bit more CPU, so don't
> do this on all nodes at once.
>
> I'm thinking about other things:
>   - if you're doing a lot of SSL we could imagine an issue with random
>     generation using /dev/random instead of /dev/urandom. I've met this
>     issue a long time ago on some apache servers where all the entropy
>     was progressively consumed until it was not possible anymore to get
>     a connection.
>
>   - it could be useful to run "netstat -an" on a dead node before killing
>     haproxy and archive this for later analysis. It may reveal that all
>     file descriptors were used by close_wait connections (indicating a
>     close bug in haproxy) or something like this. If instead you see a
>     lot of FIN_WAIT1 or FIN_WAIT2 it may indicate an issue with some
>     external firewall or pf blocking some final traffic and leading to
>     socket space exhaustion.
>
> If you have the same issue that was reported with kevent() being called
> in loops and returning an error, you may definitely see tons of close_wait
> and it will indicate an issue with this poller, though I have no idea
> which one, especially since it doesn't change often and *seems* to work
> with previous versions.
>
> Best regards,
> Willy



--
Soundhound Devops
"What could possibly go wrong?"




Re: HaProxy Hang

2017-03-06 Thread Mark S

On Mon, 06 Mar 2017 15:02:43 -0500, Willy Tarreau wrote:

> OK so that means that haproxy could have hung in a day or two, then your
> case is much more common than one of the other reports. If your front LB
> is fair between the 6 servers, that could be related to a total number of
> requests or connections or something like this.

Another relevant point is that these servers are tied together using
upstream, GeoIP-based DNS load balancing.  So the request rate across
servers varies quite a bit depending on the location.  This would make a
synchronized failure based on total requests less likely.

> I'm thinking about other things:
>   - if you're doing a lot of SSL we could imagine an issue with random
>     generation using /dev/random instead of /dev/urandom. I've met this
>     issue a long time ago on some apache servers where all the entropy
>     was progressively consumed until it was not possible anymore to get
>     a connection.

I'll set up a script to capture the netstat and other info prior to
reloading should this issue recur.

As for SSL, yes, we do a fair bit of SSL (about 30% of total request
count) and HAProxy does the TLS termination and then hands off via TCP
proxy.
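
For context, a terminate-then-hand-off-via-TCP setup of that shape looks
roughly like this in haproxy terms (a sketch only; names, paths, ports and
addresses are invented, not Mark's configuration):

  frontend fe_tls
      bind :443 ssl crt /usr/local/etc/ssl/site.pem
      mode tcp
      default_backend be_app

  backend be_app
      mode tcp
      server app1 10.0.0.11:8080 check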


Best,
-=Mark S.



Re: HaProxy Hang

2017-03-06 Thread Willy Tarreau
Hi Mark,

On Mon, Mar 06, 2017 at 02:49:28PM -0500, Mark S wrote:
> As for the timing issue, I can add to the discussion with a few related data
> points.  In short, system uptime does not seem to be a commonality to my
> situation.

thanks!

> 1) I had this issue affect 6 servers, spread across 5 data centers (only 2
> servers are in the same facility.)  All servers stopped processing requests
> at roughly the same moment, certainly within the same minute.  All servers
> running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally against
> OpenSSL-1.0.2k

OK.

> 2) System uptime was not at all similar across these servers, although
> chances are most servers HAProxy process start time would be similar.  The
> servers with the highest system uptime were at about 27 days at the time of
> the incident, while the shortest were under a day or two.

OK so that means that haproxy could have hung in a day or two, then your
case is much more common than one of the other reports. If your front LB
is fair between the 6 servers, that could be related to a total number of
requests or connections or something like this.

> 3) HAProxy configurations are similar, but not exactly consistent between
> servers - different IPs on the frontend, different ACLs and backends.

OK.

> 4) The only synchronized application common to all of these servers is
> OpenNTPd.

Is there any risk that the ntpd causes time jumps into the future or
the past for whatever reason? Maybe there's something with kqueue and
time jumps in recent versions?

> 5) I have since upgraded to HAProxy-1.7.3, same build process: the full
> version output is below - and will of course report any observed issues.
> 
> haproxy -vv
> HA-Proxy version 1.7.3 2017/02/28
(...)

Everything there looks pretty standard. If it dies again it could be good
to try with "nokqueue" in the global section (or start haproxy with -dk)
to disable kqueue and switch to poll. It will eat a bit more CPU, so don't
do this on all nodes at once.

I'm thinking about other things:
  - if you're doing a lot of SSL we could imagine an issue with random
generation using /dev/random instead of /dev/urandom. I've met this
issue a long time ago on some apache servers where all the entropy
was progressively consumed until it was not possible anymore to get
a connection.

  - it could be useful to run "netstat -an" on a dead node before killing
haproxy and archive this for later analysis. It may reveal that all
file descriptors were used by close_wait connections (indicating a
close bug in haproxy) or something like this. If instead you see a
lot of FIN_WAIT1 or FIN_WAIT2 it may indicate an issue with some
external firewall or pf blocking some final traffic and leading to
socket space exhaustion (see the capture sketch below).
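
A capture sketch along those lines, to run before restarting the hung
process (the file paths and the extra sockstat call are assumptions):

  ts=$(date +%Y%m%dT%H%M%S)
  netstat -an > /var/tmp/netstat-$ts.txt
  sockstat -46 > /var/tmp/sockstat-$ts.txt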

If you have the same issue that was reported with kevent() being called
in loops and returning an error, you may definitely see tons of close_wait
and it will indicate an issue with this poller, though I have no idea
which one, especially since it doesn't change often and *seems* to work
with previous versions.

Best regards,
Willy



Re: HaProxy Hang

2017-03-06 Thread Mark S

On Mon, 06 Mar 2017 01:35:19 -0500, Willy Tarreau wrote:

> On Fri, Mar 03, 2017 at 07:54:46PM +0300, Dmitry Sivachenko wrote:
>>
>>> On 03 Mar 2017, at 19:36, David King wrote:
>>>
>>> Thanks for the response!
>>> That's interesting; I don't suppose you have the details of the other
>>> issues?
>>
>> First report is
>> https://www.mail-archive.com/haproxy@formilux.org/msg25060.html
>> Second one
>> https://www.mail-archive.com/haproxy@formilux.org/msg25067.html
>
> Thanks for the links Dmitry.
>
> That's indeed really odd. If all hang at the same time, timing or uptime
> looks like a good candidate. There's not much which is really specific
> to FreeBSD in haproxy. However, the kqueue poller is only used there
> (and on OpenBSD), and uses timing for the timeout. Thus it sounds likely
> that there could be an issue there, either in haproxy or FreeBSD.
>
> A hang every 2-3 months makes me think about the 49.7 days it takes for
> a millisecond counter to wrap. These bugs are hard to troubleshoot. We
> used to have such an issue a long time ago in linux 2.4 when the timer
> was set to 100 Hz; it required 497 days to know whether the bug was
> solved or not (obviously it now is).
>
> I've just compared ev_epoll.c and ev_kqueue.c in case I could spot
> anything obvious, but from what I'm seeing they're pretty much similar,
> so I don't see what could cause this bug there. And since it apparently
> works fine on FreeBSD 10, at best one of our bugs could only trigger a
> system bug if it exists.
>
> David, if your workload permits it, you can disable kqueue and haproxy
> will automatically fall back to poll. For this you can simply put
> "nokqueue" in the global section. poll() doesn't scale as well as
> kqueue(); it's cheaper at low connection counts but it will use more
> CPU above ~1000 concurrent connections.
>
> Regards,
> Willy



Hi Willy,

As for the timing issue, I can add to the discussion with a few related
data points.  In short, system uptime does not seem to be a commonality
to my situation.

1) I had this issue affect 6 servers, spread across 5 data centers (only 2
servers are in the same facility.)  All servers stopped processing
requests at roughly the same moment, certainly within the same minute.
All servers running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally
against OpenSSL-1.0.2k.

2) System uptime was not at all similar across these servers, although
chances are most servers' HAProxy process start time would be similar.
The servers with the highest system uptime were at about 27 days at the
time of the incident, while the shortest were under a day or two.

3) HAProxy configurations are similar, but not exactly consistent between
servers - different IPs on the frontend, different ACLs and backends.

4) The only synchronized application common to all of these servers is
OpenNTPd.

5) I have since upgraded to HAProxy-1.7.3, same build process; the full
version output is below - and I will of course report any observed issues.


haproxy -vv
HA-Proxy version 1.7.3 2017/02/28
Copyright 2000-2017 Willy Tarreau 

Build options :
  TARGET  = freebsd
  CPU = generic
  CC  = clang
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built without compression support (neither USE_ZLIB nor USE_SLZ are set)
Compression algorithms supported : identity("identity")
Built with OpenSSL version : OpenSSL 1.0.2k  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY

Available polling systems :
 kqueue : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
[SPOE] spoe
[TRACE] trace
[COMP] compression

Cheers,
-=Mark



Re: HaProxy Hang

2017-03-05 Thread Willy Tarreau
On Fri, Mar 03, 2017 at 07:54:46PM +0300, Dmitry Sivachenko wrote:
> 
> > On 03 Mar 2017, at 19:36, David King  wrote:
> > 
> > Thanks for the response!
> > That's interesting; I don't suppose you have the details of the other issues?
> 
> 
> First report is 
> https://www.mail-archive.com/haproxy@formilux.org/msg25060.html
> Second one
> https://www.mail-archive.com/haproxy@formilux.org/msg25067.html

Thanks for the links Dmitry.

That's indeed really odd. If all hang at the same time, timing or uptime
looks like a good candidate. There's not much which is really specific
to FreeBSD in haproxy. However, the kqueue poller is only used there
(and on OpenBSD), and uses timing for the timeout. Thus it sounds likely
that there could be an issue there, either in haproxy or FreeBSD.

A hang every 2-3 months makes me think about the 49.7 days it takes for
a millisecond counter to wrap. These bugs are hard to troubleshoot. We
used to have such an issue a long time ago in linux 2.4 when the timer
was set to 100 Hz; it required 497 days to know whether the bug was
solved or not (obviously it now is).

I've just compared ev_epoll.c and ev_kqueue.c in case I could spot
anything obvious, but from what I'm seeing they're pretty much similar,
so I don't see what could cause this bug there. And since it apparently
works fine on FreeBSD 10, at best one of our bugs could only trigger a
system bug if it exists.

David, if your workload permits it, you can disable kqueue and haproxy
will automatically fall back to poll. For this you can simply put
"nokqueue" in the global section. poll() doesn't scale as well as
kqueue(), it's cheaper on low connection counts but it will use more
CPU above ~1000 concurrent connections.

Regards,
Willy



Re: HaProxy Hang

2017-03-03 Thread Rainer Duffner

> On 03.03.2017 at 15:07, David King wrote:
> 
> Hi All
> 
> Hoping someone will be able to help, we're running a bit of an interesting
> setup.
> 
> We have 3 HAProxy nodes running FreeBSD 11.0; each host runs 4 jails, each
> running haproxy, but only one of the jails is under any real load.
> 
> 


Do you use ZFS?


We have some internal software (a monitoring agent of sorts) that also hangs
in jails from time to time.

The guy who wrote it found out it's because of mmap (I don't know the
specifics).

The processes end up unkillable in "D" state and we need to reboot the hosts
to fix it.

As the purpose of the hosts is not to run the agent, we usually let it hang
and restart it when it's convenient.

The systems are FreeBSD 10.3, though (running nginx and varnish in different
jails).






Re: HaProxy Hang

2017-03-03 Thread Dmitry Sivachenko

> On 03 Mar 2017, at 19:36, David King <king.c.da...@googlemail.com> wrote:
> 
> Thanks for the response!
> That's interesting; I don't suppose you have the details of the other issues?


First report is 
https://www.mail-archive.com/haproxy@formilux.org/msg25060.html
Second one
https://www.mail-archive.com/haproxy@formilux.org/msg25067.html

(in the same thread)



> 
> Thanks
> Dave 
> 
> On 3 March 2017 at 14:15, Dmitry Sivachenko <trtrmi...@gmail.com> wrote:
> 
> > On 03 Mar 2017, at 17:07, David King <king.c.da...@googlemail.com> wrote:
> >
> > Hi All
> >
> > Hoping someone will be able to help, we're running a bit of an interesting
> > setup.
> >
> > We have 3 HAProxy nodes running FreeBSD 11.0; each host runs 4 jails, each
> > running haproxy, but only one of the jails is under any real load.
> >
> 
> 
> If my memory does not fail me, this is the third report of a haproxy hang on
> FreeBSD, and all these reports are about FreeBSD-11.
> 
> I wonder if anyone experiences this issue with FreeBSD-10?
> 
> I am running a rather heavily loaded haproxy cluster on FreeBSD-10 (version
> 1.6.9 to be specific) and have never experienced any hangs (knock on wood).
>  
> 




Re: HaProxy Hang

2017-03-03 Thread David King
Thanks for the response!
That's interesting; I don't suppose you have the details of the other issues?

Thanks
Dave

On 3 March 2017 at 14:15, Dmitry Sivachenko <trtrmi...@gmail.com> wrote:

>
> > On 03 Mar 2017, at 17:07, David King <king.c.da...@googlemail.com>
> wrote:
> >
> > Hi All
> >
> > Hoping someone will be able to help, we're running a bit of an
> > interesting setup.
> >
> > We have 3 HAProxy nodes running FreeBSD 11.0; each host runs 4 jails,
> > each running haproxy, but only one of the jails is under any real load.
> >
>
>
> If my memory does not fail me, this is the third report of a haproxy hang
> on FreeBSD, and all these reports are about FreeBSD-11.
>
> I wonder if anyone experiences this issue with FreeBSD-10?
>
> I am running a rather heavily loaded haproxy cluster on FreeBSD-10 (version
> 1.6.9 to be specific) and have never experienced any hangs (knock on wood).
>


HaProxy Hang

2017-03-03 Thread David King
Hi All

Hoping someone will be able to help, we're running a bit of an interesting
setup.

We have 3 HAProxy nodes running FreeBSD 11.0; each host runs 4 jails, each
running haproxy, but only one of the jails is under any real load.

We use CARP to balance between the hosts and jails, which seems to be
working fine.

About once every 2-3 months, all the haproxy instances hang: the process
keeps running but doesn't accept any more connections, and the monitoring
socket is unresponsive. It doesn't produce any errors in the logs.
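
(A quick way to test that socket from the outside, assuming the stats socket
lives at /var/run/haproxy.sock and socat is installed:

  echo "show info" | socat stdio /var/run/haproxy.sock

A healthy process answers immediately; a hung one does not.)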

These hangs all happen within a couple of seconds, across all jails on all
hosts, taking down our frontend network; a restart of the haproxy service
fixes it.

We use Chef for config management, and all the run times are splayed, so
all the haproxy instances will have different uptimes.

Anyone have a good idea of what could cause this?

Thanks!!



haproxy -vv

HA-Proxy version 1.7.2 2017/01/13
Copyright 2000-2017 Willy Tarreau

Build options :
  TARGET  = freebsd
  CPU = generic
  CC  = cc
  CFLAGS  = -O2 -pipe -fstack-protector -fno-strict-aliasing -DFREEBSD_PORTS
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_CPU_AFFINITY=1 USE_OPENSSL=1
USE_LUA=1 USE_STATIC_PCRE=1 USE_PCRE_JIT=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2k-freebsd  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2j-freebsd  26 Sep 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.40 2017-01-11
Running on PCRE version : 8.40 2017-01-11
PCRE library supports JIT : yes
Built with Lua version : Lua 5.3.3
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY

Available polling systems :
 kqueue : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
[SPOE] spoe
[TRACE] trace
[COMP] compression


Re: HAProxy Hang during initial connection

2014-05-28 Thread Willy Tarreau
Hi John,

On Tue, May 27, 2014 at 08:08:27PM +, JDzialo John wrote:
> Hi Willy,
>
> Here is a capture of all traffic btwn the two servers using the host option.

Thank you.

> Basically traffic goes from haproxy to a web farm in a round robin fashion.
> These individual web servers are accessing a single file server (we are in
> the process of splitting this file server into multiple servers to spread
> the load).  This one file server is getting slammed all day with requests
> and that may be the root of the problem, but we have not found the smoking
> gun necessary to prove it.
>
> Should I also get a capture btwn our web farm and the file server?
>
> I have also cc'd our network administrator to help come down to a solution.
>
> Let me know what you think and if there is anything else I can provide.
>
> As always, thank you so much for your help.  You have been a great help to
> me in narrowing down issues.

I spent some time reading the captures and found nothing abnormal in them. Do
you have any indication of a faulty session or request ?

Also I noticed that you took the captures on the server itself and that the
server has TSO enabled since we're seeing large frames. It would be possible
that there's a bug in the network driver or NIC causing some frames to be lost
for example. Maybe the same trace taken on the haproxy server at the same time
would reveal some extra information.

Note: you don't need to post the whole file to the list; there are about 800
people who are probably not interested in receiving this 6MB file :-)

Either you can put it on a public server, or you can simply send it privately
to me.

Thanks,
Willy




RE: HAProxy Hang during initial connection

2014-05-28 Thread JDzialo John
Hi Willy

Thanks, I'll send future traces to you directly.  I understand the hatred of 
bulky email files!

So I think I found the problem but would love your take on it.

Our web applications and services in our haproxy backend are using keepalive in 
their connection headers.  I understand in haproxy v1.4 keepalives are ok from 
the client side but not from the server side, correct?

So I added option http-server-close on our haproxy web service server, and
it appears to have stopped this random half-loaded data stream issue.

Can you explain how having keepalive coming from the server side application 
connection headers could cause this issue?

Could you give a brief description of what happens when haproxy receives a 
keepalive header but does not have http-server-close option set?

Thanks as always for your help!







Re: HAProxy Hang during initial connection

2014-05-28 Thread Willy Tarreau
Hi John,

On Wed, May 28, 2014 at 07:54:20PM +, JDzialo John wrote:
> Hi Willy
>
> Thanks, I'll send future traces to you directly.  I understand the hatred
> of bulky email files!
>
> So I think I found the problem but would love your take on it.
>
> Our web applications and services in our haproxy backend are using keepalive
> in their connection headers.  I understand in haproxy v1.4 keepalives are ok
> from the client side but not from the server side, correct?
>
> So I added option http-server-close on our haproxy web service server,
> and it appears to have stopped this random half-loaded data stream issue.
>
> Can you explain how having keepalive coming from the server side application
> connection headers could cause this issue?
>
> Could you give a brief description of what happens when haproxy receives a
> keepalive header but does not have the http-server-close option set?

OK, so the client is not a browser, right?

When you're using option "httpclose" only, haproxy just modifies the headers
to add "close" to the request and to the response, but does not perform any
active close. I observed in the distant past (8 years ago) that some servers
would not honor the close and would expect the client to close after they
get the complete response. Then I added option "forceclose" to close the
server-side connection once the server starts to respond. By now we have a
much more complete message parser which knows where the end is and which
actively closes the connections as soon as you enable "http-server-close" or
"http-keep-alive" (the latter is not available in 1.4).

So what you found now is that your server ignores the close and expects
the client to close, while the client expects the same from the server.
Additionally it's possible that your client uses excess buffering and does
not get the whole response until it gets the real close (possibly the one
caused by haproxy's timeout).

Usually these situations are detectable in haproxy's logs because you get
very large total transfer times for a small server response time, and you
notice that all transfer times are very close to the client or server
timeout. Then you know that the connection remained open for that amount
of time.

Anyway, please use http-server-close or forceclose; it will do what you
need by enforcing the close. I hope you get a clearer picture now.
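
In 1.4 configuration terms that is just (minimal sketch):

  defaults
      mode http
      option http-server-close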

Regards,
Willy




Re: HAProxy Hang during initial connection

2014-05-18 Thread Willy Tarreau
Hi John,

On Thu, May 15, 2014 at 05:49:59PM +, JDzialo John wrote:
> Hi Guys,
>
> We have been using haproxy and have loved it and appreciate all the hard
> work you have put into this great product.
>
> I am new to the project and still trying to grasp its complexities, so
> forgive me in advance for any ignorance.
>
> We have been running haproxy in front of a web farm since January: HAProxy
> 1.4.24 on a Debian 7 server performing a round robin balance across 6 web
> servers using IIS 7.5 and hosting a .NET application.  All hosted in AWS.
>
> Recently we have started to see a strange issue arise.  Every once in a
> while a browser request or a web service call will hang until our 10 minute
> client timeout is hit and the request fails.
>
> Using fiddler for testing, once in a while I see a request initiate but
> never make the connection to a back-end server.  The request hangs and
> eventually fiddler reports a content length mismatch, as our header
> declared a certain amount of data but the client only received a fraction
> of it.

So it sounds like your server is sending a wrong content-length.

> The issue is random but happens pretty consistently throughout the day.
>
> This just started a few weeks ago and there have been no changes to our
> HAProxy config since February.

Which would be consistent with a recent change on the server :-)

> Below find our configuration.
>
> # Global config
> global
> log 127.0.0.1 local0
> log 127.0.0.1 local1 notice
> maxconn 10
> stats socket /var/run/haproxy.sock mode 0600 level admin
> user haproxy
> group haproxy
> daemon
>
> # Default config
> defaults
> log global
> mode http
> option httplog
> option dontlognull
> option redispatch
> option forwardfor
> option httpclose
> option abortonclose
> retries 1
> timeout connect 5000
> timeout client 5
> timeout server 5
>
> listen stats
> #disabled
> bind *:
> stats enable
> stats uri /haproxy?stats
> stats realm Strictly\ Private
> stats auth :xx
>
> frontend unsecured *:80
> timeout client 60
>
> default_backend web
>
> backend web
> timeout server 60
> balance roundrobin
>
> server web1 xxx.xxx.xxx.xxx:80 check
> server web2 xxx.xxx.xxx.xxx:80 check
> server web3 xxx.xxx.xxx.xxx:80 check
> server web4 xxx.xxx.xxx.xxx:80 check
> server web5 xxx.xxx.xxx.xxx:80 check
> server web6 xxx.xxx.xxx.xxx:80 check
>
> Is there anything in our configuration that could cause this weird behavior?
> Or anything I could add?

No, haproxy will use content-length just like your client to know the
response length, but will not modify it.

> How about kernel settings in sysctl?  What are the optimal settings to run
> a haproxy server?

These are totally unrelated. Here you're having a problem with less data
being returned than advertised. So very likely your server is sending
some abnormal contents. Have you tried taking a network capture between
haproxy and the server to verify this ?

Note that another explanation could be totally unrelated to content length
and could simply be a bug causing haproxy to actually stop receiving or
emitting data, and closing the connection after the timeout. That would
also explain why your client receives less data than indicated in the
content-length header. But if so, you should be able to tell whether or
not the contents have been truncated.

However I'm not seeing any known bug looking like this in 1.4.24, so while
possible, this seems very strange since most people are using 1.4 :-/

> Any help you can give would be really appreciated.
>
> Please let me know if there is anything else I can provide.

Please try to take a capture of the response from the server to haproxy so
that we can check whether the response size is correct and/or if any special
event happens. Please use something like this (assuming all your traffic
flows through interface eth0):

 tcpdump -s0 -npi eth0 -w trace-server.cap src server-ip

If you have enough space on your disk, it can be even better to also capture
the haproxy-to-client traffic:

 tcpdump -s0 -npi eth0 -w trace-client.cap dst client-ip

Best regards,
Willy




HAProxy Hang during initial connection

2014-05-15 Thread JDzialo John
Hi Guys,

We have been using haproxy and have loved it and appreciate all the hard work 
you have put into this great product.

I am new to the project and still trying to grasp its complexities, so forgive
me in advance for any ignorance.

We have been running haproxy in front of a web farm since January.  HAProxy
1.4.24 on a Debian 7 server performing a round robin balance across 6 web 
servers using IIS 7.5 and hosting a .NET application.   All hosted in AWS.

Recently we have started to see a strange issue arise.  Every once in a while a 
browser request or a web service call will hang until our 10 minute client 
timeout is hit and the request fails.

Using fiddler for testing, once in a while I see a request initiate but never
make the connection to a back-end server.  The request hangs and eventually
fiddler reports a content length mismatch, as our header declared a certain
amount of data but the client only received a fraction of it.

The issue is random but happens pretty consistently throughout the day.

This just started a few weeks ago and there were no changes on our HAProxy 
config made since February.

Below find our configuration.

# Global config
global
log 127.0.0.1 local0
log 127.0.0.1 local1 notice
maxconn 10
stats socket /var/run/haproxy.sock mode 0600 level admin
user haproxy
group haproxy
daemon

# Default config
defaults
log global
mode http
option httplog
option dontlognull
option redispatch
option forwardfor
option httpclose
option abortonclose
retries 1
timeout connect 5000
timeout client 5
timeout server 5

listen stats
#disabled
bind *:
stats enable
stats uri /haproxy?stats
stats realm Strictly\ Private
stats auth :xx

frontend unsecured *:80
timeout client 60

default_backend web

backend web
timeout server 60
balance roundrobin

server web1 xxx.xxx.xxx.xxx:80 check
server web2 xxx.xxx.xxx.xxx:80 check
server web3 xxx.xxx.xxx.xxx:80 check
server web4 xxx.xxx.xxx.xxx:80 check
server web5 xxx.xxx.xxx.xxx:80 check
server web6 xxx.xxx.xxx.xxx:80 check

Is there anything in our configuration that could cause this weird behavior?  
Or anything I could add?

How about kernel settings in sysctl?  What are the optimal settings to run a 
haproxy server?

Any help you can give would be really appreciated.

Please let me know if there is anything else I can provide.






John Dzialo | Linux System Administrator
Direct 203.783.8163 | Main 800.352.0050

Environmental Data Resources, Inc.
440 Wheelers Farms Road, Milford, CT 06461
www.edrnet.com | commonground.edrnet.com
