Re: Just a simple thought on health checks after a soft reload of HAProxy....

2014-02-24 Thread Baptiste
Hi Malcolm,

Hence the retry and redispatch options :)
I know it's a dirty workaround.

Baptiste
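
For reference, the workaround Baptiste mentions is the retries/redispatch pair
in haproxy.cfg; a minimal sketch, with illustrative values:

    defaults
        mode http
        timeout connect 5s
        retries 3
        # allow the last retry to be sent to another server instead of
        # the one whose connection just failed
        option redispatch

This doesn't stop the first attempt from going to a dead, not-yet-checked
server; it only lets HAProxy recover by retrying elsewhere.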


On Sun, Feb 23, 2014 at 8:42 PM, Malcolm Turnbull
malc...@loadbalancer.org wrote:
 Neil,

 Yes, peers are great for passing stick tables to the new HAProxy
 instance and any current connections bound to the old process will be
 fine.
 However, any new connections will hit the new HAProxy process, and if
 the backend server is down but haproxy hasn't health-checked it yet,
 then the user will hit a failed server.



 On 23 February 2014 10:38, Neil n...@iamafreeman.com wrote:
 Hello

 Regarding restarts, rather than cold starts, if you configure peers the
 state from before the restart should be kept. The new process haproxy
 creates is automatically a peer to the existing process and gets the state
 as it was.

 Neil

 On 23 Feb 2014 03:46, Patrick Hemmer hapr...@stormcloud9.net wrote:




 
 From: Sok Ann Yap sok...@gmail.com
 Sent: 2014-02-21 05:11:48 E
 To: haproxy@formilux.org
 Subject: Re: Just a simple thought on health checks after a soft reload of
 HAProxy

 Patrick Hemmer haproxy@... writes:

   From: Willy Tarreau w at 1wt.eu

   Sent:  2014-01-25 05:45:11 E

 Till now that's exactly what's currently done. The servers are marked
 almost dead, so the first check gives the verdict. Initially we had
 all checks started immediately. But it caused a lot of issues at several
 places where there were a high number of backends or servers mapped to
 the same hardware, because the rush of connections really caused the
 servers to be flagged as down. So we started to spread the checks over
 the longest check period in a farm.

 Is there a way to enable this behavior? In my
 environment/configuration, it causes absolutely no issue that all
 the checks be fired off at the same time.
 As it is right now, when haproxy starts up, it takes it quite a
 while to discover which servers are down.
 -Patrick

 I faced the same problem in http://thread.gmane.org/
 gmane.comp.web.haproxy/14644

 After much contemplation, I decided to just patch away the initial spread
 check behavior: https://github.com/sayap/sayap-overlay/blob/master/net-
 proxy/haproxy/files/haproxy-immediate-first-check.diff



 I definitely think there should be an option to disable the behavior. We
 have an automated system which adds and removes servers from the config, and
 then bounces haproxy. Every time haproxy is bounced, we have a period where
 it can send traffic to a dead server.


 There's also a related bug on this.
 The bug is that when I have a config with inter 30s fastinter 1s and no
 httpchk enabled, when haproxy first starts up, it spreads the checks over
 the period defined as fastinter, but the stats output says UP 1/3 for the
 full 30 seconds. It also says L4OK in 30001ms, when I know it doesn't take
 the server 30 seconds to simply accept a connection.
 Yet you get different behavior when using httpchk. When I add option
 httpchk, it still spreads the checks over the 1s fastinter value, but the
 stats output goes full UP immediately after the check occurs, not UP
 1/3. It also says L7OK/200 in 0ms, which is what I expect to see.

 -Patrick





 --
 Regards,

 Malcolm Turnbull.

 Loadbalancer.org Ltd.
 Phone: +44 (0)870 443 8779
 http://www.loadbalancer.org/




Re: Just a simple thought on health checks after a soft reload of HAProxy....

2014-02-24 Thread Patrick Hemmer
Unfortunately retry doesn't work in our case as we run haproxy on 2
layers, frontend servers and backend servers (to distribute traffic
among multiple processes on each server). So when an app on a server
goes down, the haproxy on that server is still up and accepting
connections, but the layer 7 http checks from the frontend haproxy are
failing. But since the backend haproxy is still accepting connections,
the retry option does not work.

-Patrick
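
A rough sketch of the first-tier configuration in the two-layer setup Patrick
describes; names, addresses and the check URL are made up. The point is that
each "server" here is itself a second-tier HAProxy that keeps accepting TCP
connections even when the application behind it is dead, so only a layer 7
check sees the failure and a connection-level retry never triggers:

    backend app_tier
        option httpchk GET /health
        # each entry is a second-tier haproxy distributing to local app processes
        server app1 10.0.0.11:8080 check inter 2s fall 2 rise 2
        server app2 10.0.0.12:8080 check inter 2s fall 2 rise 2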


*From: *Baptiste bed...@gmail.com
*Sent: * 2014-02-24 07:18:00 E
*To: *Malcolm Turnbull malc...@loadbalancer.org
*CC: *Neil n...@iamafreeman.com, Patrick Hemmer
hapr...@stormcloud9.net, HAProxy haproxy@formilux.org
*Subject: *Re: Just a simple thought on health checks after a soft
reload of HAProxy

 Hi Malcolm,

 Hence the retry and redispatch options :)
 I know it's a dirty workaround.

 Baptiste


 On Sun, Feb 23, 2014 at 8:42 PM, Malcolm Turnbull
 malc...@loadbalancer.org wrote:
 Neil,

 Yes, peers are great for passing stick tables to the new HAProxy
 instance and any current connections bound to the old process will be
 fine.
 However, any new connections will hit the new HAProxy process, and if
 the backend server is down but haproxy hasn't health-checked it yet,
 then the user will hit a failed server.



 On 23 February 2014 10:38, Neil n...@iamafreeman.com wrote:
 Hello

 Regarding restarts, rather than cold starts, if you configure peers the
 state from before the restart should be kept. The new process haproxy
 creates is automatically a peer to the existing process and gets the state
 as it was.

 Neil

 On 23 Feb 2014 03:46, Patrick Hemmer hapr...@stormcloud9.net wrote:



 
 From: Sok Ann Yap sok...@gmail.com
 Sent: 2014-02-21 05:11:48 E
 To: haproxy@formilux.org
 Subject: Re: Just a simple thought on health checks after a soft reload of
 HAProxy

 Patrick Hemmer haproxy@... writes:

   From: Willy Tarreau w at 1wt.eu

   Sent:  2014-01-25 05:45:11 E

 Till now that's exactly what's currently done. The servers are marked
 almost dead, so the first check gives the verdict. Initially we had
 all checks started immediately. But it caused a lot of issues at several
 places where there were a high number of backends or servers mapped to
 the same hardware, because the rush of connections really caused the
 servers to be flagged as down. So we started to spread the checks over
 the longest check period in a farm.

 Is there a way to enable this behavior? In my
 environment/configuration, it causes absolutely no issue that all
 the checks be fired off at the same time.
 As it is right now, when haproxy starts up, it takes it quite a
 while to discover which servers are down.
 -Patrick

 I faced the same problem in http://thread.gmane.org/
 gmane.comp.web.haproxy/14644

 After much contemplation, I decided to just patch away the initial spread
 check behavior: https://github.com/sayap/sayap-overlay/blob/master/net-
 proxy/haproxy/files/haproxy-immediate-first-check.diff



 I definitely think there should be an option to disable the behavior. We
 have an automated system which adds and removes servers from the config, and
 then bounces haproxy. Every time haproxy is bounced, we have a period where
 it can send traffic to a dead server.


 There's also a related bug on this.
 The bug is that when I have a config with inter 30s fastinter 1s and no
 httpchk enabled, when haproxy first starts up, it spreads the checks over
 the period defined as fastinter, but the stats output says UP 1/3 for the
 full 30 seconds. It also says L4OK in 30001ms, when I know it doesn't take
 the server 30 seconds to simply accept a connection.
 Yet you get different behavior when using httpchk. When I add option
 httpchk, it still spreads the checks over the 1s fastinter value, but the
 stats output goes full UP immediately after the check occurs, not UP
 1/3. It also says L7OK/200 in 0ms, which is what I expect to see.

 -Patrick



 --
 Regards,

 Malcolm Turnbull.

 Loadbalancer.org Ltd.
 Phone: +44 (0)870 443 8779
 http://www.loadbalancer.org/




Re: Just a simple thought on health checks after a soft reload of HAProxy....

2014-02-23 Thread Neil
Hello

Regarding restarts, rather than cold starts, if you configure peers the
state from before the restart should be kept. The new process haproxy
creates is automatically a peer to the existing process and gets the state
as it was.

Neil
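
A minimal sketch of the peers setup Neil is referring to; peer names and
addresses are illustrative, and the local peer name has to match the host
name (or the -L command-line option):

    peers mypeers
        peer lb1 192.168.0.10:1024
        peer lb2 192.168.0.11:1024

    backend app
        stick-table type ip size 200k expire 30m peers mypeers
        stick on src
        server web1 192.168.0.21:80 check
        server web2 192.168.0.22:80 check

Note that, as Malcolm points out elsewhere in the thread, this carries the
stick-table contents across the reload, not the servers' health-check state.
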
On 23 Feb 2014 03:46, Patrick Hemmer hapr...@stormcloud9.net wrote:




 --
 *From: *Sok Ann Yap sok...@gmail.com
 *Sent: * 2014-02-21 05:11:48 E
 *To: *haproxy@formilux.org
 *Subject: *Re: Just a simple thought on health checks after a soft reload
 of HAProxy

  Patrick Hemmer haproxy@... writes:


From: Willy Tarreau w at 1wt.eu

   Sent:  2014-01-25 05:45:11 E

 Till now that's exactly what's currently done. The servers are marked
 almost dead, so the first check gives the verdict. Initially we had
 all checks started immediately. But it caused a lot of issues at several
 places where there were a high number of backends or servers mapped to
 the same hardware, because the rush of connections really caused the
 servers to be flagged as down. So we started to spread the checks over
 the longest check period in a farm.

 Is there a way to enable this behavior? In my
 environment/configuration, it causes absolutely no issue that all
 the checks be fired off at the same time.
 As it is right now, when haproxy starts up, it takes it quite a
 while to discover which servers are down.
 -Patrick



 I faced the same problem in http://thread.gmane.org/
 gmane.comp.web.haproxy/14644

 After much contemplation, I decided to just patch away the initial spread
 check behavior: https://github.com/sayap/sayap-overlay/blob/master/net-
 proxy/haproxy/files/haproxy-immediate-first-check.diff




 I definitely think there should be an option to disable the behavior. We
 have an automated system which adds and removes servers from the config,
 and then bounces haproxy. Every time haproxy is bounced, we have a period
 where it can send traffic to a dead server.


 There's also a related bug on this.
 The bug is that when I have a config with inter 30s fastinter 1s and no
 httpchk enabled, when haproxy first starts up, it spreads the checks over
 the period defined as fastinter, but the stats output says UP 1/3 for the
 full 30 seconds. It also says L4OK in 30001ms, when I know it doesn't
 take the server 30 seconds to simply accept a connection.
 Yet you get different behavior when using httpchk. When I add option
 httpchk, it still spreads the checks over the 1s fastinter value, but the
 stats output goes full UP immediately after the check occurs, not UP
 1/3. It also says L7OK/200 in 0ms, which is what I expect to see.

 -Patrick




Re: Just a simple thought on health checks after a soft reload of HAProxy....

2014-02-23 Thread Malcolm Turnbull
Neil,

Yes, peers are great for passing stick tables to the new HAProxy
instance and any current connections bound to the old process will be
fine.
However, any new connections will hit the new HAProxy process, and if
the backend server is down but haproxy hasn't health-checked it yet,
then the user will hit a failed server.



On 23 February 2014 10:38, Neil n...@iamafreeman.com wrote:
 Hello

 Regarding restarts, rather than cold starts, if you configure peers the
 state from before the restart should be kept. The new process haproxy
 creates is automatically a peer to the existing process and gets the state
 as it was.

 Neil

 On 23 Feb 2014 03:46, Patrick Hemmer hapr...@stormcloud9.net wrote:




 
 From: Sok Ann Yap sok...@gmail.com
 Sent: 2014-02-21 05:11:48 E
 To: haproxy@formilux.org
 Subject: Re: Just a simple thought on health checks after a soft reload of
 HAProxy

 Patrick Hemmer haproxy@... writes:

   From: Willy Tarreau w at 1wt.eu

   Sent:  2014-01-25 05:45:11 E

 Till now that's exactly what's currently done. The servers are marked
 almost dead, so the first check gives the verdict. Initially we had
 all checks started immediately. But it caused a lot of issues at several
 places where there were a high number of backends or servers mapped to
 the same hardware, because the rush of connections really caused the
 servers to be flagged as down. So we started to spread the checks over
 the longest check period in a farm.

 Is there a way to enable this behavior? In my
 environment/configuration, it causes absolutely no issue that all
 the checks be fired off at the same time.
 As it is right now, when haproxy starts up, it takes it quite a
 while to discover which servers are down.
 -Patrick

 I faced the same problem in http://thread.gmane.org/
 gmane.comp.web.haproxy/14644

 After much contemplation, I decided to just patch away the initial spread
 check behavior: https://github.com/sayap/sayap-overlay/blob/master/net-
 proxy/haproxy/files/haproxy-immediate-first-check.diff



 I definitely think there should be an option to disable the behavior. We
 have an automated system which adds and removes servers from the config, and
 then bounces haproxy. Every time haproxy is bounced, we have a period where
 it can send traffic to a dead server.


 There's also a related bug on this.
 The bug is that when I have a config with inter 30s fastinter 1s and no
 httpchk enabled, when haproxy first starts up, it spreads the checks over
 the period defined as fastinter, but the stats output says UP 1/3 for the
 full 30 seconds. It also says L4OK in 30001ms, when I know it doesn't take
 the server 30 seconds to simply accept a connection.
 Yet you get different behavior when using httpchk. When I add option
 httpchk, it still spreads the checks over the 1s fastinter value, but the
 stats output goes full UP immediately after the check occurs, not UP
 1/3. It also says L7OK/200 in 0ms, which is what I expect to see.

 -Patrick





-- 
Regards,

Malcolm Turnbull.

Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/



Re: Just a simple thought on health checks after a soft reload of HAProxy....

2014-02-22 Thread Patrick Hemmer
 



*From: *Sok Ann Yap sok...@gmail.com
*Sent: * 2014-02-21 05:11:48 E
*To: *haproxy@formilux.org
*Subject: *Re: Just a simple thought on health checks after a soft
reload of HAProxy

 Patrick Hemmer haproxy@... writes:

   From: Willy Tarreau w at 1wt.eu

   Sent:  2014-01-25 05:45:11 E

 Till now that's exactly what's currently done. The servers are marked
 almost dead, so the first check gives the verdict. Initially we had
 all checks started immediately. But it caused a lot of issues at several
 places where there were a high number of backends or servers mapped to
 the same hardware, because the rush of connections really caused the
 servers to be flagged as down. So we started to spread the checks over
 the longest check period in a farm. 

 Is there a way to enable this behavior? In my
 environment/configuration, it causes absolutely no issue that all
 the checks be fired off at the same time.
 As it is right now, when haproxy starts up, it takes it quite a
 while to discover which servers are down.
 -Patrick

 I faced the same problem in http://thread.gmane.org/
 gmane.comp.web.haproxy/14644

 After much contemplation, I decided to just patch away the initial spread 
 check behavior: https://github.com/sayap/sayap-overlay/blob/master/net-
 proxy/haproxy/files/haproxy-immediate-first-check.diff



I definitely think there should be an option to disable the behavior. We
have an automated system which adds and removes servers from the config,
and then bounces haproxy. Every time haproxy is bounced, we have a
period where it can send traffic to a dead server.


There's also a related bug on this.
The bug is that when I have a config with inter 30s fastinter 1s and
no httpchk enabled, when haproxy first starts up, it spreads the checks
over the period defined as fastinter, but the stats output says UP 1/3
for the full 30 seconds. It also says L4OK in 30001ms, when I know it
doesn't take the server 30 seconds to simply accept a connection.
Yet you get different behavior when using httpchk. When I add option
httpchk, it still spreads the checks over the 1s fastinter value, but
the stats output goes full UP immediately after the check occurs, not
UP 1/3. It also says L7OK/200 in 0ms, which is what I expect to see.

-Patrick
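
For context, the configuration being described boils down to a server line
like the following (names, address and check URL are illustrative); inter is
the check interval while the server's state is stable, and fastinter is the
interval used while it is transitioning between states:

    backend app
        # with "option httpchk GET /health" the checks become layer 7 (L7OK/200);
        # without it they are plain TCP connects (L4OK)
        server web1 10.0.0.21:80 check inter 30s fastinter 1s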



Re: Just a simple thought on health checks after a soft reload of HAProxy....

2014-01-28 Thread Kevin Burke
This is also an issue for us (see my post from a few days ago) - on
HAProxy's first start, most hosts are marked DOWN with a Layer4 timeout,
even though they are fine, because there are a large number of them.

Some workaround or more forgiving initial health check would be useful here.



Kevin Burke | 415-723-4116 | www.twilio.com


On Tue, Jan 28, 2014 at 8:13 AM, Patrick Hemmer hapr...@stormcloud9.net wrote:

  *From: *Willy Tarreau w...@1wt.eu
 *Sent: * 2014-01-25 05:45:11 E
 *To: *Patrick Hemmer hapr...@stormcloud9.net
 *CC: *Malcolm Turnbull malc...@loadbalancer.org,
 haproxy@formilux.org
 *Subject: *Re: Just a simple thought on health checks after a soft reload
 of HAProxy

  On Tue, Jan 21, 2014 at 09:04:12PM -0500, Patrick Hemmer wrote:

  Personally I would not like that every server is considered down until
 after the health checks pass. Basically this would result in things
 being down after a reload, which defeats the point of the reload being
 non-interruptive.

  I can confirm, we had this in a very early version, something like 1.0.x
 and it was quickly changed! I've been using Alteon load balancers for
 years and their health checks are slow. I remember that the persons in
 charge of them were always scared to reboot them because the services
 remained down for a long time after a reboot (seconds to minutes). So
 we definitely don't want this to happen here.


  I can think of 2 possible solutions:
 1) When the new process comes up, do an initial check on all servers
 (just one) which have checks enabled. Use that one check as the verdict
 for whether each server should be marked 'up' or 'down'.

  Till now that's exactly what's currently done. The servers are marked
 almost dead, so the first check gives the verdict. Initially we had
 all checks started immediately. But it caused a lot of issues at several
 places where there were a high number of backends or servers mapped to
 the same hardware, because the rush of connections really caused the
 servers to be flagged as down. So we started to spread the checks over
 the longest check period in a farm.


 Is there a way to enable this behavior? In my environment/configuration,
 it causes absolutely no issue that all the checks be fired off at the same
 time.
 As it is right now, when haproxy starts up, it takes it quite a while to
 discover which servers are down.

 -Patrick



Re: Just a simple thought on health checks after a soft reload of HAProxy....

2014-01-25 Thread Willy Tarreau
On Tue, Jan 21, 2014 at 09:04:12PM -0500, Patrick Hemmer wrote:
 Personally I would not like that every server is considered down until
 after the health checks pass. Basically this would result in things
 being down after a reload, which defeats the point of the reload being
 non-interruptive.

I can confirm, we had this in a very early version, something like 1.0.x
and it was quickly changed! I've been using Alteon load balancers for
years and their health checks are slow. I remember that the persons in
charge of them were always scared to reboot them because the services
remained down for a long time after a reboot (seconds to minutes). So
we definitely don't want this to happen here.

 I can think of 2 possible solutions:
 1) When the new process comes up, do an initial check on all servers
 (just one) which have checks enabled. Use that one check as the verdict
 for whether each server should be marked 'up' or 'down'.

Till now that's exactly what's currently done. The servers are marked
almost dead, so the first check gives the verdict. Initially we had
all checks started immediately. But it caused a lot of issues at several
places where there were a high number of backends or servers mapped to
the same hardware, because the rush of connections really caused the
servers to be flagged as down. So we started to spread the checks over
the longest check period in a farm.

 After each
 server has been checked once, then signal the other process to shut down
 and start listening.

It is not really possible unfortunately, because we have to bind before
the fork (before losing privileges), and the poll loop cannot be used
before the fork.

 2) Use the stats socket (if enabled) to pull the stats from the previous
 process. Use its health check data to pre-populate the health data of
 the new process. This one has a few drawbacks though. The server and
 backend names must match between the old and new config, and the stats
 socket has to be enabled. It would probably be harder to code as well,
 but I really don't know on that.

There was an old thread many years ago on this list where a somewhat
similar solution was proposed, which was quite simple but nobody worked
on it. The idea was to dump the servers status from the shutdown script
to a file upon reload, and to pass that file to the new process so that
it could parse it and find the relevant information there.

I must say I liked the principle because it could also be used as a
configuration trick to force certain servers' states at boot without
touching the configuration file for example.

I think it can easily be done for basic purposes. The issue is always
with adding/removing/renaming servers.

Right now the official server identifier is its numeric ID which can
be forced (useful for APIs and SNMP) or automatically assigned. Peers
use these IDs for state table synchronization for example. Ideally,
upon a reload, we should consider that IDs are used if they're forced,
otherwise names are used. That would cover only addition/removal when
IDs are not set, and renaming as well when IDs are set. And this works
for frontend and backends as well. Currently we don't have the
information saying that an ID was manually assigned, but it is a very
minor detail to add!
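
For illustration, "forced" IDs are set with the id keyword on proxies and
servers (the numbers below are made up); anything without an explicit id gets
one assigned automatically:

    backend app
        id 20
        server web1 10.0.0.21:80 id 101 check
        server web2 10.0.0.22:80 id 102 check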

Willy
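
Later HAProxy releases (1.6 and up) implement essentially the mechanism
described above: the reload script dumps the old process's server states to a
file, and the new process reads it back at startup. A minimal sketch, with
illustrative paths:

    global
        stats socket /var/run/haproxy.sock mode 600 level admin
        server-state-file /var/lib/haproxy/server-state

    defaults
        load-server-state-from-file global

    # in the reload script, before starting the new process:
    #   echo "show servers state" | socat stdio /var/run/haproxy.sock \
    #       > /var/lib/haproxy/server-state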




Re: Just a simple thought on health checks after a soft reload of HAProxy....

2014-01-21 Thread Patrick Hemmer

*From: *Malcolm Turnbull malc...@loadbalancer.org
*Sent: * 2014-01-14 07:13:27 E
*To: *haproxy@formilux.org haproxy@formilux.org
*Subject: *Just a simple thought on health checks after a soft reload of
HAProxy

 Just a simple thought on health checks after a soft reload of HAProxy

 If for example you had several backend servers one of which had crashed...
 Then you make a configuration change to HAProxy and soft reload,
 for instance adding a new backend server.

 All the servers are instantly brought up and available for traffic
 (including the crashed one).
 So traffic will possibly be sent to a broken server...

 Obviously it's only a small problem as it is fixed as soon as the
 health check actually runs...

 But I was just wondering is there a way of saying don't bring up a
 server until it passes a health check?
I was just thinking of this issue myself and google turned up your post.
Personally I would not like that every server is considered down until
after the health checks pass. Basically this would result in things
being down after a reload, which defeats the point of the reload being
non-interruptive.

I can think of 2 possible solutions:
1) When the new process comes up, do an initial check on all servers
(just one) which have checks enabled. Use that one check as the verdict
for whether each server should be marked 'up' or 'down'. After each
server has been checked once, then signal the other process to shut down
and start listening.
2) Use the stats socket (if enabled) to pull the stats from the previous
process. Use its health check data to pre-populate the health data of
the new process. This one has a few drawbacks though. The server and
backend names must match between the old and new config, and the stats
socket has to be enabled. It would probably be harder to code as well,
but I really don't know on that.

-Patrick
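
A sketch of what pulling state over the stats socket looks like in practice
(socket path illustrative); the status and check columns of "show stat" are
the data a new process would have to import:

    global
        stats socket /var/run/haproxy.sock mode 600 level admin

    # from the reload script or a shell:
    #   echo "show stat" | socat stdio /var/run/haproxy.sock
    # yields one CSV line per frontend/backend/server, including the
    # status, check_status and check_duration fields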


Just a simple thought on health checks after a soft reload of HAProxy....

2014-01-14 Thread Malcolm Turnbull
Just a simple thought on health checks after a soft reload of HAProxy

If for example you had several backend servers one of which had crashed...
Then you make a configuration change to HAProxy and soft reload,
for instance adding a new backend server.

All the servers are instantly brought up and available for traffic
(including the crashed one).
So traffic will possibly be sent to a broken server...

Obviously it's only a small problem as it is fixed as soon as the
health check actually runs...

But I was just wondering is there a way of saying don't bring up a
server until it passes a health check?



-- 
Regards,

Malcolm Turnbull.

Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/