Re: Just a simple thought on health checks after a soft reload of HAProxy....
Hi Malcolm,

Hence the retry and redispatch options :) I know it's a dirty workaround.

Baptiste

On Sun, Feb 23, 2014 at 8:42 PM, Malcolm Turnbull malc...@loadbalancer.org wrote:
> Neil,
>
> Yes, peers are great for passing stick tables to the new HAProxy instance, and any current connections bound to the old process will be fine. However, any new connections will hit the new HAProxy process, and if a backend server is down but HAProxy hasn't health-checked it yet, the user will hit a failed server.
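For reference, a minimal sketch of the retry/redispatch workaround Baptiste refers to; the backend name, server names and addresses are illustrative:

    backend app
        # retry a failed connection attempt a few times...
        retries 3
        # ...and allow a retried attempt to be redispatched to another
        # server instead of insisting on the one originally chosen
        option redispatch
        server web1 192.0.2.11:80 check
        server web2 192.0.2.12:80 check

This only papers over the problem for connections that fail outright; as Patrick's follow-up in this thread explains, it does nothing when the connection itself still succeeds.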
Re: Just a simple thought on health checks after a soft reload of HAProxy....
Unfortunately retry doesn't work in our case, as we run haproxy on 2 layers: frontend servers and backend servers (to distribute traffic among multiple processes on each server). So when an app on a server goes down, the haproxy on that server is still up and accepting connections, but the layer 7 HTTP checks from the frontend haproxy are failing. But since the backend haproxy is still accepting connections, the retry option does not work.

-Patrick

From: Baptiste bed...@gmail.com
Sent: 2014-02-24 07:18:00 E
To: Malcolm Turnbull malc...@loadbalancer.org
CC: Neil n...@iamafreeman.com, Patrick Hemmer hapr...@stormcloud9.net, HAProxy haproxy@formilux.org
Subject: Re: Just a simple thought on health checks after a soft reload of HAProxy

> Hi Malcolm,
>
> Hence the retry and redispatch options :) I know it's a dirty workaround.
>
> Baptiste
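A hedged sketch of the two-layer arrangement Patrick describes, to illustrate why retries/redispatch never trigger: the frontend-layer HAProxy health-checks the application through the per-server HAProxy, and a plain TCP connect to that proxy still succeeds even when the application behind it is dead. All names, addresses and ports below are illustrative assumptions:

    # frontend-layer haproxy: balances across the application servers
    backend app_servers
        # L7 check goes through the per-server haproxy to the app
        option httpchk GET /health
        # the TCP connect succeeds (the per-server haproxy is up), so
        # retries/redispatch never fire; only the L7 check reports failure
        server app-host1 192.0.2.21:8080 check
        server app-host2 192.0.2.22:8080 check

    # per-server haproxy: fans requests out to local app processes
    listen app_local
        bind :8080
        server proc1 127.0.0.1:9001 check
        server proc2 127.0.0.1:9002 check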
Re: Just a simple thought on health checks after a soft reload of HAProxy....
Hello,

Regarding restarts, rather than cold starts: if you configure peers, the state from before the restart should be kept. The new process haproxy creates is automatically a peer to the existing process and gets the state as it was.

Neil
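A minimal sketch of the peers setup Neil mentions, assuming the usual soft reload (-sf) so the new process synchronises stick-table entries from the old one; the peer name, address and table definition are illustrative:

    peers mypeers
        # the peer name should match this node's hostname (or the -L option)
        peer lb1 192.0.2.1:1024

    backend app
        # stick-table contents are what gets exchanged between peers across
        # a reload; server health-check state is not part of that exchange
        stick-table type ip size 200k expire 30m peers mypeers
        stick on src
        server web1 192.0.2.11:80 check
        server web2 192.0.2.12:80 check

As Malcolm's reply below points out, this preserves stickiness but not the servers' up/down status.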
Re: Just a simple thought on health checks after a soft reload of HAProxy....
Neil,

Yes, peers are great for passing stick tables to the new HAProxy instance, and any current connections bound to the old process will be fine. However, any new connections will hit the new HAProxy process, and if a backend server is down but HAProxy hasn't health-checked it yet, then the user will hit a failed server.

On 23 February 2014 10:38, Neil n...@iamafreeman.com wrote:
> Hello,
>
> Regarding restarts, rather than cold starts: if you configure peers, the state from before the restart should be kept. The new process haproxy creates is automatically a peer to the existing process and gets the state as it was.
>
> Neil

--
Regards,
Malcolm Turnbull.
Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/
Re: Just a simple thought on health checks after a soft reload of HAProxy....
From: Sok Ann Yap sok...@gmail.com
Sent: 2014-02-21 05:11:48 E
To: haproxy@formilux.org
Subject: Re: Just a simple thought on health checks after a soft reload of HAProxy

> Patrick Hemmer haproxy@... writes:
>
>> From: Willy Tarreau w at 1wt.eu
>> Sent: 2014-01-25 05:45:11 E
>>
>>> Till now that's exactly what's currently done. The servers are marked almost dead, so the first check gives the verdict. Initially we had all checks started immediately. But it caused a lot of issues at several places where there were a high number of backends or servers mapped to the same hardware, because the rush of connections really caused the servers to be flagged as down. So we started to spread the checks over the longest check period in a farm.
>>
>> Is there a way to enable this behavior? In my environment/configuration, it causes absolutely no issue that all the checks be fired off at the same time. As it is right now, when haproxy starts up, it takes quite a while to discover which servers are down.
>>
>> -Patrick
>
> I faced the same problem in http://thread.gmane.org/gmane.comp.web.haproxy/14644
>
> After much contemplation, I decided to just patch away the initial spread-check behavior: https://github.com/sayap/sayap-overlay/blob/master/net-proxy/haproxy/files/haproxy-immediate-first-check.diff

I definitely think there should be an option to disable the behavior. We have an automated system which adds and removes servers from the config, and then bounces haproxy. Every time haproxy is bounced, we have a period where it can send traffic to a dead server.

There's also a related bug on this. When I have a config with "inter 30s fastinter 1s" and no httpchk enabled, haproxy spreads the checks at startup over the period defined as fastinter, but the stats output says "UP 1/3" for the full 30 seconds. It also says "L4OK in 30001ms", when I know it doesn't take the server 30 seconds to simply accept a connection.

Yet you get different behavior when using httpchk. When I add "option httpchk", it still spreads the checks over the 1s fastinter value, but the stats output goes to full UP immediately after the check occurs, not "UP 1/3". It also says "L7OK/200 in 0ms", which is what I expect to see.

-Patrick
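For context, a sketch of the kind of server configuration Patrick is describing; the interval values come from his message, while the backend/server names are illustrative and "rise 3" is assumed to match the "1/3" in his stats output:

    backend app
        # with no "option httpchk", checks are plain TCP connects (L4);
        # at startup the stats page reports "UP 1/3" and "L4OK in ~30001ms"
        default-server inter 30s fastinter 1s rise 3
        server web1 192.0.2.11:80 check

        # adding an HTTP check changes the startup reporting: the first
        # passing check shows "L7OK/200 in 0ms" and the server goes to UP
        # option httpchk GET /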
Re: Just a simple thought on health checks after a soft reload of HAProxy....
This is also an issue for us (see my post from a few days ago): on HAProxy's first start, most hosts are marked DOWN with a Layer4 timeout, even though they are fine, because there are a large number of them. Some workaround or a more forgiving initial health check would be useful here.

Kevin Burke | 415-723-4116 | www.twilio.com

On Tue, Jan 28, 2014 at 8:13 AM, Patrick Hemmer hapr...@stormcloud9.net wrote:
>> Initially we had all checks started immediately. But it caused a lot of issues at several places where there were a high number of backends or servers mapped to the same hardware, because the rush of connections really caused the servers to be flagged as down. So we started to spread the checks over the longest check period in a farm.
>
> Is there a way to enable this behavior? In my environment/configuration, it causes absolutely no issue that all the checks be fired off at the same time. As it is right now, when haproxy starts up, it takes quite a while to discover which servers are down.
>
> -Patrick
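A sketch of the kind of "more forgiving" tuning Kevin asks about, under the assumption that the startup DOWN flaps come from the check timeout budget rather than from the servers themselves; all values are illustrative:

    defaults
        # give each health check more time before it counts as a Layer4 timeout
        timeout check 10s

    backend app
        # require several consecutive failures before marking a server DOWN
        default-server inter 5s fall 5 rise 2
        server web1 192.0.2.11:80 check

This does not change the initial spreading of checks the thread is about; it only softens the effect of a slow or queued first check.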
Re: Just a simple thought on health checks after a soft reload of HAProxy....
On Tue, Jan 21, 2014 at 09:04:12PM -0500, Patrick Hemmer wrote:
> Personally I would not like that every server is considered down until after the health checks pass. Basically this would result in things being down after a reload, which defeats the point of the reload being non-interruptive.

I can confirm, we had this in a very early version, something like 1.0.x, and it was quickly changed! I've been using Alteon load balancers for years and their health checks are slow. I remember that the people in charge of them were always scared to reboot them because the services remained down for a long time after a reboot (seconds to minutes). So we definitely don't want this to happen here.

> I can think of 2 possible solutions:
>
> 1) When the new process comes up, do an initial check on all servers (just one) which have checks enabled. Use that one check as the verdict for whether each server should be marked 'up' or 'down'.

Till now that's exactly what's currently done. The servers are marked almost dead, so the first check gives the verdict. Initially we had all checks started immediately. But it caused a lot of issues at several places where there were a high number of backends or servers mapped to the same hardware, because the rush of connections really caused the servers to be flagged as down. So we started to spread the checks over the longest check period in a farm.

> After each server has been checked once, then signal the other process to shut down and start listening.

It is not really possible unfortunately, because we have to bind before the fork (before losing privileges), and the poll loop cannot be used before the fork.

> 2) Use the stats socket (if enabled) to pull the stats from the previous process. Use its health check data to pre-populate the health data of the new process. This one has a few drawbacks though: the server/backend names must match between the old and new config, and the stats socket has to be enabled. It would probably be harder to code as well, but I really don't know on that.

There was an old thread many years ago on this list where a somewhat similar solution was proposed, which was quite simple, but nobody worked on it. The idea was to dump the servers' status from the shutdown script to a file upon reload, and to pass that file to the new process so that it could parse it and find the relevant information there. I must say I liked the principle because it could also be used as a configuration trick to force certain servers' states at boot without touching the configuration file, for example. I think it can easily be done for basic purposes.

The issue is always with adding/removing/renaming servers. Right now the official server identifier is its numeric ID, which can be forced (useful for APIs and SNMP) or automatically assigned. Peers use these IDs for state table synchronization, for example. Ideally, upon a reload, we should consider that IDs are used if they're forced, otherwise names are used. That would cover only addition/removal when IDs are not set, and renaming as well when IDs are set. And this works for frontends and backends as well. Currently we don't have the information saying that an ID was manually assigned, but it is a very minor detail to add!

Willy
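Willy's file-based idea is sketched below for illustration; the socket and file paths are assumptions, and the directives shown (show servers state, server-state-file, load-server-state-from-file) only exist in HAProxy versions newer than the one discussed here (1.6 and later):

    # reload wrapper: dump the old process's server states before reloading
    echo "show servers state" | socat stdio /var/run/haproxy.sock \
        > /var/lib/haproxy/server-state

    # haproxy.cfg
    global
        stats socket /var/run/haproxy.sock mode 600 level admin
        server-state-file /var/lib/haproxy/server-state

    backend app
        # the new process seeds each server's state (including DOWN) from the file
        load-server-state-from-file global
        server web1 192.0.2.11:80 check

As Willy notes, matching entries across reloads is the hard part when servers are added, removed or renamed.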
Re: Just a simple thought on health checks after a soft reload of HAProxy....
From: Malcolm Turnbull malc...@loadbalancer.org
Sent: 2014-01-14 07:13:27 E
To: haproxy@formilux.org
Subject: Just a simple thought on health checks after a soft reload of HAProxy

> Just a simple thought on health checks after a soft reload of HAProxy.
>
> If, for example, you had several backend servers, one of which had crashed... Then you make a configuration change to HAProxy and soft reload, for instance adding a new backend server. All the servers are instantly brought up and available for traffic (including the crashed one), so traffic will possibly be sent to a broken server... Obviously it's only a small problem, as it is fixed as soon as the health check actually runs... But I was just wondering: is there a way of saying "don't bring up a server until it passes a health check"?

I was just thinking of this issue myself and Google turned up your post.

Personally I would not like that every server is considered down until after the health checks pass. Basically this would result in things being down after a reload, which defeats the point of the reload being non-interruptive.

I can think of 2 possible solutions:

1) When the new process comes up, do an initial check on all servers (just one) which have checks enabled. Use that one check as the verdict for whether each server should be marked 'up' or 'down'. After each server has been checked once, then signal the other process to shut down and start listening.

2) Use the stats socket (if enabled) to pull the stats from the previous process. Use its health check data to pre-populate the health data of the new process. This one has a few drawbacks though: the server/backend names must match between the old and new config, and the stats socket has to be enabled. It would probably be harder to code as well, but I really don't know on that.

-Patrick
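A hedged sketch of how solution 2 could read health data from the previous process, using the standard "show stat" CSV output on the stats socket; the socket path is illustrative, and the awk field numbers assume the usual column order (pxname, svname, ..., status as the 18th field):

    # the old process must expose a stats socket, e.g.:
    #   global
    #       stats socket /var/run/haproxy-old.sock mode 600 level admin

    # pull per-server health state before starting the new process
    echo "show stat" | socat stdio /var/run/haproxy-old.sock \
        | awk -F',' 'NR>1 && $2!="FRONTEND" && $2!="BACKEND" {print $1, $2, $18}'
    # prints: <backend> <server> <status>, e.g. "app web1 DOWN"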
Just a simple thought on health checks after a soft reload of HAProxy....
Just a simple thought on health checks after a soft reload of HAProxy.

If, for example, you had several backend servers, one of which had crashed... Then you make a configuration change to HAProxy and soft reload, for instance adding a new backend server. All the servers are instantly brought up and available for traffic (including the crashed one), so traffic will possibly be sent to a broken server...

Obviously it's only a small problem, as it is fixed as soon as the health check actually runs... But I was just wondering: is there a way of saying "don't bring up a server until it passes a health check"?

--
Regards,
Malcolm Turnbull.
Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/