Re: queued health checks?
On Sun, Mar 21, 2010 at 12:05:16AM -0400, Greg Gard wrote:
> thanks holger, i did some research and was able to find more on mongrel and queuing. so that helps to clarify. i am unsure what i will do viz checking in the end as we have some long running requests that are frankly the bane of my existence and complicate load balancing. we need to refactor as part of the solution. just to be complete, are there any plans to have health checks get queued?

We'll see how we can do that when checks are reworked, but quite frankly, mongrel users are the *only* ones who need that, and when you have one server which can take one minute to respond to a single request, you have far more important trouble to worry about than whether health checks get queued or not!

If a server can only do one thing at a time, you must design it to do that one thing extremely fast. In your case, someone could ruin your day just by sending a few repeated clicks to your server and feed it work for a full day... There is obviously something wrong!

As Holger indicated, mongrel can queue requests, so if your server occasionally had long response times of one SECOND, the check would simply be queued and processed transparently. But at some point, infrastructure elements can't work around bad code, and nothing but fixing the code will make your users accept to wait for the response. Just imagine if someone posted a link to your site on a forum or any regularly visited site: your site would then be permanently down...

As a user, when I see that a site does not respond within 5-7 seconds, I first check my internet connectivity. After that, I declare the site dead and go somewhere else, which is especially true with online stores. You don't even know whether *any* of your users have ever waited for your site to respond to the 60+ second requests.

You should also consider the cost of fixing the code versus paying the electricity bill. Assuming your server consumes 400 W, a 60-second request consumes 24 kJ per click, the equivalent of running a 60 W light bulb for about 7 minutes. Some 20,000 clicks, which haproxy can accept in about one second, would feed your server two full weeks of work, or about 133 kWh!

SO YES, THERE IS SOMETHING DEFINITELY WRONG IN HAVING A WEB SERVER TAKE 60+ SECONDS TO PROCESS ONE REQUEST. And excuse me for being so crude, but if you don't fix that, your site is doomed to fail long before it gets even a minimal audience.

In the meantime, the only thing I can suggest is to use very large check timeouts (larger than the longest supposedly valid request), with a low retry count to avoid taking too much time to declare a server down, and probably to make use of the "observe", "on-error" and "error-limit" server options so that a server is set down as soon as a 5xx response is returned to a client.

Hoping this helps,
Willy
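The last paragraph above could translate into a backend definition roughly along these lines. This is only a sketch: the backend name, addresses, check URI and all timing values are assumptions to adapt, and the "low retry count" is read here as the server "fall" parameter.

    backend rails_mongrels
        option httpchk GET /ping        # hypothetical health-check URI
        timeout check 90s               # larger than the longest supposedly valid request
        # maxconn 1 lets mongrel handle one request at a time while haproxy queues the rest;
        # fall 2 keeps the number of failed checks needed to mark the server down low;
        # observe/on-error/error-limit mark the server down as soon as one 5xx is returned.
        server app1 10.0.0.11:8000 check inter 10s fall 2 rise 2 maxconn 1 observe layer7 on-error mark-down error-limit 1
        server app2 10.0.0.12:8000 check inter 10s fall 2 rise 2 maxconn 1 observe layer7 on-error mark-down error-limit 1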
Re: queued health checks?
yes it does willy. thanks. and i share your exasperation with our situation. rails/mongrel is a pain AND we have tons of very slow/cpu-intensive legacy asp reporting code that is mission critical. we are trying to undo the sins of our past (namely taking microsoft's word for it... sorry, any .net fans out there). anyway, we love haproxy and your suggestions over the past few months have worked out for the better. so i appreciate it, despite feeling a bit sheepish after reading your comments. :) looking forward to trying out 1.4.x

...gg

On Sun, Mar 21, 2010 at 3:06 AM, Willy Tarreau <w...@1wt.eu> wrote:
> [...]
> In the meantime, the only thing I can suggest is to use very large check timeouts (larger than the longest supposedly valid request), with a low retry count to avoid taking too much time to declare a server down, and probably to make use of the "observe", "on-error" and "error-limit" server options so that a server is set down as soon as a 5xx response is returned to a client.
>
> Hoping this helps,
> Willy

--
greg gard, psyd
www.carepaths.com
Re: HAProxy 1.4.2 on z/Linux - segfaults in LIST_DEL(&s->list); and hangs
Willy Tarreau wrote:
> > What would you consider a good indicator of its reliability? Would running flawlessly for a week straight be enough of testing?
>
> The fact that it runs a lot longer than the previous run is a natural indicator of reliability. However, it's not an indicator of correctness.

I sure agree that it isn't any proof of its correctness, but I can only say that it's been running for more than 40 hours now and I don't see any problems. I'll spare you the details of how many times the backend servers crashed in that time ;-)

> Whatever we spot, I'll keep in mind that we can get it to crash on your machine in 31-bit mode. If ever I come across a vicious bug that could explain that, I'd be happy to ask you to give it a try.

And I'll be happy to give it a go if only I still have access to that platform. Just in case you ever need it, you can run Debian (or, I imagine, any other distribution which supports s390/s390x) under the Hercules VM; here's a very nice HOWTO: http://www.josefsipek.net/docs/s390-linux/hercules-s390.html. I haven't tried using HAProxy on it, though I guess there shouldn't be any issues.

> Last, are you aware of any version that has worked reliably on your platform ?

Not really, it's the first time we're using HAProxy on that platform.

> OK so I wish you that it works well for this first time :-)

Cheers!

--
Dariusz Suchojad
Re: HAProxy 1.4.2 on z/Linux - segfaults in LIST_DEL(&s->list); and hangs
Hi Dariusz,

On Mon, Mar 22, 2010 at 03:54:13AM +0100, Dariusz Suchojad wrote:
> Willy Tarreau wrote:
> > > What would you consider a good indicator of its reliability? Would running flawlessly for a week straight be enough of testing?
> >
> > The fact that it runs a lot longer than the previous run is a natural indicator of reliability. However, it's not an indicator of correctness.
>
> I sure agree that it isn't any proof of its correctness, but I can only say that it's been running for more than 40 hours now and I don't see any problems. I'll spare you the details of how many times the backend servers crashed in that time ;-)

OK so now I'm confident that it is the 31-bit mode that triggers the problem.

> > Whatever we spot, I'll keep in mind that we can get it to crash on your machine in 31-bit mode. If ever I come across a vicious bug that could explain that, I'd be happy to ask you to give it a try.
>
> And I'll be happy to give it a go if only I still have access to that platform. Just in case you ever need it, you can run Debian (or, I imagine, any other distribution which supports s390/s390x) under the Hercules VM; here's a very nice HOWTO: http://www.josefsipek.net/docs/s390-linux/hercules-s390.html. I haven't tried using HAProxy on it, though I guess there shouldn't be any issues.

Oh, I've never heard of this VM. That's excellent, and Josef has put up a very nice howto! I'll probably try it someday, at least to satisfy my curiosity :-)

Cheers,
Willy