Re: queued health checks?

2010-03-21 Thread Willy Tarreau
On Sun, Mar 21, 2010 at 12:05:16AM -0400, Greg Gard wrote:
 thanks holger,
 
 i did some research and was able to find more on mongrel and queuing.
 so that helps to clarify. i am unsure what i will do vis-a-vis checking in
 the end as we have some long running requests that are frankly the
 bane of my existence and complicate load balancing. we need to
 refactor as part of the solution.
 
 just to be complete, are there any plans to have health checks get queued?

We'll see how we can do that when the checks are reworked, but quite
frankly, mongrel users are the *only* ones who need it, and when
you have a server which can take one minute to respond to a single
request, you have far more important troubles to worry about than
whether health checks get queued or not! If a server can only do one
thing at a time, you must design it to do that thing extremely fast.
In your case, someone could ruin your day just by sending a few
repeated clicks to your server, feeding it enough work for one full
day... There's obviously something wrong!

As Holger indicated, mongrel can queue requests, so if your server
only had occasional long response times of one SECOND, the check
would just be transparently queued and processed. But at some point,
infrastructure elements can't work around bad code, and nothing but
fixing the code will make your users willing to wait for the
response. Just imagine if someone posted a link to your site on a
forum or any regularly visited site: your site would then be
permanently down...

As a user, when I see that a site does not respond within 5-7
seconds, I first check my internet connectivity. After that, I
declare the site dead and go somewhere else, and that's especially
true with online stores. You don't even know if *any* of your users
have ever waited for your site to respond to those 60+ second
requests.

Also, you should consider the cost of fixing the code versus paying
the electricity bill... Assuming your server consumes 400W, at 60s
per request it consumes 24 kJ per click!!! That's the equivalent of
a 60W light bulb running for about 7 minutes. For 20,000 clicks,
which will take haproxy only one second to accept, you'll get 2 full
weeks of work for your server, or about 133 kWh!!!
SO YES, THERE IS DEFINITELY SOMETHING WRONG IN HAVING A WEB SERVER
TAKE 60+ SECONDS TO PROCESS ONE REQUEST. And excuse me for being so
crude, but if you don't fix that, your site is doomed to fail long
before it gets even a minimal audience.
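
For the record, the arithmetic goes like this, assuming the 400W
figure and 60 seconds of work per click:

    400 W x 60 s          = 24,000 J = 24 kJ per click
    24 kJ / 60 W          = 400 s, about 6.7 minutes of a 60W bulb
    20,000 clicks x 60 s  = 1,200,000 s, about 2 weeks of work
    20,000 clicks x 24 kJ = 480,000 kJ / 3,600 kJ/kWh, about 133 kWh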

In the meantime, the only thing I can suggest is to use a very large
check timeout (larger than the longest supposedly valid request),
with a low retry count to avoid taking too much time to declare a
server down, and probably to make use of the observe, on-error and
error-limit server options so that your server can be set down as
soon as a 5xx response is returned to a client.
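
A minimal sketch of such a configuration could look like this (the
backend name, address and timing values are only illustrative and
must be adapted to your workload):

    backend rails
        # let slow but supposedly valid requests complete before the
        # health check is allowed to time out
        timeout check 90s
        # maxconn 1 makes haproxy queue extra requests instead of
        # piling them up on mongrel; observe layer7 watches real
        # traffic, and after a single 5xx (error-limit 1) the server
        # is treated as if a check had failed (on-error mark-down)
        server mongrel1 10.0.0.1:8000 check inter 10s fall 2 rise 2 maxconn 1 observe layer7 on-error mark-down error-limit 1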

Hoping this helps,
Willy




Re: queued health checks?

2010-03-21 Thread Greg Gard
yes it does willy. thanks. and i share your exasperation with our
situation. rails/mongrel is a pain AND we have tons of very slow/cpu
intensive legacy asp reporting code that is mission critical. we are
trying to undo the sins of our past (namely taking microsoft's word
for it...sorry any .net fans out there). anyway, we love haproxy and
your suggestions over the past few months have worked out for the
better. so i appreciate it despite feeling a bit sheepish after
reading your comments. :)

looking forward to trying out 1.4.x

...gg

-- 
greg gard, psyd
www.carepaths.com



Re: HAProxy 1.4.2 on z/Linux - segfaults LIST_DEL(s->list); and hangs

2010-03-21 Thread Dariusz Suchojad

Willy Tarreau wrote:

  What would you consider a good indicator of its reliability? Would
  running flawlessly for a week straight be enough of testing?

 The fact that it runs a lot longer than the previous run is a natural
 indicator of reliability. However, it's not an indicator of correctness.


I sure agree that it isn't any proof of its correctness but I can only 
say that it's been running for more than 40 hours now and I don't see 
any problems. I'll spare you the details of how many times the backend 
servers crashed in that time ;-)


 Whatever we spot, I'll keep in mind that we can get it to crash on
 your machine in 31-bit mode. If ever I come across a vicious bug
 that could explain that, I'd be happy to ask you to give it a try.

And I'll be happy to give it a go, provided I still have access to that
platform. Just in case you ever need it, you can run Debian (or, I
imagine, any other distribution which supports s390/s390x) under the
Hercules VM; here's a very nice HOWTO:
http://www.josefsipek.net/docs/s390-linux/hercules-s390.html. I haven't
tried running HAProxy on it, but I guess there shouldn't be any issues.
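
For what it's worth, a minimal hercules.cnf along these lines should
be enough to boot a 31-bit (ESA/390) guest; all values below are just
examples, the HOWTO has the real details:

    ARCHMODE ESA/390           # 31-bit mode (ESAME would be 64-bit)
    MAINSIZE 512               # guest memory, in MB
    NUMCPU   1
    CNSLPORT 3270              # port for the 3270 console
    0120     3390 debian.3390  # DASD volume holding the guest system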



 Last, are you aware of any version that has worked reliably on
 your platform?


Not really, it's the first time we're using HAProxy on that platform.


OK, so I hope it works well for this first time :-)


Cheers!

--
Dariusz Suchojad



Re: HAProxy 1.4.2 on z/Linux - segfaults LIST_DEL(s->list); and hangs

2010-03-21 Thread Willy Tarreau
Hi Dariusz,

On Mon, Mar 22, 2010 at 03:54:13AM +0100, Dariusz Suchojad wrote:
 Willy Tarreau wrote:
 
  What would you consider a good indicator of its reliability? Would
  running flawlessly for a week straight be enough of testing?
 
 The fact that it runs a lot longer than the previous run is a natural
 indicator of reliability. However, it's not an indicator of correctness.
 
 I sure agree that it isn't any proof of its correctness but I can only 
 say that it's been running for more than 40 hours now and I don't see 
 any problems. I'll spare you the details of how many times the backend 
 servers crashed in that time ;-)

OK so now I'm confident that it is the 31-bit mode that triggers the
problem.

  Whatever we spot, I'll keep in mind that we can get it to crash on
  your machine in 31-bit mode. If ever I come across a vicious bug
  that could explain that, I'd be happy to ask you to give it a try.
 
 And I'll be happy to give it a go, provided I still have access to that
 platform. Just in case you ever need it, you can run Debian (or, I
 imagine, any other distribution which supports s390/s390x) under the
 Hercules VM; here's a very nice HOWTO:
 http://www.josefsipek.net/docs/s390-linux/hercules-s390.html. I haven't
 tried running HAProxy on it, but I guess there shouldn't be any issues.

Oh, I had never heard of this VM. That's excellent. And Josef has put
up a very nice HOWTO! I'll probably try it someday, at least to
satisfy my curiosity :-)

Cheers,
Willy