Hi Chih Yin,

On Wed, May 19, 2010 at 04:47:00PM -0700, Chih Yin wrote:
> > On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> > > As for the logs, it seems that I'll need to look at the configuration for
> > > HAProxy a bit more to make some adjustments first. A few months back, I
> > > know I saw messages indicating the status of servers (e.g. 3 active, 2
> > > backup).
> >
> > Normally this means that a server is failing to respond to some health
> > checks, either because it crashed or froze, or because it's overloaded.
> >
> Wow. I'm growing concerned with this. What I've noticed is that these
> messages were encountered almost daily for almost a year, but disappeared
> since we migrated to the blade servers. The disconcerting part is that
> since we made that migration, all indications are that the virtual servers
> have been less reliable than before. Yet, I haven't seen these messages at
> all.
And most likely it is because you don't have a separate log anymore that
you don't see the messages. Please try a simple test on your logs: look
for messages such as "Server xxx/yyy is UP" (or DOWN). In practice it's
enough to look for the word 'is' surrounded with spaces:

   $ fgrep ' is ' haproxy.log

You can even check for messages indicating that you have lost your last
server:

   $ fgrep ' has no server ' haproxy.log

If your logs have not been filtered out, you should find these events.

> > What I see is that your "contimeout" is set to 8 seconds and you have no
> > "timeout queue". In this case, the queue timeout defaults to the
> > contimeout, which is rather short. It means that when all your servers
> > are saturated, a request will go to the queue and if no server releases
> > a connection within 8 seconds, the client will get a 503. At least you
> > should add "timeout queue 80s" to give more chances to your new client
> > requests to get served within the previous requests' timeout. While this
> > is a very high timer, it might help troubleshoot your issues.
> >
> I guess I'm a bit confused. In the configuration file, I see the following
> in the defaults section:
>
> defaults
>     mode http
>     maxconn 1024
>     *contimeout 8000*
>     clitimeout 80000
>     srvtimeout 80000
>     *timeout queue 50000*

Ah yes, sorry about that, I missed it when quickly reviewing your config.
Maybe because of the mixed syntax. So that means that your users will wait
up to 50s in the queue, which should be more than enough. So most likely
the 503s are only caused by cases where you don't have any remaining
server up.

One important point I've just noticed: you don't have "option abortonclose".
You should definitely have it with such long timeouts, because there are
high chances that most users won't wait that long, or will click the reload
button while their request is still in the queue. With that option enabled,
the old pending request will be aborted if the user clicks stop or reload.
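By the way, a defaults section combining the settings discussed above could
look like the sketch below. It only reuses the timer values from your own
config, translated to the unified "timeout" syntax to avoid the old/new mix;
please double-check it against your full configuration before using it:

```
defaults
    mode http
    maxconn 1024
    option abortonclose     # abort queued requests whose client gave up
    timeout connect 8s      # equivalent to "contimeout 8000"
    timeout client 80s      # equivalent to "clitimeout 80000"
    timeout server 80s      # equivalent to "srvtimeout 80000"
    timeout queue 50s       # max time a request may wait for a free server
```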
This is important, otherwise you could get a lot of requests in the queue
if the user clicks reload 10 times in a row.

> Am I misunderstanding and looking at the wrong spot? Also, is there a
> standard timeout for the queue that is reasonable, or would this be a
> value that varies from website to website?

It varies from site to site, and should reflect the maximum time you think
a user will accept to wait. But a good guess is to use the same value as
the server timeout, because it should also be set to the maximum time a
user will accept to wait :-)

But you should be aware that 50 or 80 seconds are extremely long. Some
sites require such large timeouts for a few very specific requests which
can take a long time, but your average response time should be below a few
hundred milliseconds for dynamic objects and around a millisecond for
static objects. I suggest that you run "halog -pct" on your logs; it will
show you how your response times are spread.

Regards,
Willy

