Hi Chih Yin,

On Wed, May 19, 2010 at 04:47:00PM -0700, Chih Yin wrote:
> > On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> > > As for the logs, it seems that I'll need to look at the configuration for
> > > HAProxy a bit more to make some adjustments first.  A few months back, I
> > > know I saw messages indicating the status of server (e.g. 3 active, 2
> > > backup).
> >
> > Normally this means that a server is failing to respond to some health
> > checks,
> > either because it crashed or froze, or because it's overloaded.
> >
> >
> Wow.  I'm growing concerned with this.  What I've noticed is that these
> messages were encountered almost daily for almost a year, but disappeared
> since we migrated to the blade servers.  The disconcerting part is that
> since we made that migration, all indications are that the virtual servers
> have been less reliable than before.  Yet, I haven't seen these messages at
> all.

Most likely you no longer see these messages because you no longer have
a separate log for them. Please try a simple test on your logs: look
for messages like "Server xxx/yyy is UP" (or DOWN). In practice it's enough
to look for the word 'is' surrounded by spaces:

  $ fgrep ' is ' haproxy.log

You can even check for messages indicating that you have lost your last
server :

  $ fgrep ' has no server ' haproxy.log

If your logs have not been filtered out, you should find these events.

> > What I see is that your "contimeout" is set to 8 seconds and you have no
> > "timeout queue". In this case, the queue timeout defaults to the
> > contimeout,
> > which is rather short. It means that when all your servers are saturated, a
> > request will go to the queue, and if no server releases a connection within
> > 8 seconds, the client will get a 503. At the very least you should add
> > "timeout queue 80s" to give new client requests a better chance to be
> > served within the previous requests' timeout. While this is a very high
> > timer, it might help troubleshoot your issues.
> >
> >
> I guess I'm a bit confused.  In the configuration file, I see the following
> in the defaults section:
> 
> defaults
>     mode            http
>     maxconn         1024
>     *contimeout      8000*
>     clitimeout      80000
>     srvtimeout      80000
>     *timeout queue   50000*

Ah yes, sorry about that, I missed it when quickly reviewing your
config, maybe because of the mixed syntax. So that means your
users will wait up to 50s in the queue, which should be more than
enough. So most likely the 503s are only caused by cases where you
don't have any remaining server up.

One important point I've just noticed: you don't have
"option abortonclose". You should definitely enable it with
such long timeouts, because chances are high that most users
won't wait that long, or will click the reload button while
their request is in the queue. With that option enabled, the old
pending request is aborted if the user clicks stop or reload.
This is important, because otherwise you could accumulate a lot of
requests in the queue if a user clicks reload 10 times in a row.
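
A minimal sketch of what this would look like, based on the defaults
section you posted (only the abortonclose line is new; everything else
is unchanged from your config):

```
defaults
    mode            http
    maxconn         1024
    option          abortonclose
    contimeout      8000
    clitimeout      80000
    srvtimeout      80000
    timeout queue   50000
```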

> Am I misunderstanding and looking at the wrong spot?  Also, is there a
> standard timeout for the queue that is reasonable, or would this be a value
> that varies from website to website?

It varies from site to site, and should reflect the maximum time
you think a user will accept to wait. But a good guess is to use
the same value as the server timeout, because it too should be set
to the maximum time a user will accept to wait :-)

But you should be aware that 50 or 80 seconds is extremely long.
Some sites need such large timeouts for a few very specific requests
which can take a long time, but your average request time should
be below a few hundred milliseconds for dynamic objects and
around a millisecond for static objects. I suggest that you run
"halog -pct" on your logs; it will show you how your response times
are distributed.
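
For example, feeding your log file to halog on stdin (assuming it is
named haproxy.log, as in the fgrep examples above):

```
  $ halog -pct < haproxy.log
```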

Regards,
Willy

