Hi Willy,

On Wed, May 19, 2010 at 9:39 PM, Willy Tarreau <[email protected]> wrote:

> Hi Chih Yin,
>
> On Wed, May 19, 2010 at 04:47:00PM -0700, Chih Yin wrote:
> > > On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> > > > As for the logs, it seems that I'll need to look at the configuration
> for
> > > > HAProxy a bit more to make some adjustments first.  A few months
> back, I
> > > > know I saw messages indicating the status of server (e.g. 3 active, 2
> > > > backup).
> > >
> > > Normally this means that a server is failing to respond to some health
> > > checks,
> > > either because it crashed or froze, or because it's overloaded.
> > >
> > >
> > Wow.  I'm growing concerned about this.  What I've noticed is that these
> > messages appeared almost daily for nearly a year, but have disappeared
> > since we migrated to the blade servers.  The disconcerting part is that
> > since we made that migration, all indications are that the virtual servers
> > have been less reliable than before.  Yet I haven't seen these messages
> > at all.
>
> And most likely you don't see the messages anymore because you no
> longer have a separate log. Please try a simple test on your logs:
> look for messages "Server xxx/yyy is UP" (or DOWN). In practice it's
> enough to look for the word 'is' surrounded by spaces:
>
>  $ fgrep ' is ' haproxy.log
>
> You can even check for messages indicating that you have lost your last
> server :
>
>  $ fgrep ' has no server ' haproxy.log
>
> If your logs have not been filtered out, you should find these events.
>
>
I ran both commands and did not get any results.  It would seem that I need
to search for other locations where this information might be kept.
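To rule out a problem with the filter itself before I go hunting for the real log location, I sanity-checked the ' is ' pattern against a fabricated sample line (hostnames and server names made up):

```shell
# Write a fabricated HAProxy status line, then confirm the ' is ' filter matches it
printf '%s\n' "May 19 21:00:01 lb haproxy[1234]: Server pool1/web1 is DOWN." > /tmp/sample.log
fgrep ' is ' /tmp/sample.log
```

So the pattern itself works; the events just aren't landing in haproxy.log on these machines.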


> > > What I see is that your "contimeout" is set to 8 seconds and you have
> > > no "timeout queue". In this case, the queue timeout defaults to the
> > > contimeout, which is rather short. It means that when all your servers
> > > are saturated, a request will go to the queue, and if no server releases
> > > a connection within 8 seconds, the client will get a 503. At the least
> > > you should add "timeout queue 80s" to give your new client requests more
> > > of a chance to get served within the previous requests' timeout. While
> > > this is a very high timer, it might help troubleshoot your issues.
> > >
> > >
> > I guess I'm a bit confused.  In the configuration file, I see the
> > following in the defaults section:
> >
> > defaults
> >     mode            http
> >     maxconn         1024
> >     contimeout      8000
> >     clitimeout      80000
> >     srvtimeout      80000
> >     timeout queue   50000
>
> Ah yes, sorry about that, I missed it when quickly reviewing your
> config, maybe because of the mixed syntax. So that means your
> users will wait up to 50s in the queue, which should be more than
> enough. So most likely the 503s are only caused by cases where you
> don't have any remaining server up.
>
> One important point I've just noticed: you don't have
> "option abortonclose". You should definitely have it with
> such long timeouts, because there is a high chance that most
> users won't wait that long, or will click the reload button while
> their request is in the queue. With that option enabled, the old
> pending request will be aborted if the user clicks stop or reload.
> This is important; otherwise you could get a lot of requests in
> the queue if someone clicks reload 10 times in a row.
>
>
Thank you.  I have a feeling this will be a very helpful suggestion.  From
what I observe of how our internal users behave on the website, this change
will have a tremendous impact.
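For my own reference, here is roughly how I expect our defaults section to look with the suggested option added (unverified until I actually test it):

```
defaults
    mode            http
    maxconn         1024
    contimeout      8000
    clitimeout      80000
    srvtimeout      80000
    timeout queue   50000
    option abortonclose
```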


> > Am I misunderstanding and looking at the wrong spot?  Also, is there a
> > standard timeout for the queue that is reasonable, or would this be a
> > value that varies from website to website?
>
> It varies from site to site, and should reflect the maximum time
> you think a user will accept to wait. But a good guess is to use
> the same value as the server timeout because it should also be set
> to the maximum time a user will accept to wait :-)
>
>
At this point, I'm very grateful for our users, most of whom seem to have
infinite patience.  :)


> But you should be aware that 50 or 80 seconds are extremely long.
> Some sites require such large timeouts for a very specific request
> which can take a long time, but your average request time should
> be below a few hundred milliseconds for dynamic objects and
> around a millisecond for static objects. I suggest that you run
> "halog -pct" on your logs; it will show you how your response
> times are spread.
>
>
I am trying to think of possible reasons the timeouts were set to 50 and
80 seconds.  The only thing I can think of is that there is a lot of
inter-server traffic involved in responding to some of the requests.  Maybe
the timeouts were set so that, for some of this content, the initial request
will not time out while waiting for the back-end servers to respond?
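In the meantime, to remind myself how to read a percentile spread before I can run halog for real, a quick sketch with fabricated response times (milliseconds, entirely made up):

```shell
# Ten fabricated response times in ms; after a numeric sort, the
# nearest-rank p50 is the 5th value and p90 is the 9th
printf '%s\n' 3 5 7 90 120 150 200 450 800 45000 > /tmp/times.txt
sort -n /tmp/times.txt | sed -n '5p;9p'
# prints 120 then 800
```

Even in this made-up spread, only the single 45-second outlier comes anywhere near our 50-80 second timeouts, which is the kind of tail I expect halog to reveal.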

I am attempting to update the HAProxy configuration file I sent out earlier
this week to incorporate all the changes everyone has suggested.  I think
I'll be ready to test this out tomorrow or early next week.

C.Y.


> Regards,
> Willy
>
>
