Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-20 Thread Chih Yin
On Wed, May 19, 2010 at 9:43 PM, Willy Tarreau  wrote:

> On Wed, May 19, 2010 at 04:49:02PM -0700, Chih Yin wrote:
> > Hi Mariusz,
> >
> > On Wed, May 19, 2010 at 2:18 PM, Mariusz Gronczewski wrote:
> >
> > > One more thing about config, u dont need to do
> > > acl is_msn01 hdr_sub(X-Forwarded-For) 64.4.0
> > > acl is_msn02 hdr_sub(X-Forwarded-For) 64.4.1
> > > acl is_msn03 hdr_sub(X-Forwarded-For) 64.4.2
> > > and then
> > >   use_backend robot_traffic if is_msn01 or is_msn02 or is_msn03
> > >
> > > u can just do
> > > acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
> > > acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
> > > acl is_msn hdr_sub(X-Forwarded-For) 64.4.2
> > >
> > > and then
> > >  use_backend robot_traffic if is_msn
> > >
> > > ACLs with same name are automatically ORed together.
> > >
> > > or better yet, match bots by user-agent not by IP
> > > http://www.useragentstring.com/pages/useragentstring.php
> > >
> > >
> > Thank you so much.  This is definitely helpful!
>
> Also, since 1.3.21 you have the "hdr_ip" ACL which can parse
> IP addresses from headers. What that means is that instead of
> doing sub-string matching, you can match networks, which is
> faster and allows globbing. For instance :
>
> acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
> acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
>
> can be replaced by :
>
> acl is_msn hdr_ip(X-Forwarded-For) 64.4.0.0/15
>
> And with 1.4.6, you'll even be able to fill all known networks
> in a file and load them in one line :
>
> acl is_msn hdr_ip(X-Forwarded-For) -f /etc/haproxy/msn_networks.txt
>
>
Thank you for the suggestion Willy.  I will definitely give this a try as
well.

C.Y.


> Willy
>
>


Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-20 Thread Chih Yin
Hi Willy,

On Wed, May 19, 2010 at 9:39 PM, Willy Tarreau  wrote:

> Hi Chih Yin,
>
> On Wed, May 19, 2010 at 04:47:00PM -0700, Chih Yin wrote:
> > > On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> > > > As for the logs, it seems that I'll need to look at the configuration
> for
> > > > HAProxy a bit more to make some adjustments first.  A few months
> back, I
> > > > know I saw messages indicating the status of server (e.g. 3 active, 2
> > > > backup).
> > >
> > > Normally this means that a server is failing to respond to some health
> > > checks,
> > > either because it crashed or froze, or because it's overloaded.
> > >
> > >
> > Wow.  I'm growing concerned with this.  What I've noticed is that these
> > messages were encountered almost daily for almost a year, but disappeared
> > since we migrated to the blade servers.  The disconcerting part is that
> > since we made that migration, all indications are that the virtual servers
> > have been less reliable than before.  Yet, I haven't seen these messages
> at
> > all.
>
> Most likely you don't see the messages because you no longer have a
> separate log. Please try a simple test on your logs : look
> for messages "Server xxx/yyy is UP" (or DOWN). In practice it's enough to
> look for the 'is' word surrounded with spaces :
>
>  $ fgrep ' is ' haproxy.log
>
> You can even check for messages indicating that you have lost your last
> server :
>
>  $ fgrep ' has no server ' haproxy.log
>
> If your logs have not been filtered out, you should find these events.
>
>
I ran both commands and did not get any results.  It would seem that I need
to search for other locations where this information might be kept.


> > > What I see is that your "contimeout" is set to 8 seconds and you have
> no
> > > "timeout queue". In this case, the queue timeout defaults to the
> > > contimeout,
> > > which is rather short. It means that when all your servers are
> saturated, a
> > > request will go to the queue and if no server releases a connection
> within
> > > 8 seconds, the client will get a 503. At least you should add
> > > "timeout queue 80s" to give more chances to your new client requests to
> get
> > > served within the previous requests' timeout. While this is a very high
> > > timer
> > > it might help troubleshoot your issues.
> > >
> > >
> > I guess I'm a bit confused.  In the configuration file, I see the
> following
> > in the defaults section:
> >
> > defaults
> > mode    http
> > maxconn 1024
> > *contimeout  8000*
> > clitimeout  80000
> > srvtimeout  80000
> > *timeout queue   50000*
>
> Ah yes, sorry about that, I missed it when quickly reviewing your
> config. Maybe because of the mixed syntax. So that means that your
> users will wait up to 50s in the queue, which should be more than
> enough. So most likely the 503s are only caused by cases where you
> don't have any remaining server up.
>
> One important point I've just noticed : you don't have
> "option abortonclose". You should definitely have it with
> such long timeouts, because there are high chances that most
> users won't wait that long or will click the reload button while
> their request is in the queue. With that option enabled, the old
> pending request will be aborted if the user clicks stop or reload.
> This is important, otherwise you could get a lot of requests in
> queue if the guy clicks reload 10 times in a row.
>
>
Thank you.  I have a feeling this will be a very helpful suggestion.  From
what I observe of how our internal users behave on the website, this change
will have a tremendous impact.


> > Am I misunderstanding and looking at the wrong spot?  Also, is there a
> > standard timeout for the queue that is reasonable, or would this be a
> value
> > that varies from website to website?
>
> It varies from site to site, and should reflect the maximum time
> you think a user will accept to wait. But a good guess is to use
> the same value as the server timeout because it should also be set
> to the maximum time a user will accept to wait :-)
>
>
At this point, I'm very grateful for our users, most of whom seem to have
infinite patience.  :)


> But you should be aware that 50 or 80 seconds are extremely long.
> Some sites require that large timeouts for a very specific request
> which can take a long time, but your average request time should
> be below the hundreds of milliseconds for dynamic objects and
> around the millisecond for static objects. I suggest that you pass
> "halog -pct" on your logs, it will show you how your response times
> are spread.
>
>
I am trying to think of possible reasons that the timeout was set to 50 and
80 seconds.  The only thing I can think of is that there is a lot of
inter-server traffic occurring to respond to some of the requests.  Maybe
the timeout was set so that for some of this content, the initial request
will not time out while waiting for the back-end servers to

Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-19 Thread Willy Tarreau
On Wed, May 19, 2010 at 04:49:02PM -0700, Chih Yin wrote:
> Hi Mariusz,
> 
> On Wed, May 19, 2010 at 2:18 PM, Mariusz Gronczewski wrote:
> 
> > One more thing about config, u dont need to do
> > acl is_msn01 hdr_sub(X-Forwarded-For) 64.4.0
> > acl is_msn02 hdr_sub(X-Forwarded-For) 64.4.1
> > acl is_msn03 hdr_sub(X-Forwarded-For) 64.4.2
> > and then
> >   use_backend robot_traffic if is_msn01 or is_msn02 or is_msn03
> >
> > u can just do
> > acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
> > acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
> > acl is_msn hdr_sub(X-Forwarded-For) 64.4.2
> >
> > and then
> >  use_backend robot_traffic if is_msn
> >
> > ACLs with same name are automatically ORed together.
> >
> > or better yet, match bots by user-agent not by IP
> > http://www.useragentstring.com/pages/useragentstring.php
> >
> >
> Thank you so much.  This is definitely helpful!

Also, since 1.3.21 you have the "hdr_ip" ACL which can parse
IP addresses from headers. What that means is that instead of
doing sub-string matching, you can match networks, which is
faster and allows globbing. For instance :

 acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
 acl is_msn hdr_sub(X-Forwarded-For) 64.4.1

can be replaced by :

 acl is_msn hdr_ip(X-Forwarded-For) 64.4.0.0/15

And with 1.4.6, you'll even be able to fill all known networks
in a file and load them in one line :

 acl is_msn hdr_ip(X-Forwarded-For) -f /etc/haproxy/msn_networks.txt
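The file simply lists one address or CIDR network per line. A hypothetical /etc/haproxy/msn_networks.txt could look like this (the ranges below are illustrative, not an authoritative list of MSN networks):

```
64.4.0.0/15
65.52.0.0/14
157.55.0.0/16
```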

Willy




Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-19 Thread Willy Tarreau
Hi Chih Yin,

On Wed, May 19, 2010 at 04:47:00PM -0700, Chih Yin wrote:
> > On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> > > As for the logs, it seems that I'll need to look at the configuration for
> > > HAProxy a bit more to make some adjustments first.  A few months back, I
> > > know I saw messages indicating the status of server (e.g. 3 active, 2
> > > backup).
> >
> > Normally this means that a server is failing to respond to some health
> > checks,
> > either because it crashed or froze, or because it's overloaded.
> >
> >
> Wow.  I'm growing concerned with this.  What I've noticed is that these
> messages were encountered almost daily for almost a year, but disappeared
> since we migrated to the blade servers.  The disconcerting part is that
> since we made that migration, all indications are that the virtual servers
> have been less reliable than before.  Yet, I haven't seen these messages at
> all.

Most likely you don't see the messages because you no longer have a
separate log. Please try a simple test on your logs : look
for messages "Server xxx/yyy is UP" (or DOWN). In practice it's enough to
look for the 'is' word surrounded with spaces :

  $ fgrep ' is ' haproxy.log

You can even check for messages indicating that you have lost your last
server :

  $ fgrep ' has no server ' haproxy.log

If your logs have not been filtered out, you should find these events.
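To illustrate on a synthetic log (the sample lines below are invented for the test; the wording only approximates the 1.3-era state-change messages, and your real file is wherever rsyslogd writes it):

```shell
# Build a tiny hypothetical sample of haproxy state-change log lines.
cat > /tmp/haproxy_sample.log <<'EOF'
May 19 10:01:02 lb1 haproxy[1234]: Server web_farm/web01 is DOWN. 3 active and 2 backup servers left.
May 19 10:02:10 lb1 haproxy[1234]: Server web_farm/web01 is UP. 4 active and 2 backup servers online.
May 19 10:05:33 lb1 haproxy[1234]: backend web_farm has no server available!
EOF

# Server state transitions: the word 'is' surrounded by spaces.
fgrep ' is ' /tmp/haproxy_sample.log

# Events where the last server of a backend was lost.
fgrep ' has no server ' /tmp/haproxy_sample.log
```

If both greps come back empty on the real log, the state-change messages are being filtered out before they reach the file.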

> > What I see is that your "contimeout" is set to 8 seconds and you have no
> > "timeout queue". In this case, the queue timeout defaults to the
> > contimeout,
> > which is rather short. It means that when all your servers are saturated, a
> > request will go to the queue and if no server releases a connection within
> > 8 seconds, the client will get a 503. At least you should add
> > "timeout queue 80s" to give more chances to your new client requests to get
> > served within the previous requests' timeout. While this is a very high
> > timer
> > it might help troubleshoot your issues.
> >
> >
> I guess I'm a bit confused.  In the configuration file, I see the following
> in the defaults section:
> 
> defaults
> mode    http
> maxconn 1024
> *contimeout  8000*
> clitimeout  80000
> srvtimeout  80000
> *timeout queue   50000*

Ah yes, sorry about that, I missed it when quickly reviewing your
config. Maybe because of the mixed syntax. So that means that your
users will wait up to 50s in the queue, which should be more than
enough. So most likely the 503s are only caused by cases where you
don't have any remaining server up.

One important point I've just noticed : you don't have
"option abortonclose". You should definitely have it with
such long timeouts, because there are high chances that most
users won't wait that long or will click the reload button while
their request is in the queue. With that option enabled, the old
pending request will be aborted if the user clicks stop or reload.
This is important, otherwise you could get a lot of requests in
queue if the guy clicks reload 10 times in a row.
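A sketch of the resulting defaults section (the millisecond values are inferred from the 50s/80s figures discussed in this thread, and the comments are added; adjust to your own site):

```
defaults
    mode            http
    maxconn         1024
    option          abortonclose    # abort queued requests whose client gave up
    contimeout      8000            # 8s to obtain a server connection
    clitimeout      80000           # 80s client inactivity
    srvtimeout      80000           # 80s server inactivity
    timeout queue   50000           # 50s maximum wait in the queue
```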

> Am I misunderstanding and looking at the wrong spot?  Also, is there a
> standard timeout for the queue that is reasonable, or would this be a value
> that varies from website to website?

It varies from site to site, and should reflect the maximum time
you think a user will accept to wait. But a good guess is to use
the same value as the server timeout because it should also be set
to the maximum time a user will accept to wait :-)

But you should be aware that 50 or 80 seconds are extremely long.
Some sites require that large timeouts for a very specific request
which can take a long time, but your average request time should
be below the hundreds of milliseconds for dynamic objects and
around the millisecond for static objects. I suggest that you pass
"halog -pct" on your logs, it will show you how your response times
are spread.

Regards,
Willy




Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-19 Thread Chih Yin
Hi Mariusz,

On Wed, May 19, 2010 at 2:18 PM, Mariusz Gronczewski wrote:

> One more thing about config, u dont need to do
> acl is_msn01 hdr_sub(X-Forwarded-For) 64.4.0
> acl is_msn02 hdr_sub(X-Forwarded-For) 64.4.1
> acl is_msn03 hdr_sub(X-Forwarded-For) 64.4.2
> and then
>   use_backend robot_traffic if is_msn01 or is_msn02 or is_msn03
>
> u can just do
> acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
> acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
> acl is_msn hdr_sub(X-Forwarded-For) 64.4.2
>
> and then
>  use_backend robot_traffic if is_msn
>
> ACLs with same name are automatically ORed together.
>
> or better yet, match bots by user-agent not by IP
> http://www.useragentstring.com/pages/useragentstring.php
>
>
Thank you so much.  This is definitely helpful!

C.Y.


Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-19 Thread Chih Yin
Hi Willy,

On Wed, May 19, 2010 at 5:16 AM, Willy Tarreau  wrote:

> Hi,
>
> I have a quick comment for now, before going deep through your config.
>
> On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> > As for the logs, it seems that I'll need to look at the configuration for
> > HAProxy a bit more to make some adjustments first.  A few months back, I
> > know I saw messages indicating the status of server (e.g. 3 active, 2
> > backup).
>
> Normally this means that a server is failing to respond to some health
> checks,
> either because it crashed or froze, or because it's overloaded.
>
>
Wow.  I'm growing concerned with this.  What I've noticed is that these
messages were encountered almost daily for almost a year, but disappeared
since we migrated to the blade servers.  The disconcerting part is that
since we made that migration, all indications are that the virtual servers
have been less reliable than before.  Yet, I haven't seen these messages at
all.


> What I see is that your "contimeout" is set to 8 seconds and you have no
> "timeout queue". In this case, the queue timeout defaults to the
> contimeout,
> which is rather short. It means that when all your servers are saturated, a
> request will go to the queue and if no server releases a connection within
> 8 seconds, the client will get a 503. At least you should add
> "timeout queue 80s" to give more chances to your new client requests to get
> served within the previous requests' timeout. While this is a very high
> timer
> it might help troubleshoot your issues.
>
>
I guess I'm a bit confused.  In the configuration file, I see the following
in the defaults section:

defaults
mode    http
maxconn 1024
*contimeout  8000*
clitimeout  80000
srvtimeout  80000
*timeout queue   50000*

Am I misunderstanding and looking at the wrong spot?  Also, is there a
standard timeout for the queue that is reasonable, or would this be a value
that varies from website to website?

Thank you,
C.Y.





> Regards,
> Willy
>
>


Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-19 Thread Chih Yin
Hi Hank,

On Tue, May 18, 2010 at 7:45 PM, Hank A. Paulson <
h...@spamproof.nospammail.net> wrote:

> You do have a bunch of services that are http mode that don't seem to have
> any type of http close. Some I don't understand why they are not http mode,
> and they probably should be.
>
> Just a note you may be able to greatly simplify (and possibly speed up)
> your config using the new capabilities for tables of IPs added in 1.4.6.
>
> solr should probably be http mode and anywhere else that you have http mode
> you probably want an http close option turned on.
>
> I am not sure why they chose dispatch for the prod glassfish server, my
> guess is they are running apache and mod_jk or something and then forwarding
> the requests to different glassfish servers - is there really more than one
> prod glassfish server? I am wondering if the previous admin set up more
> than one copy of haproxy and that is why several services are redirected to
> the same machine - like glassfish prod there is no other reference to port
> 4850 in this config, so what is running on port 4850? haproxy/apache/heaven
> forbid - glassfish itself? netstat -antope | fgrep LIST | fgrep 4850
>
> I think one of the problems is the "inter_server": it doesn't have http mode
> set so if more than one hit/request comes in on an open connection then your
> request parsing rules are not run on any requests except the first one (as
> Willy keeps reminding people). That might work ok for most things since you
> are mostly breaking things up by service: liferay goes to the liferay
> servers, etc - the problem comes in if you have a portal that people sign
> into and then have a menu/navbar that they can choose different services
> that should be going to different front/backends.
>
>
Thank you for these suggestions.  I also saw your follow-up email.  I will
start applying and testing the changes you recommended.  I think your
comments on the "inter_server" are probably close to the truth.  I'm
definitely seeing some unexplained behavior from our website when users are
utilizing different services.

I've also made the recommendation to my director to evaluate version 1.4.6
while we plan the upgrade from 1.3.21 to 1.3.22.


>
> On 5/18/10 3:49 PM, Chih Yin wrote:
>
>>
>>
>> On Mon, May 17, 2010 at 11:11 PM, Hank A. Paulson wrote:
>>
>>On 5/17/10 10:24 PM, Willy Tarreau wrote:
>>
>>On Mon, May 17, 2010 at 07:42:03PM -0700, Hank A. Paulson wrote:
>>
>>I have some sites running a similar set up - Xen domU,
>>keepalived,
>>fedora not RHEL and they get 50+ million hits per day with
>>pretty
>>fast response. you might want to use the "log separate
>>errors" (sp?)
>>option and review those 50X errors carefully, you might see
>>a pattern
>>- do you have http-close* in all your configs? That got me
>>weird, slow
>>results when I missed it once.
>>
>>
>>Indeed, that *could* be a possibility if combined with a server
>>maxconn
>>because connections would be kept for a long time on the server
>>(waiting
>>for either the client or the server to close) and during that
>>time nobody
>>else could connect. The typical problem with keep-alive to the
>>servers in
>>fact. The 503 could be caused by requests waiting too long in
>>the queue
>>then.
>>
>>
>>My example was just to assure Chih Yin that haproxy on xen should be
>>able to handle his current load depending, of course, on the
>>glassfish servers.
>>
>>I meant some kind of httpclose option
>>(httpclose/forceclose/http-server-close/etc) turned on regardless of
>>keep-alive status - you know, like you are always reminding people :)
>>
>>I noticed when I forgot it on a section (that was not keepalive
>>related) it caused wacky results - hanging browsers,
>>images/icons/css not showing up, etc. Obviously it should not affect
>>single requests like you would assume Akamai would be sending, it
>>was a pure guess.
>>
>>
>> Thank you everyone for your feedback.  I really appreciate your help.
>>
>> Sorry for taking so long to respond.  I had to get permission from my
>> director to post some of the log data and our haproxy configuration
>> file.  I also had to hide a bit more of the configuration than was
>> suggested because of concerns about making the issues we're encountering
>> too public.  I hope you understand.
>>
>>  From my research on HAProxy and high availability websites in general,
>> it seemed to me that compared to other websites, our traffic volume is
>> actually rather light.  In addition to how we have configured HAProxy
>> for our infrastructure, I'm definitely also taking a look at our
>> application servers and our content as well.
>>
>> I started looking at the log files and the HAProxy 

Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-19 Thread Mariusz Gronczewski
One more thing about config, u dont need to do
acl is_msn01 hdr_sub(X-Forwarded-For) 64.4.0
acl is_msn02 hdr_sub(X-Forwarded-For) 64.4.1
acl is_msn03 hdr_sub(X-Forwarded-For) 64.4.2
and then
  use_backend robot_traffic if is_msn01 or is_msn02 or is_msn03

u can just do
acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
acl is_msn hdr_sub(X-Forwarded-For) 64.4.2

and then
 use_backend robot_traffic if is_msn

ACLs with same name are automatically ORed together.

or better yet, match bots by user-agent not by IP
http://www.useragentstring.com/pages/useragentstring.php
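As a sketch of the user-agent approach (the substrings and backend name below are assumptions; check them against the list at the URL above before relying on them):

```
# Case-insensitive User-Agent substring matches; same-name ACLs are ORed.
acl is_bot hdr_sub(User-Agent) -i msnbot
acl is_bot hdr_sub(User-Agent) -i googlebot
acl is_bot hdr_sub(User-Agent) -i slurp
use_backend robot_traffic if is_bot
```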


Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-19 Thread Willy Tarreau
Hi,

I have a quick comment for now, before going deep through your config.

On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> As for the logs, it seems that I'll need to look at the configuration for
> HAProxy a bit more to make some adjustments first.  A few months back, I
> know I saw messages indicating the status of server (e.g. 3 active, 2
> backup).

Normally this means that a server is failing to respond to some health checks,
either because it crashed or froze, or because it's overloaded.

What I see is that your "contimeout" is set to 8 seconds and you have no
"timeout queue". In this case, the queue timeout defaults to the contimeout,
which is rather short. It means that when all your servers are saturated, a
request will go to the queue and if no server releases a connection within
8 seconds, the client will get a 503. At least you should add
"timeout queue 80s" to give more chances to your new client requests to get
served within the previous requests' timeout. While this is a very high timer
it might help troubleshoot your issues.

Regards,
Willy




Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-18 Thread Hank A. Paulson

On 5/18/10 7:45 PM, Hank A. Paulson wrote:

I am wondering if the
previous admin set up more than one copy of haproxy and that is why
several services are redirected to the same machine - like glassfish
prod there is no other reference to port 4850 in this config, so what is
running on port 4850? haproxy/apache/heaven forbid - glassfish itself?
netstat -antope | fgrep LIST | fgrep 4850


oops, my bad glassfish prod is on a different server - never mind...

172.16.163.1:4850



Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-18 Thread Hank A. Paulson
You do have a bunch of services that are http mode that don't seem to have any
type of http close. Some I don't understand why they are not http mode, and
they probably should be.


Just a note you may be able to greatly simplify (and possibly speed up) your 
config using the new capabilities for tables of IPs added in 1.4.6.


solr should probably be http mode and anywhere else that you have http mode 
you probably want an http close option turned on.


I am not sure why they chose dispatch for the prod glassfish server, my guess 
is they are running apache and mod_jk or something and then forwarding the 
requests to different glassfish servers - is there really more than one prod
glassfish server? I am wondering if the previous admin set up more than one
copy of haproxy and that is why several services are redirected to the same 
machine - like glassfish prod there is no other reference to port 4850 in this 
config, so what is running on port 4850? haproxy/apache/heaven forbid - 
glassfish itself? netstat -antope | fgrep LIST | fgrep 4850


I think one of the problems is the "inter_server": it doesn't have http mode
set so if more than one hit/request comes in on an open connection then your
request parsing rules are not run on any requests except the first one (as
Willy keeps reminding people). That might work ok for most things since you
are mostly breaking things up by service: liferay goes to the liferay servers,
etc - the problem comes in if you have a portal that people sign into and then 
have a menu/navbar that they can choose different services that should be 
going to different front/backends.
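For a service like solr, that combination might look like the hypothetical listener below (addresses, ports, and server names are made up; "option httpclose" is the 1.3-era close option, and 1.4 adds http-server-close as an alternative):

```
listen solr_cluster 0.0.0.0:8983
    mode    http
    option  httpclose     # close after each request so every request gets parsed and routed
    balance roundrobin
    server  solr01 10.0.0.11:8983 check
    server  solr02 10.0.0.12:8983 check
```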


On 5/18/10 3:49 PM, Chih Yin wrote:



On Mon, May 17, 2010 at 11:11 PM, Hank A. Paulson wrote:

On 5/17/10 10:24 PM, Willy Tarreau wrote:

On Mon, May 17, 2010 at 07:42:03PM -0700, Hank A. Paulson wrote:

I have some sites running a similar set up - Xen domU,
keepalived,
fedora not RHEL and they get 50+ million hits per day with
pretty
fast response. you might want to use the "log separate
errors" (sp?)
option and review those 50X errors carefully, you might see
a pattern
- do you have http-close* in all your configs? That got me
weird, slow
results when I missed it once.


Indeed, that *could* be a possibility if combined with a server
maxconn
because connections would be kept for a long time on the server
(waiting
for either the client or the server to close) and during that
time nobody
else could connect. The typical problem with keep-alive to the
servers in
fact. The 503 could be caused by requests waiting too long in
the queue
then.


My example was just to assure Chih Yin that haproxy on xen should be
able to handle his current load depending, of course, on the
glassfish servers.

I meant some kind of httpclose option
(httpclose/forceclose/http-server-close/etc) turned on regardless of
keep-alive status - you know, like you are always reminding people :)

I noticed when I forgot it on a section (that was not keepalive
related) it caused wacky results - hanging browsers,
images/icons/css not showing up, etc. Obviously it should not affect
single requests like you would assume Akamai would be sending, it
was a pure guess.


Thank you everyone for your feedback.  I really appreciate your help.

Sorry for taking so long to respond.  I had to get permission from my
director to post some of the log data and our haproxy configuration
file.  I also had to hide a bit more of the configuration than was
suggested because of concerns about making the issues we're encountering
too public.  I hope you understand.

 From my research on HAProxy and high availability websites in general,
it seemed to me that compared to other websites, our traffic volume is
actually rather light.  In addition to how we have configured HAProxy
for our infrastructure, I'm definitely also taking a look at our
application servers and our content as well.

I started looking at the log files and the HAProxy configuration file
more closely today.
I attached the (poorly) cleaned HAProxy configuration file.  Looking at
it, I can already see that the httpclose option isn't consistently
included in all the sections, both the frontend and the backend.  I will
make sure this option is in all sections.  Should I also add this to the
global settings for HAProxy?  Is it okay if this option is listed more
than once in a section (I noticed that this happened a couple of times)?


>> Chih Yin, Xani was right, please take a look at your logs. Also,
sending
us your config would help a lot. Replace IP addresses and
passwords with
"XXX" if you want, we'll comment on the rest. BTW you should
tel

Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-18 Thread Chih Yin
2010/5/17 XANi 

>  Dnia 2010-05-17, pon o godzinie 14:45 -0700, Chih Yin pisze:
>
> Hi,
>
>
>
>Please excuse me if the information contained in this email is a bit
> generic.  I'm not the regular administrator, but I've been given the task of
> troubleshooting some issues with my website.  If more details are needed,
> I'll gladly go look for additional information.
>
>
>
>Currently, my website is experiencing a lot of errors and slowness.
>  The server errors I see in the HAProxy log file are mainly 503 errors.  The
> slowness of page loads for my website can be as long as minutes.  I'm trying
> to determine if anyone have had any similar issues with using HAProxy as a
> high availability load balancer.
>
>
>
>HAProxy 1.3.21
>
>CentOS running on Citrix XenServer
>
>HP blades
>
>
>
>There are actually almost 100 virtual servers running on the blades.  A
> good many of the virtual servers are application servers running Glassfish.
>  There are a few servers dedicated to CAS for authentication and access.  I
> have three servers running Rsyslogd for writing HAProxy log data to file.  A
> NetApp filer is used for storage.
>
>
>
>Currently, the website gets about:
>
>
>
>  73,000 pageviews a day
>
>  32,000 unique visitors  a day
>
>  46,000 visit a day
>
>
>
>  3,000 pageviews a hr
>
>  1,300 unique visitors a hr
>
>  1,000 visit a hr
>
>
>
>I am using Akamai to help manage content delivery.
>
>
>
>One of the things Akamai is reporting to me is that they are having
> difficulty requesting content that needs to be refreshed.  Akamai tries up
> to 4 times to get the content with a 2 second timeout to update content
> whose TTL has expired.  After the 4th time, Akamai looks to their own cache
> before returning a 503 error to the user if the content is not available in
> the cache.
>
>
>
>Recently, I've noticed that Akamai is encountering an increasingly
> large number of 503 and 404 errors from my website.  I've traced the 404
> errors to missing images, but I'm not sure what the cause of the 503 errors
> could be.  I had some external resources help me verify that they are able
> to retrieve the content from the Glassfish application servers even when
> HAProxy is reporting the 503 errors.
>
>
>
>One thing I did notice about the HAProxy configuration is that there
> are actually three servers running HAProxy with identical configurations.
>  One serves as the primary high availability load balancer while the other
> two act as failovers.  The keep-alive daemons are configured to accommodate
> that setup.
>
>
>
>From this generic description, is there something in the way this
> architecture is set up or in the configuration of HAProxy that may be
> causing the 503 errors to be reported to Akamai?  As I mentioned, when an
> external resource makes a request for the same content directly from the
> application server, the same errors do not appear to occur.
>
>
>  503 would (usually) mean haproxy sees backends as DOWN; look for messages about
> servers going up/down in haproxy logs.
> Or, if that's not the case, grep the haproxy logs for those 503 errors (make sure
> you're using http log mode), then go to section 8 in
> http://haproxy.1wt.eu/download/1.3/doc/configuration.txt and try to
> determine the exact reason for the error, and/or post a few examples here with
> your haproxy config (with "sensitive" information removed ofc ;) )
>
> Having some kind of monitoring, or at least stats page active is also very
> helpful.
>

Hi Mariusz,

  Thank you for the link.  I'm reviewing the documentation now.  Hopefully,
it'll provide some additional insights for me.  I do see in my configuration
that HAProxy is using httplog, but that's only specified in the front end.
 Is that correct?

The status page for HAProxy is active, and I don't believe we're seeing
servers reported as being down during the times when we encounter the 503
errors.  We are using zabbix to monitor our virtual servers with mixed
results.  I'm currently not involved with the configuration of zabbix, so I
cannot comment on why we don't always get notifications during a time of
increased numbers of 503 errors.

  I've attached a (hopefully) cleaned copy of our HAProxy configuration file
in a separate email.  I didn't want to spam the list with multiple
attachments of the same thing.

Thank you for your help,
C.Y.



>   --
> Mariusz Gronczewski (XANi) 
> GnuPG: 0xEA8ACE64  http://devrandom.pl
>
>


Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-17 Thread Hank A. Paulson

On 5/17/10 10:24 PM, Willy Tarreau wrote:

On Mon, May 17, 2010 at 07:42:03PM -0700, Hank A. Paulson wrote:

I have some sites running a similar set up - Xen domU, keepalived,
fedora not RHEL and they get 50+ million hits per day with pretty
fast response. You might want to use the "log-separate-errors"
option and review those 50X errors carefully, you might see a pattern
- do you have http-close* in all your configs? That got me weird, slow
results when I missed it once.
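The option referred to here is, presumably, `option log-separate-errors`,
which raises the log level of failed requests so syslog can route them to
their own file; double-check the exact name against your version's
configuration.txt. A hedged sketch:

```
defaults
    mode http
    option httplog
    option log-separate-errors  # log failed requests at level "err" instead of "info"
    option httpclose            # close connections after each response
```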


Indeed, that *could* be a possibility if combined with a server maxconn,
because connections would be kept open on the server for a long time (waiting
for either the client or the server to close) and during that time nobody
else could connect. The typical problem with keep-alive to the servers, in
fact. The 503s could then be caused by requests waiting too long in the
queue.


My example was just to assure Chih Yin that haproxy on Xen should be able to
handle his current load, depending, of course, on the Glassfish servers.


I meant some kind of httpclose option 
(httpclose/forceclose/http-server-close/etc) turned on regardless of 
keep-alive status - you know, like you are always reminding people :)
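A sketch of the close-mode options being referred to; only one is normally
enabled at a time, and `http-server-close` requires 1.4 or later (the option
names are real, the rest of this fragment is invented for illustration):

```
frontend www
    bind :80
    option httpclose            # add "Connection: close" to requests and responses
    # option forceclose         # actively close the server-side connection after the response
    # option http-server-close  # (1.4+) close the server side, keep client keep-alive
    default_backend app
```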


I noticed that when I forgot it on a section (one that was not keep-alive
related) it caused wacky results - hanging browsers, images/icons/CSS not
showing up, etc. Obviously it should not affect the single requests you would
assume Akamai sends; it was a pure guess.




Chih Yin, XANi was right, please take a look at your logs. Also, sending
us your config would help a lot. Replace IP addresses and passwords with
"XXX" if you want, we'll comment on the rest. BTW you should tell your
admin that 1.3.21 has an annoying bug which makes it crash when connecting
to the stats socket, which reduces your debugging options. When you have
some time, you should upgrade to 1.3.22 or later (currently 1.3.24), which
fixes a small number of remaining bugs.


example stats page screenshot attached.


Nice stats Hank :-)


That is just the page frames (mostly), not including images, CSS, JS, static
icons, or any other "stuff"; but neither is it just for one day, it covers a
longer period.




Cheers,
Willy





Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-17 Thread Willy Tarreau
On Mon, May 17, 2010 at 07:42:03PM -0700, Hank A. Paulson wrote:
> I have some sites running a similar set up - Xen domU, keepalived,
> fedora not RHEL and they get 50+ million hits per day with pretty
> fast response. You might want to use the "log-separate-errors"
> option and review those 50X errors carefully, you might see a pattern
> - do you have http-close* in all your configs? That got me weird, slow
> results when I missed it once.

Indeed, that *could* be a possibility if combined with a server maxconn,
because connections would be kept open on the server for a long time (waiting
for either the client or the server to close) and during that time nobody
else could connect. The typical problem with keep-alive to the servers, in
fact. The 503s could then be caused by requests waiting too long in the
queue.
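The failure mode described above can be sketched as a backend where a
per-server `maxconn` fills up and queued requests outlive `timeout queue`;
when that timeout expires, haproxy answers the queued request with a 503
(the names and values below are illustrative):

```
backend app
    timeout queue 5s    # a request queued longer than this is answered with a 503
    # at most 50 concurrent connections per server; excess requests wait in the queue
    server app1 10.0.0.1:8080 maxconn 50 check
```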

Chih Yin, XANi was right, please take a look at your logs. Also, sending
us your config would help a lot. Replace IP addresses and passwords with
"XXX" if you want, we'll comment on the rest. BTW you should tell your
admin that 1.3.21 has an annoying bug which makes it crash when connecting
to the stats socket, which reduces your debugging options. When you have
some time, you should upgrade to 1.3.22 or later (currently 1.3.24), which
fixes a small number of remaining bugs.

> example stats page screenshot attached.

Nice stats Hank :-)

Cheers,
Willy




Re: A (Hopefully Not too Generic) Question About HAProxy

2010-05-17 Thread XANi
On Mon, 2010-05-17 at 14:45 -0700, Chih Yin wrote:
> Hi,
> 
> 
>   Please excuse me if the information contained in this email is a bit
> generic.  I'm not the regular administrator, but I've been given the
> task of troubleshooting some issues with my website.  If more details
> are needed, I'll gladly go look for additional information.
> 
> 
>   Currently, my website is experiencing a lot of errors and slowness.
>  The server errors I see in the HAProxy log file are mainly 503
> errors.  The slowness of page loads for my website can be as long as
> minutes.  I'm trying to determine if anyone has had any similar
> issues with using HAProxy as a high availability load balancer.
> 
> 
>   HAProxy 1.3.21
>   CentOS running on Citrix XenServer
>   HP blades
> 
> 
>   There are actually almost 100 virtual servers running on the
> blades.  A good many of the virtual servers are application servers
> running Glassfish.  There are a few servers dedicated to CAS for
> authentication and access.  I have three servers running Rsyslogd for
> writing HAProxy log data to file.  A NetApp filer is used for storage.
> 
> 
>   Currently, the website gets about:
> 
> 
> 73,000 pageviews a day
> 32,000 unique visitors a day
> 46,000 visits a day
> 
> 
> 3,000 pageviews an hour
> 1,300 unique visitors an hour
> 1,000 visits an hour
> 
> 
>   I am using Akamai to help manage content delivery.
> 
> 
>   One of the things Akamai is reporting to me is that they are having
> difficulty requesting content that needs to be refreshed.  Akamai
> tries up to 4 times to get the content with a 2 second timeout to
> update content whose TTL has expired.  After the 4th time, Akamai
> looks to their own cache before returning a 503 error to the user if
> the content is not available in the cache.
> 
> 
>   Recently, I've noticed that Akamai is encountering an increasingly
> large number of 503 and 404 errors from my website.  I've traced the
> 404 errors to missing images, but I'm not sure what the cause of the
> 503 errors could be.  I had some external resources help me verify
> that they are able to retrieve the content from the Glassfish
> application servers even when HAProxy is reporting the 503 errors.
> 
> 
>   One thing I did notice about the HAProxy configuration is that there
> are actually three servers running HAProxy with identical
> configurations.  One serves as the primary high availability load
> balancer while the other two act as failovers.  The keep-alive daemons
> are configured to accommodate that setup.
> 
> 
>   From this generic description, is there something in the way this
> architecture is set up or in the configuration of HAProxy that may be
> causing the 503 errors to be reported to Akamai?  As I mentioned, when
> an external resource makes a request for the same content directly
> from the application server, the same errors do not appear to occur.
> 

503 would (usually) mean haproxy sees the backends as DOWN; look for messages
about servers going up/down in the haproxy logs.
Or, if that's not the case, grep the haproxy logs for those 503 errors (make
sure you're using HTTP log mode), then go to section 8 in
http://haproxy.1wt.eu/download/1.3/doc/configuration.txt and try to
determine the exact reason for each error, and/or post a few examples here
along with your haproxy config (with "sensitive" information removed of course ;) )

Having some kind of monitoring, or at least an active stats page, is also
very helpful.
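As a concrete illustration of the grep step, here is a self-contained sketch;
the log lines are invented but follow the httplog layout, and the four-letter
termination state (`SC--` below) is what section 8 of configuration.txt
decodes:

```shell
# Write two hypothetical httplog-format lines to a scratch file
cat > /tmp/haproxy_sample.log <<'EOF'
May 17 12:14:14 lb1 haproxy[14389]: 10.0.1.2:33317 [17/May/2010:12:14:14.655] www app/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 "GET /index.html HTTP/1.1"
May 17 12:14:15 lb1 haproxy[14389]: 10.0.1.3:33320 [17/May/2010:12:14:15.005] www app/<NOSRV> -1/-1/-1/-1/0 503 212 - - SC-- 0/0/0/0/0 0/0 "GET /page HTTP/1.1"
EOF

# Count the 503 responses (the status code sits right after the timer fields)
grep -c ' 503 ' /tmp/haproxy_sample.log   # prints 1
```

From there, the termination flags on each matching line tell you whether a
given 503 came from a failed server connection, a queue timeout, or haproxy
itself.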

-- 
Mariusz Gronczewski (XANi) 
GnuPG: 0xEA8ACE64
http://devrandom.pl




A (Hopefully Not too Generic) Question About HAProxy

2010-05-17 Thread Chih Yin
Hi,

  Please excuse me if the information contained in this email is a bit
generic.  I'm not the regular administrator, but I've been given the task of
troubleshooting some issues with my website.  If more details are needed,
I'll gladly go look for additional information.

  Currently, my website is experiencing a lot of errors and slowness.  The
server errors I see in the HAProxy log file are mainly 503 errors.  The
slowness of page loads for my website can be as long as minutes.  I'm trying
to determine if anyone has had any similar issues with using HAProxy as a
high availability load balancer.

  HAProxy 1.3.21
  CentOS running on Citrix XenServer
  HP blades

  There are actually almost 100 virtual servers running on the blades.  A
good many of the virtual servers are application servers running Glassfish.
 There are a few servers dedicated to CAS for authentication and access.  I
have three servers running Rsyslogd for writing HAProxy log data to file.  A
NetApp filer is used for storage.

  Currently, the website gets about:

73,000 pageviews a day
32,000 unique visitors a day
46,000 visits a day

3,000 pageviews an hour
1,300 unique visitors an hour
1,000 visits an hour

  I am using Akamai to help manage content delivery.

  One of the things Akamai is reporting to me is that they are having
difficulty requesting content that needs to be refreshed.  Akamai tries up
to 4 times to get the content with a 2 second timeout to update content
whose TTL has expired.  After the 4th time, Akamai looks to their own cache
before returning a 503 error to the user if the content is not available in
the cache.

  Recently, I've noticed that Akamai is encountering an increasingly large
number of 503 and 404 errors from my website.  I've traced the 404 errors to
missing images, but I'm not sure what the cause of the 503 errors could be.
 I had some external resources help me verify that they are able to retrieve
the content from the Glassfish application servers even when HAProxy is
reporting the 503 errors.

  One thing I did notice about the HAProxy configuration is that there are
actually three servers running HAProxy with identical configurations.  One
serves as the primary high availability load balancer while the other two
act as failovers.  The keep-alive daemons are configured to accommodate that
setup.

  From this generic description, is there something in the way this
architecture is set up or in the configuration of HAProxy that may be
causing the 503 errors to be reported to Akamai?  As I mentioned, when an
external resource makes a request for the same content directly from the
application server, the same errors do not appear to occur.

thanks,
C.Y.