Re: A (Hopefully Not too Generic) Question About HAProxy
On Wed, May 19, 2010 at 9:43 PM, Willy Tarreau wrote:
> On Wed, May 19, 2010 at 04:49:02PM -0700, Chih Yin wrote:
> > Hi Mariusz,
> >
> > On Wed, May 19, 2010 at 2:18 PM, Mariusz Gronczewski wrote:
> > > One more thing about config, u dont need to do
> > >
> > >   acl is_msn01 hdr_sub(X-Forwarded-For) 64.4.0
> > >   acl is_msn02 hdr_sub(X-Forwarded-For) 64.4.1
> > >   acl is_msn03 hdr_sub(X-Forwarded-For) 64.4.2
> > >
> > > and then
> > >
> > >   use_backend robot_traffic if is_msn01 or is_msn02 or is_msn03
> > >
> > > u can just do
> > >
> > >   acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
> > >   acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
> > >   acl is_msn hdr_sub(X-Forwarded-For) 64.4.2
> > >
> > > and then
> > >
> > >   use_backend robot_traffic if is_msn
> > >
> > > ACLs with the same name are automatically ORed together.
> > >
> > > or better yet, match bots by user-agent not by IP
> > > http://www.useragentstring.com/pages/useragentstring.php
> >
> > Thank you so much. This is definitely helpful!
>
> Also, since 1.3.21 you have the "hdr_ip" ACL which can parse IP addresses from headers. What that means is that instead of doing sub-string matching, you can match networks, which is faster and allows globbing. For instance:
>
>   acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
>   acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
>
> can be replaced by:
>
>   acl is_msn hdr_ip(X-Forwarded-For) 64.4.0.0/15
>
> And with 1.4.6, you'll even be able to fill all known networks in a file and load them in one line:
>
>   acl is_msn hdr_ip(X-Forwarded-For) -f /etc/haproxy/msn_networks.txt

Thank you for the suggestion, Willy. I will definitely give this a try as well.

C.Y.

> Willy
Re: A (Hopefully Not too Generic) Question About HAProxy
Hi Willy,

On Wed, May 19, 2010 at 9:39 PM, Willy Tarreau wrote:
> Hi Chih Yin,
>
> On Wed, May 19, 2010 at 04:47:00PM -0700, Chih Yin wrote:
> > > On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> > > > As for the logs, it seems that I'll need to look at the configuration for HAProxy a bit more to make some adjustments first. A few months back, I know I saw messages indicating the status of servers (e.g. 3 active, 2 backup).
> > >
> > > Normally this means that a server is failing to respond to some health checks, either because it crashed or froze, or because it's overloaded.
> >
> > Wow. I'm growing concerned by this. What I've noticed is that these messages were encountered almost daily for almost a year, but have disappeared since we migrated to the blade servers. The disconcerting part is that since we made that migration, all indications are that the virtual servers have been less reliable than before. Yet, I haven't seen these messages at all.
>
> And most likely it is because you don't have a separate log anymore that you don't see the messages. Please try a simple test on your logs: look for messages "Server xxx/yyy is UP" (or DOWN). In practice it's enough to look for the 'is' word surrounded with spaces:
>
>   $ fgrep ' is ' haproxy.log
>
> You can even check for messages indicating that you have lost your last server:
>
>   $ fgrep ' has no server ' haproxy.log
>
> If your logs have not been filtered out, you should find these events.

I ran both commands and did not get any results. It would seem that I need to search for other locations where this information might be kept.

> > > What I see is that your "contimeout" is set to 8 seconds and you have no "timeout queue". In this case, the queue timeout defaults to the contimeout, which is rather short. It means that when all your servers are saturated, a request will go to the queue, and if no server releases a connection within 8 seconds, the client will get a 503. At least you should add "timeout queue 80s" to give more chances to your new client requests to get served within the previous requests' timeout. While this is a very high timer, it might help troubleshoot your issues.
> >
> > I guess I'm a bit confused. In the configuration file, I see the following in the defaults section:
> >
> >   defaults
> >       mode http
> >       maxconn 1024
> >       contimeout 8000
> >       clitimeout 80000
> >       srvtimeout 80000
> >       timeout queue 50000
>
> Ah yes, sorry about that, I missed it when quickly reviewing your config. Maybe because of the mixed syntax. So that means that your users will wait up to 50s in the queue, which should be more than enough. So most likely the 503s are only caused by cases where you don't have any remaining server up.
>
> One important point I've just noticed: you don't have "option abortonclose". You should definitely have it with timeouts that long, because there are high chances that most users won't wait that long, or will click the reload button while their request is in the queue. With that option enabled, the old pending request will be aborted if the user clicks stop or reload. This is important, otherwise you could get a lot of requests in the queue if the guy clicks reload 10 times in a row.

Thank you. I have a feeling this will be a very helpful suggestion. From what I observe of how our internal users behave on the website, this change will have a tremendous impact.

> > Am I misunderstanding and looking at the wrong spot? Also, is there a standard timeout for the queue that is reasonable, or would this be a value that varies from website to website?
>
> It varies from site to site, and should reflect the maximum time you think a user will accept to wait. But a good guess is to use the same value as the server timeout, because it should also be set to the maximum time a user will accept to wait :-)

At this point, I'm very grateful for our users, most of whom seem to have infinite patience. :)

> But you should be aware that 50 or 80 seconds are extremely long. Some sites require such large timeouts for a very specific request which can take a long time, but your average request time should be below the hundreds of milliseconds for dynamic objects and around the millisecond for static objects. I suggest that you pass "halog -pct" on your logs; it will show you how your response times are spread.

I am trying to think of possible reasons that the timeout was set to 50 and 80 seconds. The only thing I can think of is that there is a lot of inter-server traffic occurring to respond to some of the requests. Maybe the timeout was set so that for some of this content, the initial request will not time out while waiting for the back-end servers to
Re: A (Hopefully Not too Generic) Question About HAProxy
On Wed, May 19, 2010 at 04:49:02PM -0700, Chih Yin wrote:
> Hi Mariusz,
>
> On Wed, May 19, 2010 at 2:18 PM, Mariusz Gronczewski wrote:
> > One more thing about config, u dont need to do
> >
> >   acl is_msn01 hdr_sub(X-Forwarded-For) 64.4.0
> >   acl is_msn02 hdr_sub(X-Forwarded-For) 64.4.1
> >   acl is_msn03 hdr_sub(X-Forwarded-For) 64.4.2
> >
> > and then
> >
> >   use_backend robot_traffic if is_msn01 or is_msn02 or is_msn03
> >
> > u can just do
> >
> >   acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
> >   acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
> >   acl is_msn hdr_sub(X-Forwarded-For) 64.4.2
> >
> > and then
> >
> >   use_backend robot_traffic if is_msn
> >
> > ACLs with the same name are automatically ORed together.
> >
> > or better yet, match bots by user-agent not by IP
> > http://www.useragentstring.com/pages/useragentstring.php
>
> Thank you so much. This is definitely helpful!

Also, since 1.3.21 you have the "hdr_ip" ACL which can parse IP addresses from headers. What that means is that instead of doing sub-string matching, you can match networks, which is faster and allows globbing. For instance:

  acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
  acl is_msn hdr_sub(X-Forwarded-For) 64.4.1

can be replaced by:

  acl is_msn hdr_ip(X-Forwarded-For) 64.4.0.0/15

And with 1.4.6, you'll even be able to fill all known networks in a file and load them in one line:

  acl is_msn hdr_ip(X-Forwarded-For) -f /etc/haproxy/msn_networks.txt

Willy
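[Editor's note] For readers adapting Willy's suggestion, a frontend combining the two techniques might look like the sketch below. The frontend and backend names, the file path, and the default backend are illustrative, not taken from the thread:

```haproxy
frontend www
    bind :80
    # hdr_ip parses the X-Forwarded-For value as an IP address,
    # so whole source networks can be matched in CIDR notation
    # instead of sub-string matching on the raw header text.
    acl is_msn hdr_ip(X-Forwarded-For) 64.4.0.0/15

    # From 1.4.6 on, the known networks can instead live in a
    # file, one network per line, loaded by a single ACL line:
    # acl is_msn hdr_ip(X-Forwarded-For) -f /etc/haproxy/msn_networks.txt

    use_backend robot_traffic if is_msn
    default_backend app_servers
```

One caveat: hdr_ip only matches when the header actually carries a client IP, so requests arriving without X-Forwarded-For simply fall through to the default backend.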
Re: A (Hopefully Not too Generic) Question About HAProxy
Hi Chih Yin,

On Wed, May 19, 2010 at 04:47:00PM -0700, Chih Yin wrote:
> > On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> > > As for the logs, it seems that I'll need to look at the configuration for HAProxy a bit more to make some adjustments first. A few months back, I know I saw messages indicating the status of servers (e.g. 3 active, 2 backup).
> >
> > Normally this means that a server is failing to respond to some health checks, either because it crashed or froze, or because it's overloaded.
>
> Wow. I'm growing concerned by this. What I've noticed is that these messages were encountered almost daily for almost a year, but have disappeared since we migrated to the blade servers. The disconcerting part is that since we made that migration, all indications are that the virtual servers have been less reliable than before. Yet, I haven't seen these messages at all.

And most likely it is because you don't have a separate log anymore that you don't see the messages. Please try a simple test on your logs: look for messages "Server xxx/yyy is UP" (or DOWN). In practice it's enough to look for the 'is' word surrounded with spaces:

  $ fgrep ' is ' haproxy.log

You can even check for messages indicating that you have lost your last server:

  $ fgrep ' has no server ' haproxy.log

If your logs have not been filtered out, you should find these events.

> > What I see is that your "contimeout" is set to 8 seconds and you have no "timeout queue". In this case, the queue timeout defaults to the contimeout, which is rather short. It means that when all your servers are saturated, a request will go to the queue, and if no server releases a connection within 8 seconds, the client will get a 503. At least you should add "timeout queue 80s" to give more chances to your new client requests to get served within the previous requests' timeout. While this is a very high timer, it might help troubleshoot your issues.
>
> I guess I'm a bit confused. In the configuration file, I see the following in the defaults section:
>
>   defaults
>       mode http
>       maxconn 1024
>       contimeout 8000
>       clitimeout 80000
>       srvtimeout 80000
>       timeout queue 50000

Ah yes, sorry about that, I missed it when quickly reviewing your config. Maybe because of the mixed syntax. So that means that your users will wait up to 50s in the queue, which should be more than enough. So most likely the 503s are only caused by cases where you don't have any remaining server up.

One important point I've just noticed: you don't have "option abortonclose". You should definitely have it with timeouts that long, because there are high chances that most users won't wait that long, or will click the reload button while their request is in the queue. With that option enabled, the old pending request will be aborted if the user clicks stop or reload. This is important, otherwise you could get a lot of requests in the queue if the guy clicks reload 10 times in a row.

> Am I misunderstanding and looking at the wrong spot? Also, is there a standard timeout for the queue that is reasonable, or would this be a value that varies from website to website?

It varies from site to site, and should reflect the maximum time you think a user will accept to wait. But a good guess is to use the same value as the server timeout, because it should also be set to the maximum time a user will accept to wait :-)

But you should be aware that 50 or 80 seconds are extremely long. Some sites require such large timeouts for a very specific request which can take a long time, but your average request time should be below the hundreds of milliseconds for dynamic objects and around the millisecond for static objects. I suggest that you pass "halog -pct" on your logs; it will show you how your response times are spread.

Regards,
Willy
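[Editor's note] Putting Willy's advice together, the defaults section under discussion would end up looking something like the sketch below. The timeout values come from the thread itself; treat the layout and comments as illustrative rather than the poster's actual configuration:

```haproxy
defaults
    mode http
    maxconn 1024
    # Time allowed to establish a TCP connection to a server: 8s.
    # This is the old 1.3 syntax; "timeout connect 8s" is equivalent.
    contimeout 8000
    clitimeout 80000
    srvtimeout 80000
    # A request may sit in the queue up to 50s waiting for a free
    # server slot before the client receives a 503.
    timeout queue 50000
    # Drop a queued request as soon as the client gives up, so that
    # impatient reloads don't pile duplicate requests into the queue.
    option abortonclose
```

Mixing the legacy `contimeout`/`clitimeout`/`srvtimeout` keywords with the newer `timeout queue` form, as this config does, is exactly the "mixed syntax" Willy says made the file easy to misread.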
Re: A (Hopefully Not too Generic) Question About HAProxy
Hi Mariusz,

On Wed, May 19, 2010 at 2:18 PM, Mariusz Gronczewski wrote:
> One more thing about config, u dont need to do
>
>   acl is_msn01 hdr_sub(X-Forwarded-For) 64.4.0
>   acl is_msn02 hdr_sub(X-Forwarded-For) 64.4.1
>   acl is_msn03 hdr_sub(X-Forwarded-For) 64.4.2
>
> and then
>
>   use_backend robot_traffic if is_msn01 or is_msn02 or is_msn03
>
> u can just do
>
>   acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
>   acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
>   acl is_msn hdr_sub(X-Forwarded-For) 64.4.2
>
> and then
>
>   use_backend robot_traffic if is_msn
>
> ACLs with the same name are automatically ORed together.
>
> or better yet, match bots by user-agent not by IP
> http://www.useragentstring.com/pages/useragentstring.php

Thank you so much. This is definitely helpful!

C.Y.
Re: A (Hopefully Not too Generic) Question About HAProxy
Hi Willy,

On Wed, May 19, 2010 at 5:16 AM, Willy Tarreau wrote:
> Hi,
>
> I have a quick comment for now, before going deep through your config.
>
> On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote:
> > As for the logs, it seems that I'll need to look at the configuration for HAProxy a bit more to make some adjustments first. A few months back, I know I saw messages indicating the status of servers (e.g. 3 active, 2 backup).
>
> Normally this means that a server is failing to respond to some health checks, either because it crashed or froze, or because it's overloaded.

Wow. I'm growing concerned by this. What I've noticed is that these messages were encountered almost daily for almost a year, but have disappeared since we migrated to the blade servers. The disconcerting part is that since we made that migration, all indications are that the virtual servers have been less reliable than before. Yet, I haven't seen these messages at all.

> What I see is that your "contimeout" is set to 8 seconds and you have no "timeout queue". In this case, the queue timeout defaults to the contimeout, which is rather short. It means that when all your servers are saturated, a request will go to the queue, and if no server releases a connection within 8 seconds, the client will get a 503. At least you should add "timeout queue 80s" to give more chances to your new client requests to get served within the previous requests' timeout. While this is a very high timer, it might help troubleshoot your issues.

I guess I'm a bit confused. In the configuration file, I see the following in the defaults section:

  defaults
      mode http
      maxconn 1024
      contimeout 8000
      clitimeout 80000
      srvtimeout 80000
      timeout queue 50000

Am I misunderstanding and looking at the wrong spot? Also, is there a standard timeout for the queue that is reasonable, or would this be a value that varies from website to website?

Thank you,
C.Y.

> Regards,
> Willy
Re: A (Hopefully Not too Generic) Question About HAProxy
Hi Hank,

On Tue, May 18, 2010 at 7:45 PM, Hank A. Paulson <h...@spamproof.nospammail.net> wrote:
> You do have a bunch of services that are in http mode that don't seem to have any type of http close. Some I don't understand why they are not http mode, and they probably should be.
>
> Just a note: you may be able to greatly simplify (and possibly speed up) your config using the new capabilities for tables of IPs added in 1.4.6.
>
> solr should probably be http mode, and anywhere else that you have http mode you probably want an http close option turned on.
>
> I am not sure why they chose dispatch for the prod glassfish server; my guess is they are running apache and mod_jk or something and then forwarding the requests to different glassfish servers - are there really more than one prod glassfish servers? I am wondering if the previous admin set up more than one copy of haproxy and that is why several services are redirected to the same machine - like glassfish prod. There is no other reference to port 4850 in this config, so what is running on port 4850? haproxy/apache/heaven forbid - glassfish itself?
>
>   netstat -antope | fgrep LIST | fgrep 4850
>
> I think one of the problems is the "inter_server": it doesn't have http mode set, so if more than one hit/request comes in on an open connection then your request parsing rules are not run on any requests except the first one (as Willy keeps reminding people). That might work ok for most things since you are mostly breaking things up by service - liferay goes to the liferay servers, etc. The problem comes in if you have a portal that people sign into and then have a menu/navbar from which they can choose different services that should be going to different front/backends.

Thank you for these suggestions. I also saw your follow-up email. I will start applying and testing the changes you recommended. I think your comments on the "inter_server" are probably close to the truth. I'm definitely seeing some unexplained behavior from our website when users are utilizing different services. I've also made the recommendation to my director to evaluate version 1.4.6 while we plan the upgrade from 1.3.21 to 1.3.22.

> On 5/18/10 3:49 PM, Chih Yin wrote:
> > On Mon, May 17, 2010 at 11:11 PM, Hank A. Paulson <h...@spamproof.nospammail.net> wrote:
> > > On 5/17/10 10:24 PM, Willy Tarreau wrote:
> > > > On Mon, May 17, 2010 at 07:42:03PM -0700, Hank A. Paulson wrote:
> > > > > I have some sites running a similar set up - Xen domU, keepalived, fedora not RHEL - and they get 50+ million hits per day with pretty fast response. You might want to use the "log separate errors" (sp?) option and review those 50X errors carefully; you might see a pattern. Do you have http-close* in all your configs? That got me weird, slow results when I missed it once.
> > > >
> > > > Indeed, that *could* be a possibility if combined with a server maxconn, because connections would be kept for a long time on the server (waiting for either the client or the server to close) and during that time nobody else could connect. The typical problem with keep-alive to the servers in fact. The 503s could be caused by requests waiting too long in the queue then.
> > >
> > > My example was just to assure Chih Yin that haproxy on xen should be able to handle his current load, depending, of course, on the glassfish servers.
> > >
> > > I meant some kind of httpclose option (httpclose/forceclose/http-server-close/etc) turned on regardless of keep-alive status - you know, like you are always reminding people :)
> > >
> > > I noticed when I forgot it on a section (that was not keepalive related) it caused wacky results - hanging browsers, images/icons/css not showing up, etc. Obviously it should not affect single requests like you would assume Akamai would be sending; it was a pure guess.
> >
> > Thank you everyone for your feedback. I really appreciate your help.
> >
> > Sorry for taking so long to respond. I had to get permission from my director to post some of the log data and our haproxy configuration file. I also had to hide a bit more of the configuration than was suggested because of concerns about making the issues we're encountering too public. I hope you understand.
> >
> > From my research on HAProxy and high availability websites in general, it seemed to me that compared to other websites, our traffic volume is actually rather light. In addition to how we have configured HAProxy for our infrastructure, I'm definitely also taking a look at our application servers and our content as well.
> >
> > I started looking at the log files and the HAProxy
Re: A (Hopefully Not too Generic) Question About HAProxy
One more thing about config, u dont need to do

  acl is_msn01 hdr_sub(X-Forwarded-For) 64.4.0
  acl is_msn02 hdr_sub(X-Forwarded-For) 64.4.1
  acl is_msn03 hdr_sub(X-Forwarded-For) 64.4.2

and then

  use_backend robot_traffic if is_msn01 or is_msn02 or is_msn03

u can just do

  acl is_msn hdr_sub(X-Forwarded-For) 64.4.0
  acl is_msn hdr_sub(X-Forwarded-For) 64.4.1
  acl is_msn hdr_sub(X-Forwarded-For) 64.4.2

and then

  use_backend robot_traffic if is_msn

ACLs with the same name are automatically ORed together.

or better yet, match bots by user-agent not by IP
http://www.useragentstring.com/pages/useragentstring.php
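[Editor's note] A sketch of the user-agent approach Mariusz mentions. The ACL name, backend names, and the particular substrings are illustrative - check the actual tokens your bots send against a list such as the one at the URL above:

```haproxy
frontend www
    bind :80
    # Case-insensitive (-i) substring match on the User-Agent
    # header; "msnbot" was the MSN/Bing crawler token at the time.
    acl is_bot hdr_sub(User-Agent) -i msnbot
    acl is_bot hdr_sub(User-Agent) -i googlebot
    # Same-named ACLs are ORed, so one routing rule covers both.
    use_backend robot_traffic if is_bot
    default_backend app_servers
```

Matching by User-Agent survives the crawler changing source networks, at the cost of trusting a header that any client can forge.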
Re: A (Hopefully Not too Generic) Question About HAProxy
Hi, I have a quick comment for now, before going deep through your config. On Tue, May 18, 2010 at 03:49:57PM -0700, Chih Yin wrote: > As for the logs, it seems that I'll need to look at the configuration for > HAProxy a bit more to make some adjustments first. A few months back, I > know I saw messages indicating the status of server (e.g. 3 active, 2 > backup). Normally this means that a server is failing to respond to some health checks, either because it crashed or froze, or because it's overloaded. What I see is that your "contimeout" is set to 8 seconds and you have no "timeout queue". In this case, the queue timeout defaults to the contimeout, which is rather short. It means that when all your servers are saturated, a request will go to the queue and if no server releases a connection within 8 seconds, the client will get a 503. At least you should add "timeout queue 80s" to give more chances to your new client requests to get served within the previous requests' timeout. While this is a very high timer it might help troubleshoot your issues. Regards, Willy
Re: A (Hopefully Not too Generic) Question About HAProxy
On 5/18/10 7:45 PM, Hank A. Paulson wrote:
> I am wondering if the previous admin set up more than one copy of haproxy and that is why several services are redirected to the same machine - like glassfish prod. There is no other reference to port 4850 in this config, so what is running on port 4850? haproxy/apache/heaven forbid - glassfish itself?
>
>   netstat -antope | fgrep LIST | fgrep 4850

oops, my bad - glassfish prod is on a different server - never mind... 172.16.163.1:4850
Re: A (Hopefully Not too Generic) Question About HAProxy
You do have a bunch of services that are in http mode that don't seem to have any type of http close. Some I don't understand why they are not http mode, and they probably should be.

Just a note: you may be able to greatly simplify (and possibly speed up) your config using the new capabilities for tables of IPs added in 1.4.6.

solr should probably be http mode, and anywhere else that you have http mode you probably want an http close option turned on.

I am not sure why they chose dispatch for the prod glassfish server; my guess is they are running apache and mod_jk or something and then forwarding the requests to different glassfish servers - are there really more than one prod glassfish servers? I am wondering if the previous admin set up more than one copy of haproxy and that is why several services are redirected to the same machine - like glassfish prod. There is no other reference to port 4850 in this config, so what is running on port 4850? haproxy/apache/heaven forbid - glassfish itself?

  netstat -antope | fgrep LIST | fgrep 4850

I think one of the problems is the "inter_server": it doesn't have http mode set, so if more than one hit/request comes in on an open connection then your request parsing rules are not run on any requests except the first one (as Willy keeps reminding people). That might work ok for most things since you are mostly breaking things up by service - liferay goes to the liferay servers, etc. The problem comes in if you have a portal that people sign into and then have a menu/navbar from which they can choose different services that should be going to different front/backends.

On 5/18/10 3:49 PM, Chih Yin wrote:
> On Mon, May 17, 2010 at 11:11 PM, Hank A. Paulson <h...@spamproof.nospammail.net> wrote:
> > On 5/17/10 10:24 PM, Willy Tarreau wrote:
> > > On Mon, May 17, 2010 at 07:42:03PM -0700, Hank A. Paulson wrote:
> > > > I have some sites running a similar set up - Xen domU, keepalived, fedora not RHEL - and they get 50+ million hits per day with pretty fast response. You might want to use the "log separate errors" (sp?) option and review those 50X errors carefully; you might see a pattern. Do you have http-close* in all your configs? That got me weird, slow results when I missed it once.
> > >
> > > Indeed, that *could* be a possibility if combined with a server maxconn, because connections would be kept for a long time on the server (waiting for either the client or the server to close) and during that time nobody else could connect. The typical problem with keep-alive to the servers in fact. The 503s could be caused by requests waiting too long in the queue then.
> >
> > My example was just to assure Chih Yin that haproxy on xen should be able to handle his current load, depending, of course, on the glassfish servers.
> >
> > I meant some kind of httpclose option (httpclose/forceclose/http-server-close/etc) turned on regardless of keep-alive status - you know, like you are always reminding people :)
> >
> > I noticed when I forgot it on a section (that was not keepalive related) it caused wacky results - hanging browsers, images/icons/css not showing up, etc. Obviously it should not affect single requests like you would assume Akamai would be sending; it was a pure guess.
>
> Thank you everyone for your feedback. I really appreciate your help.
>
> Sorry for taking so long to respond. I had to get permission from my director to post some of the log data and our haproxy configuration file. I also had to hide a bit more of the configuration than was suggested because of concerns about making the issues we're encountering too public. I hope you understand.
>
> From my research on HAProxy and high availability websites in general, it seemed to me that compared to other websites, our traffic volume is actually rather light. In addition to how we have configured HAProxy for our infrastructure, I'm definitely also taking a look at our application servers and our content as well.
>
> I started looking at the log files and the HAProxy configuration file more closely today. I attached the (poorly) cleaned HAProxy configuration file. Looking at it, I can already see that the httpclose option isn't consistently included in all the sections, both the frontend and the backend. I will make sure this option is in all sections. Should I also add this to the global settings for HAProxy? Is it okay if this option is listed more than once in a section (I noticed that this happened a couple of times)?
>
> > Chih Yin, Xani was right, please take a look at your logs. Also, sending us your config would help a lot. Replace IP addresses and passwords with "XXX" if you want, we'll comment on the rest. BTW you should tell
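[Editor's note] Hank's point about missing http close options can be sketched as follows; the section name "inter_server" is from the thread, but the addresses and server names are illustrative. On the 1.3.x branch under discussion, an http-mode section without a close option only parses the first request of a keep-alive connection, so content-switching rules silently stop applying to follow-up requests:

```haproxy
listen inter_server
    bind :8080
    # Without "mode http" plus a close option, only the first
    # request on a kept-alive connection goes through the request
    # parsing rules; later requests on the same connection bypass
    # them entirely.
    mode http
    option httpclose
    balance roundrobin
    server app1 10.0.0.11:8080 check
    server app2 10.0.0.12:8080 check
```

This also bears on Chih Yin's two questions above: `option httpclose` is accepted in a defaults section (which applies it to every http-mode proxy that follows), and repeating it inside a single section is redundant but, to my knowledge, harmless.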
Re: A (Hopefully Not too Generic) Question About HAProxy
2010/5/17 XANi:
> Dnia 2010-05-17, pon o godzinie 14:45 -0700, Chih Yin pisze:
> > Hi,
> >
> > Please excuse me if the information contained in this email is a bit generic. I'm not the regular administrator, but I've been given the task of troubleshooting some issues with my website. If more details are needed, I'll gladly go look for additional information.
> >
> > Currently, my website is experiencing a lot of errors and slowness. The server errors I see in the HAProxy log file are mainly 503 errors. The slowness of page loads for my website can be as long as minutes. I'm trying to determine if anyone has had any similar issues with using HAProxy as a high availability load balancer.
> >
> > HAProxy 1.3.21
> > CentOS running on Citrix XenServer
> > HP blades
> >
> > There are actually almost 100 virtual servers running on the blades. A good many of the virtual servers are application servers running Glassfish. There are a few servers dedicated to CAS for authentication and access. I have three servers running Rsyslogd for writing HAProxy log data to file. A NetApp filer is used for storage.
> >
> > Currently, the website gets about:
> >
> >   73,000 pageviews a day
> >   32,000 unique visitors a day
> >   46,000 visits a day
> >
> >   3,000 pageviews an hr
> >   1,300 unique visitors an hr
> >   1,000 visits an hr
> >
> > I am using Akamai to help manage content delivery.
> >
> > One of the things Akamai is reporting to me is that they are having difficulty requesting content that needs to be refreshed. Akamai tries up to 4 times, with a 2 second timeout, to get content whose TTL has expired. After the 4th time, Akamai looks to their own cache before returning a 503 error to the user if the content is not available in the cache.
> >
> > Recently, I've noticed that Akamai is encountering an increasingly large number of 503 and 404 errors from my website. I've traced the 404 errors to missing images, but I'm not sure what the cause of the 503 errors could be. I had some external resources help me verify that they are able to retrieve the content from the Glassfish application servers even when HAProxy is reporting the 503 errors.
> >
> > One thing I did notice about the HAProxy configuration is that there are actually three servers running HAProxy with identical configurations. One serves as the primary high availability load balancer while the other two act as failovers. The keep-alive daemons are configured to accommodate that setup.
> >
> > From this generic description, is there something in the way this architecture is set up, or in the configuration of HAProxy, that may be causing the 503 errors to be reported to Akamai? As I mentioned, when an external resource makes a request for the same content directly from the application server, the same errors do not appear to occur.
>
> 503 would (usually) mean haproxy sees backends as DOWN; look for msgs about servers going up/down in haproxy logs. Or, if thats not the case, grep haproxy logs for those 503 errors (make sure ure using http log mode), then go to section 8 in http://haproxy.1wt.eu/download/1.3/doc/configuration.txt and try to determine what was the exact reason for the error, and/or post a few examples here along with your haproxy config (with "sensitive" information removed ofc ;) )
>
> Having some kind of monitoring, or at least a stats page active, is also very helpful.

Hi Mariusz,

Thank you for the link. I'm reviewing the documentation now. Hopefully, it'll provide some additional insights for me. I do see in my configuration that HAProxy is using httplog, but that's only specified in the front end. Is that correct?

The status page for HAProxy is active, and I don't believe we're seeing servers reported as being down during the times when we encounter the 503 errors. We are using zabbix to monitor our virtual servers with mixed results. I'm currently not involved with the configuration of zabbix, so I cannot comment on why we don't always get notifications during a time of increased numbers of 503 errors.

I've attached a (hopefully) cleaned copy of our HAProxy configuration file in a separate email. I didn't want to spam the list with multiple attachments of the same thing.

Thank you for your help,
C.Y.

> --
> Mariusz Gronczewski (XANi)
> GnuPG: 0xEA8ACE64
> http://devrandom.pl
Re: A (Hopefully Not too Generic) Question About HAProxy
On 5/17/10 10:24 PM, Willy Tarreau wrote:
> On Mon, May 17, 2010 at 07:42:03PM -0700, Hank A. Paulson wrote:
>> I have some sites running a similar set up - Xen domU, keepalived, fedora not RHEL - and they get 50+ million hits per day with pretty fast response. You might want to use the "log separate errors" (sp?) option and review those 50X errors carefully; you might see a pattern. Do you have http-close* in all your configs? That got me weird, slow results when I missed it once.
>
> Indeed, that *could* be a possibility if combined with a server maxconn, because connections would be kept open on the server for a long time (waiting for either the client or the server to close) and during that time nobody else could connect - the typical problem with keep-alive to the servers, in fact. The 503s could then be caused by requests waiting too long in the queue.

My example was just to assure Chih Yin that haproxy on Xen should be able to handle his current load, depending, of course, on the Glassfish servers.

I meant some kind of httpclose option (httpclose/forceclose/http-server-close/etc.) turned on regardless of keep-alive status - you know, like you are always reminding people :) I noticed that when I forgot it in a section (one that was not keep-alive related) it caused wacky results - hanging browsers, images/icons/css not showing up, etc. Obviously it should not affect single requests like you would assume Akamai would be sending; it was a pure guess.

> Chih Yin, XANi was right, please take a look at your logs. Also, sending us your config would help a lot. Replace IP addresses and passwords with "XXX" if you want, we'll comment on the rest. BTW, you should tell your admin that 1.3.21 has an annoying bug which makes it crash when connecting to the stats socket, which reduces your debugging options. When you have some time, you should upgrade to 1.3.22 or later (currently 1.3.24), which fixes a small number of remaining bugs.
>
>> example stats page screenshot attached.
>
> Nice stats Hank :-)

That is just the page frames (mostly), not including images, css, js, static icons or any other "stuff" - but neither is it just for one day; it covers a longer period.

> Cheers,
> Willy
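For readers following along, the two options discussed in this exchange (forcing connection close and separating error logs) sit in the haproxy configuration roughly as below. This is a minimal illustrative sketch, not Chih Yin's actual setup: the section name, addresses, and maxconn value are invented, and "log separate errors" is spelled `option log-separate-errors` in the configuration language.

```haproxy
defaults
    mode http
    option httplog
    option httpclose               # close both sides after each request/response,
                                   # avoiding the keep-alive pile-up described above
    option log-separate-errors     # raise the log level of failed requests so
                                   # 4xx/5xx lines are easy to split out and review

listen www XXX.XXX.XXX.XXX:80
    balance roundrobin
    # With a per-server maxconn, excess requests wait in haproxy's queue;
    # requests that wait longer than the queue timeout are answered with a 503.
    server app01 XXX.XXX.XXX.XXX:8080 check maxconn 50
```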
Re: A (Hopefully Not too Generic) Question About HAProxy
On Mon, May 17, 2010 at 07:42:03PM -0700, Hank A. Paulson wrote:
> I have some sites running a similar set up - Xen domU, keepalived, fedora not RHEL - and they get 50+ million hits per day with pretty fast response. You might want to use the "log separate errors" (sp?) option and review those 50X errors carefully; you might see a pattern. Do you have http-close* in all your configs? That got me weird, slow results when I missed it once.

Indeed, that *could* be a possibility if combined with a server maxconn, because connections would be kept open on the server for a long time (waiting for either the client or the server to close) and during that time nobody else could connect - the typical problem with keep-alive to the servers, in fact. The 503s could then be caused by requests waiting too long in the queue.

Chih Yin, XANi was right, please take a look at your logs. Also, sending us your config would help a lot. Replace IP addresses and passwords with "XXX" if you want, we'll comment on the rest.

BTW, you should tell your admin that 1.3.21 has an annoying bug which makes it crash when connecting to the stats socket, which reduces your debugging options. When you have some time, you should upgrade to 1.3.22 or later (currently 1.3.24), which fixes a small number of remaining bugs.

> example stats page screenshot attached.

Nice stats Hank :-)

Cheers,
Willy
Re: A (Hopefully Not too Generic) Question About HAProxy
On Mon, 2010-05-17 at 14:45 -0700, Chih Yin wrote:
> Hi,
>
> Please excuse me if the information contained in this email is a bit generic. I'm not the regular administrator, but I've been given the task of troubleshooting some issues with my website. If more details are needed, I'll gladly go look for additional information.
>
> Currently, my website is experiencing a lot of errors and slowness. The server errors I see in the HAProxy log file are mainly 503 errors. Page loads for my website can take as long as minutes. I'm trying to determine if anyone has had similar issues using HAProxy as a high availability load balancer.
>
> HAProxy 1.3.21
> CentOS running on Citrix XenServer
> HP blades
>
> There are actually almost 100 virtual servers running on the blades. A good many of the virtual servers are application servers running Glassfish. There are a few servers dedicated to CAS for authentication and access. I have three servers running Rsyslogd for writing HAProxy log data to file. A NetApp filer is used for storage.
>
> Currently, the website gets about:
>
> 73,000 pageviews a day
> 32,000 unique visitors a day
> 46,000 visits a day
>
> 3,000 pageviews an hour
> 1,300 unique visitors an hour
> 1,000 visits an hour
>
> I am using Akamai to help manage content delivery.
>
> One of the things Akamai is reporting to me is that they are having difficulty requesting content that needs to be refreshed. Akamai tries up to 4 times, with a 2 second timeout, to update content whose TTL has expired. After the 4th try, Akamai looks to its own cache before returning a 503 error to the user if the content is not available in the cache.
>
> Recently, I've noticed that Akamai is encountering an increasingly large number of 503 and 404 errors from my website. I've traced the 404 errors to missing images, but I'm not sure what the cause of the 503 errors could be.
> I had some external resources help me verify that they are able to retrieve the content from the Glassfish application servers even when HAProxy is reporting the 503 errors.
>
> One thing I did notice about the HAProxy configuration is that there are actually three servers running HAProxy with identical configurations. One serves as the primary high availability load balancer while the other two act as failovers. The keep-alive daemons are configured to accommodate that setup.
>
> From this generic description, is there something in the way this architecture is set up, or in the configuration of HAProxy, that may be causing the 503 errors to be reported to Akamai? As I mentioned, when an external resource makes a request for the same content directly from the application server, the same errors do not appear to occur.

A 503 would (usually) mean haproxy sees the backends as DOWN; look for messages about servers going up/down in the haproxy logs. Or, if that's not the case, grep the haproxy logs for those 503 errors (make sure you're using HTTP log mode), then go to section 8 in http://haproxy.1wt.eu/download/1.3/doc/configuration.txt and try to determine the exact reason for each error, and/or post a few examples here along with your haproxy config (with "sensitive" information removed, ofc ;) ). Having some kind of monitoring, or at least the stats page active, is also very helpful.

--
Mariusz Gronczewski (XANi)
GnuPG: 0xEA8ACE64
http://devrandom.pl
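The two checks suggested here amount to a pair of greps over the haproxy log. The sketch below runs them against a tiny hand-made sample file: the path and both log lines are invented for illustration (point the greps at wherever your rsyslogd instances actually write haproxy's log).

```shell
#!/bin/sh
# Illustrative only: fabricate two sample haproxy log lines in a scratch file.
LOG=/tmp/haproxy_sample.log
cat > "$LOG" <<'EOF'
May 17 14:45:01 lb1 haproxy[1234]: Server www_backend/app01 is DOWN. 2 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May 17 14:45:09 lb1 haproxy[1234]: 10.0.0.5:4012 [17/May/2010:14:45:09.123] www_front www_backend/<NOSRV> 0/8002/-1/-1/8002 503 212 - - sQ-- 40/40/30/0/0 0/10 "GET /index.jsp HTTP/1.1"
EOF

# Server state transitions ("Server xxx/yyy is UP" / "is DOWN"):
fgrep ' is ' "$LOG"

# HTTP 503s in httplog format; the termination flags ("sQ--" here would mean
# the request timed out while queued) are decoded in section 8 of
# configuration.txt:
grep ' 503 ' "$LOG"
```

If both greps come back empty on a real log, the log is most likely being filtered before it reaches the file, which matches Willy's suspicion elsewhere in this thread.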
A (Hopefully Not too Generic) Question About HAProxy
Hi,

Please excuse me if the information contained in this email is a bit generic. I'm not the regular administrator, but I've been given the task of troubleshooting some issues with my website. If more details are needed, I'll gladly go look for additional information.

Currently, my website is experiencing a lot of errors and slowness. The server errors I see in the HAProxy log file are mainly 503 errors. Page loads for my website can take as long as minutes. I'm trying to determine if anyone has had similar issues using HAProxy as a high availability load balancer.

HAProxy 1.3.21
CentOS running on Citrix XenServer
HP blades

There are actually almost 100 virtual servers running on the blades. A good many of the virtual servers are application servers running Glassfish. There are a few servers dedicated to CAS for authentication and access. I have three servers running Rsyslogd for writing HAProxy log data to file. A NetApp filer is used for storage.

Currently, the website gets about:

73,000 pageviews a day
32,000 unique visitors a day
46,000 visits a day

3,000 pageviews an hour
1,300 unique visitors an hour
1,000 visits an hour

I am using Akamai to help manage content delivery.

One of the things Akamai is reporting to me is that they are having difficulty requesting content that needs to be refreshed. Akamai tries up to 4 times, with a 2 second timeout, to update content whose TTL has expired. After the 4th try, Akamai looks to its own cache before returning a 503 error to the user if the content is not available in the cache.

Recently, I've noticed that Akamai is encountering an increasingly large number of 503 and 404 errors from my website. I've traced the 404 errors to missing images, but I'm not sure what the cause of the 503 errors could be. I had some external resources help me verify that they are able to retrieve the content from the Glassfish application servers even when HAProxy is reporting the 503 errors.
One thing I did notice about the HAProxy configuration is that there are actually three servers running HAProxy with identical configurations. One serves as the primary high availability load balancer while the other two act as failovers. The keep-alive daemons are configured to accommodate that setup.

From this generic description, is there something in the way this architecture is set up, or in the configuration of HAProxy, that may be causing the 503 errors to be reported to Akamai? As I mentioned, when an external resource makes a request for the same content directly from the application server, the same errors do not appear to occur.

thanks,
C.Y.
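For context, the "primary plus two failovers" arrangement described above is usually built with keepalived: one VRRP instance on all three boxes, with different priorities, sharing a single virtual IP. This is a generic sketch with invented interface name, router id, and VIP, not the poster's actual configuration:

```keepalived
vrrp_instance VI_HAPROXY {
    state MASTER            # "state BACKUP" on the two failover boxes
    interface eth0
    virtual_router_id 51
    priority 101            # lower values (e.g. 100 and 99) on the failovers
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24       # the shared VIP that clients (and Akamai) hit
    }
}
```

Only the box currently holding the VIP receives traffic, so the three identical haproxy configurations are harmless: the failovers sit idle until VRRP moves the address.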