TCP retransmits under load
Hi all,

I have been having a peculiar case of TCP connect retransmits that our clients experience while using our web service. We are running haproxy 1.4.8 on CentOS in EC2, though I am planning to migrate to 1.4.10.

At first, we saw some SYN flooding messages, which we fixed by raising the net.core.somaxconn kernel parameter so it would allow us a bigger listen backlog instead of limiting us to the default of 128. That did make a visible difference, but retransmits still appear at peak loads, enough to affect the quality of service as clients time out on connect. Peak loads represent 1,600-1,900 rps per haproxy instance.

We did take some TCP dumps and are trying to make sense of them now. Basically, it seems that retransmits on connect start occurring when the number of concurrent sessions on the haproxy side hits around 60. That puts a lot of pressure on us.

I have some limits in place based on prior experience, but my understanding is that they are all implemented at the HTTP level. In general I fail to see how haproxy could affect the SYN/ACK sequences that are handled by the TCP stack. I think haproxy is not rejecting anything on the frontend and accepts all the connections. My config is below. I have two limits implemented:

- An ACL-based one that checks the number of concurrent sessions on the frontend. If exceeded, the connection is redirected to the overload backend, which results in a 503.
- A per-server limit. My understanding is that if one or more servers hit the limit, the connection is placed into the queue for 1us and then immediately expires. This again should produce a 503/504 response.

Looking forward to any suggestions. Thanks.

global
    daemon
    maxconn 1

defaults
    mode http
    balance uri
    hash-type consistent
    log /dev/log local0
    option httplog
    backlog 3000
    timeout client 1s
    timeout server 2s
    timeout connect 500ms
    # We choose a small timeout for the queue so that excessive
    # connections beyond the limit on the individual server side are
    # quickly rejected. We choose the smallest supported unit, 1us.
    timeout queue 1us
    default_backend servers

frontend http-in
    # Maximum connections on the frontend
    bind *:8080
    monitor-uri /status
    acl site_dead nbsrv(servers) lt 1
    monitor fail if site_dead
    maxconn 3000
    acl too_many fe_conn gt 420
    acl too_fast fe_sess_rate gt 3000
    use_backend overload if too_many or too_fast

backend overload

backend servers
    stats enable
    stats uri /haproxy?status
    stats refresh 5s
    stats show-legends
    stats show-node
    acl cloud_req dst_port eq 8080
    stats http-request allow if cloud_req
    option forceclose
    option abortonclose
    option forwardfor
    option redispatch
    retries 1
    server svr1 host:port check inter 1000 rise 5 fall 3 maxconn 28
    [...]
    server svrN host:port check inter 1000 rise 5 fall 3 maxconn 28

-- Dmitri Smirnov
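P.S. A quick way to see whether the kernel is still dropping SYNs after the somaxconn change is to watch the listen-queue counters. A rough sketch (it assumes the kernel exposes the TcpExt ListenOverflows / ListenDrops fields in /proc/net/netstat, which may vary by kernel):

    #!/usr/bin/env python
    # Sketch only: read the kernel's listen-queue drop counters to confirm
    # whether SYNs are still being dropped after raising net.core.somaxconn.
    # Assumes a Linux kernel exposing TcpExt ListenOverflows / ListenDrops.

    def tcp_ext_counters(path="/proc/net/netstat"):
        counters = {}
        with open(path) as f:
            lines = f.readlines()
        # /proc/net/netstat comes in header/value line pairs per protocol section.
        for header, values in zip(lines[::2], lines[1::2]):
            proto, names = header.split(":", 1)
            _, nums = values.split(":", 1)
            if proto.strip() == "TcpExt":
                counters = dict(zip(names.split(), (int(n) for n in nums.split())))
        return counters

    if __name__ == "__main__":
        c = tcp_ext_counters()
        print("ListenOverflows:", c.get("ListenOverflows"))
        print("ListenDrops:", c.get("ListenDrops"))

If those counters keep climbing at peak, the drops are happening before haproxy ever sees the connection.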
Stats question
For a backend running in http mode, what exactly do the following metrics include?

10. dreq: denied requests - apparently only for the FE. What does this include?
11. dresp: denied responses
12. ereq: request errors - what does this include?
13. econ: connection errors
    Does this include connection failures due to maxconn limits? Entries that expired from the queue?
14. eresp: response errors
    Does this mean HTTP status codes or some other kinds of errors?

When hovering the mouse over Session Total I can see a tool-tip with an HTTP code breakdown. Is there a way I can get those via the HTTP interface or some other means?

thanks,

-- Dmitri Smirnov
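Separately, in case it helps: below is a rough sketch of how I imagine pulling these counters from the CSV export of the stats URI. The host and port are placeholders, and the exact set of exported columns on this build (e.g. the hrsp_* per-status-code fields) is an assumption I have not verified.

    #!/usr/bin/env python
    # Sketch: pull the haproxy CSV stats export and print the error/denial
    # counters discussed above. The URL matches "stats uri /haproxy?status"
    # from my config; host, port and the column set are assumptions.
    import csv
    import urllib.request

    STATS_URL = "http://host:8080/haproxy?status;csv;norefresh"  # placeholder host

    def fetch_stats(url=STATS_URL):
        raw = urllib.request.urlopen(url).read().decode("ascii", "replace")
        lines = raw.splitlines()
        # The first line is the header, prefixed with "# ".
        header = lines[0].lstrip("# ").split(",")
        return [dict(zip(header, row)) for row in csv.reader(lines[1:]) if row]

    if __name__ == "__main__":
        wanted = ["dreq", "dresp", "ereq", "econ", "eresp",
                  "hrsp_2xx", "hrsp_3xx", "hrsp_4xx", "hrsp_5xx"]
        for row in fetch_stats():
            picked = {k: row.get(k, "") for k in wanted}
            print(row["pxname"], row["svname"], picked)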
monitor-uri does not respond
I set up a monitor-uri /status on one of the frontends; however, GET http://host:port/status returns nothing, and the health checks from a load balancer fail. Security permissions are verified and OK. The mode is http in the defaults section. I was expecting a 200 to be returned. What am I missing?

-- Dmitri Smirnov
Y!IM: ia_slepukhin
Re: monitor-uri does not respond
I think it is a bug. The manual says that monitor requests are responded to before ACLs are evaluated. By ACLs I mean ones like this:

    monitor-uri /status
    acl valid_src1 hdr_ip(X-Forwarded-For) xx
    acl valid_src2 hdr_ip(X-Forwarded-For) xx
    tcp-request content reject unless valid_src1 or valid_src2

As soon as I remove the ACL check, monitor-uri starts responding. I am running 1.4.8.

On 12/08/2010 10:22 AM, Dmitri Smirnov wrote:
> Setup a monitor-uri /status on one of the frontends, however, GET
> http://host:port/status returns nothing and health checks fail for a load
> balancer. Security permissions are verified and OK. mode is http in the
> defaults section. I was expecting that 200 is returned. What am I missing?
Re: Limiting throughput with a cold cache
Thank you all for your help.

On 11/16/2010 03:45 AM, Bedis 9 wrote:
> Hi,
> By the way, why use Squid? Have you already tried Varnish?
> It has a grace function which might help you :)
> During the grace period, varnish serves stale (but cacheable) objects while
> retrieving the object from the backend.
> Of course, it depends if that solution makes sense with your application.

Our version of squid has a patch that revalidates in the background while still serving stale data. Used it back at Y!. We are not persisting the squids to disk; that may be helpful in some circumstances.

-- Dmitri Smirnov
Y!IM: ia_slepukhin
Re: Limiting throughput with a cold cache
Willy, thank you for taking the time to respond. This is always thought-provoking.

On 11/13/2010 12:18 AM, Willy Tarreau wrote:
> Why reject instead of redispatching ? Do you fear that non-cached requests
> will have a domino effect on all your squids ?

Yes, this indeed happens. Also, the objective is not to exceed the number of connections from the squids to the backend database. With a cold cache, a redispatch will cause a cache entry to be brought into the wrong shard, where it is unlikely to be reused. Thus it would use up a valuable connection just to satisfy one request.

However, even this is an optimistic scenario. These cold-cache situations happen due to external factors, such as an AWS issue or our home-grown DNS getting messed up (AWS does not provide DNS), which cause not all of the squids to be reported to the proxies and throw off the distribution. This is because haproxy is restarted after the config file is regenerated. I have been thinking about preserving some of the distribution using server IDs when the set of squids partially changes, but that's another story, let's not digress.

Thus, even with redispatch enabled, the other squid is unlikely to have free connection slots, because when one goes cold, most of them do. Needless to say, most of the other components in the system are also in distress when something happens on a large scale. So I choose the stability of the system as the priority, even though some clients will be refused service, which happens to be the least of the evils.

> I don't see any http-server-close there, which makes me think that your
> squids are selected upon the first request of a connection and will still
> have to process all subsequent requests, even if their URL don't hash to
> them.

Good point. This is not the case, however: forceclose takes care of it, and I can see that most of the time the number of concurrently open connections to any particular squid changes very quickly in a range of 0-3, even though each of them handles a chunk of requests per second.

> Your client timeout at 100ms seems way too low to me. I'm sure you'll
> regularly get truncated responses due to packet losses. In my opinion, both
> your client and server timeouts must cover at least a TCP retransmit (3s),
> so that makes sense to set them to at least 4-5 seconds. Also, do not forget
> that sometimes your cache will have to fetch an object before responding, so
> it could make sense to have a server timeout which is at least as big as the
> squid timeout.

Agree. Right now the server timeout in prod is 5s, according to what is recommended in the docs. In fact, I will probably reverse my timeout changes to be in line with your recommendations.

Having slept on the problem, I came up with a fairly simple idea which is not perfect but, I think, delivers most of the bang for such a simple change. It revolves around adding a maxconn restriction for every individual squid in the backend. The number can be easily calculated and then tuned after a load test.

Let's assume I have 1 haproxy in front of a single squid. Furthermore, for simplicity: HIT latency 5ms, MISS latency 200ms. Incoming traffic is 1,000 rps at peak. From the squids to the backend let's allow 50 connections max, i.e. 250 rps max. So through a single connection we can process 200 rps of hot (HIT) traffic, but only 5 rps with a cold cache. This means that to support hot traffic we need at least 5 connections. At the same time this will throttle MISS requests to a max of 25 rps.
Because we have 250 rps max at the backend, we can raise maxconn to 50 for the squid. This creates a range of 250-10,000 rps. As the caches warm up, the traffic mix drifts towards the hot model, so the same number of connections will process more and more requests until it reaches the 99.6% hit rate we see in our case.

I chose to leave the queue size unlimited but put a fast expiration time on the queue entries, so they are rejected with a 503, unless you have other recommendations. I also chose to impose an individual maxconn rather than a backend maxconn. This is to prevent MISS requests from using up the whole connection allowance and to allow HITs to be served quickly from hot shards. I am still pondering this point, though. The situation would be more complicated if the maxconn were too big for MISS and too small for HIT, but this is not the case.

The biggest problem remaining: clients stop seeing rejections only when at least 5 connections are available for HIT traffic. This means that MISS traffic should be at 225 rps at most, i.e. the caches must be about 77% hot. I will test experimentally whether this takes too long.

thanks,

-- Dmitri Smirnov
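To keep the arithmetic honest, here is the same back-of-the-envelope model spelled out. The numbers are the illustrative ones from this thread, not measurements:

    # Back-of-the-envelope model for the maxconn sizing described above.
    HIT_LATENCY_S = 0.005      # 5 ms per cached response
    MISS_LATENCY_S = 0.200     # 200 ms when squid has to go to the database
    PEAK_RPS = 1000            # incoming traffic at peak
    BACKEND_CONNS = 50         # max connections from squids to the database

    hit_rps_per_conn = 1.0 / HIT_LATENCY_S     # 200 rps through one connection
    miss_rps_per_conn = 1.0 / MISS_LATENCY_S   # 5 rps through one connection

    # Connections needed to carry all-HIT traffic at peak.
    conns_for_hot = PEAK_RPS / hit_rps_per_conn              # 5
    # With only those connections, MISS throughput is capped at:
    miss_cap_hot_sizing = conns_for_hot * miss_rps_per_conn  # 25 rps

    # Raising per-squid maxconn to the backend connection budget (50) gives a
    # throughput range depending on the HIT/MISS mix:
    worst_case_rps = BACKEND_CONNS * miss_rps_per_conn   # 250 rps, all MISS
    best_case_rps = BACKEND_CONNS * hit_rps_per_conn     # 10,000 rps, all HIT

    print(conns_for_hot, miss_cap_hot_sizing, worst_case_rps, best_case_rps)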
Limiting throughput with a cold cache
Hi,

We have been using haproxy for a few months now and the benefits have been immense. This list in particular is an indispensable resource. We use haproxy 1.4.8 in the cloud to consistently distribute requests among squids. We run N proxies in front of M squids in different availability zones, all with the same configuration. This also shields the clients from the volatile nature of Amazon instances behind the proxies, as the proxies instantly redispatch requests when squids go down. By doing this we, of course, lose a portion of the cache, but that is acceptable when only 1 or 2 squids are out.

This brings me to the biggest challenge we currently have, which is a cold or mostly cold cache. There is a drastic difference in the performance characteristics of the system when caches are cold and when they are hot. While we are hot we serve about 6,000-10,000 rps, queues to backends are zero, and the number of concurrent connections to backends is near zero. This boils down to a range of 600-1,000 rps per haproxy instance for 10 instances.

With cold caches, or when the distribution is thrown off, the latencies shoot up 2-3 orders of magnitude and the number of concurrent connections to squids goes up to hundreds. This led to a flood of client retries (being fixed now), often maxing out the number of sockets, leading haproxy to believe that squids were not reachable and marking them down (flip/flop). This in turn led to redispatches, making the picture even worse. The limiting factor here is latency and the number of concurrent persistent connections that can be established from the squids back to the database, which I believe to be around 500. Naturally, we have caches here in the first place for a reason. This problem will require some time to address and is being actively worked on.

While there are some fundamental problems here to work on, I was wondering if I could quickly tweak the haproxy configuration to gracefully support both modes of operation in the short term, since it is currently the only place in the chain where powerful scripting can be done. The objectives are:

1) Allow maximum possible throughput when caches are hot.
2) When caches are cold, sustain a level of throughput that will allow caches to warm up w/o melting the system down.
3) Detect a slowdown by checking one or more of:
   - avg_queue size
   - queue size
   - the number of concurrent connections going up
4) Quickly reject requests that come in beyond the predetermined cold-cache capacity. If possible, do it on an individual server level rather than on a backend level (for cases when only some caches are cold).

One of the issues here is that if I specify maxconn for an individual server, the connection is not rejected but goes to a queue. If I limit the queue size, then when the timeout expires it will redispatch to another server. I want redispatches only when a squid is down.

Below is a version of the config under construction, somewhat simplified. I will work out the exact numbers later. Right now the server maxconn, slowstart timeout and queue threshold are pure speculation. I would appreciate any help as I am trying to wrap my brain around a lot of variables here and the available tuning knobs.
-- Dmitri Smirnov

# This is a CE haproxy test config boilerplate
global
    daemon
    stats socket /apps/haproxy/var/stats level admin
    maxconn 1

defaults
    mode http
    balance uri
    hash-type consistent
    # local0 needs to be configured at /etc/syslog.conf
    log /dev/log local0
    option httplog
    # Maximum number of concurrent connections on the frontend,
    # set to half of the total max in the global section above
    maxconn 5000
    # timeout client is the max time of client inactivity
    # when the client is expected to ack or send data;
    # we do not want to tie up resources for a long time
    timeout client 100ms
    # This is the max time to wait for a connection to a server to succeed
    timeout connect 200ms
    # This is the maximum time to wait in a queue at the backend.
    # By default it is the same as timeout connect, but we set it explicitly.
    # Below we do not allow the queue to grow beyond 1, as that indicates
    # that servers are slow and overloaded.
    timeout queue 200ms
    # Maximum inactivity timeout for the server to ack or send data.
    # In other words, in a meltdown we are not going to wait for slow data
    # to come back (not what is currently in prod), but this should still
    # allow squid to refill; the max time is usually less than a second.
    timeout server 1000ms

frontend http-in
    bind *:8080
    default_backend servers
    # Problem: if one squid is cold, this rejects requests for the whole farm
    acl q_too_long avg_queue(servers) gt 0
    use_backend overload if q_too_long

backend overload
    # HAProxy will issue a 503 because no servers are available for this backend.
    # Here we customize the response.
    errorfile 503 /apps/haproxy/etc/fe_503.http

backend servers
    stats enable
    stats uri /haproxy
Consistent hashing question
Hi all,

While doing consistent hashing I observed (as expected) that the order of backend servers in the configuration affects the distribution of the load. Being in the cloud, I am forced to regenerate the configuration file and restart, because both the public host names and the addresses change most of the time as instances are replaced. To keep the distribution consistent I sort the list of servers by host name. However, I am not sure this is exactly the right thing to do.

Question: what exactly is inserted into the consistent hashing tree:

1) the name of the server that I specify,
2) the host name, or
3) the resolved ten-dot address?

I am not looking at the source code right now, in hopes that the community can provide more insight faster.

thank you,

-- Dmitri Smirnov
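Just to illustrate why the choice of key matters, here is a generic ketama-style sketch. This is NOT haproxy's implementation, and the server names and addresses are made up, but it shows that keying the ring by the configured server name versus the resolved address produces entirely different mappings:

    # Generic consistent-hash ring sketch (not haproxy's code): change the
    # node key (server name vs resolved address) and the URI-to-server
    # mapping changes completely.
    import hashlib
    from bisect import bisect

    def h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def build_ring(node_keys, points_per_node=160):
        return sorted((h("%s-%d" % (key, i)), key)
                      for key in node_keys for i in range(points_per_node))

    def lookup(ring, uri):
        idx = bisect(ring, (h(uri),)) % len(ring)
        return ring[idx][1]

    # The same three servers keyed two different ways (values are made up):
    by_name = build_ring(["svr1", "svr2", "svr3"])
    by_addr = build_ring(["10.0.1.17", "10.0.2.41", "10.0.3.9"])

    uri = "/some/cacheable/object"
    print(lookup(by_name, uri), lookup(by_addr, uri))  # often different servers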
maxconn settings
I would like to configure haproxy in a way that the stats page is always available. Currently it is served from the same port as data, and as such I suspect it is subject to the maxconn restrictions in both the global and frontend configuration sections.

Is there a way to configure things so that even if haproxy hits maxconn we can still GET stats? Otherwise it appears as if the instance is down altogether. I am not sure if this is a real problem, since I could not find anything in the doc; however, it appears that it is.

thanks,

Dmitri
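One workaround I am considering (a sketch only; it assumes the admin socket from my other config, "stats socket /apps/haproxy/var/stats level admin", is enabled and that the path is correct) is to read "show stat" over the UNIX stats socket. That at least does not go through the HTTP frontend, so the frontend maxconn should not block it; whether it is also exempt from the global limit is something I have not verified.

    #!/usr/bin/env python
    # Sketch: read stats over haproxy's UNIX admin socket instead of the
    # HTTP frontend. Assumes "stats socket ... level admin" is configured;
    # the path below is from my other config and may differ on your box.
    import socket

    def show_stat(path="/apps/haproxy/var/stats"):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(path)
        s.sendall(b"show stat\n")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        s.close()
        return b"".join(chunks).decode("ascii", "replace")

    if __name__ == "__main__":
        print(show_stat())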
Re: Is there anyway to get latency from stats interface?
On Aug 11, 2010, at 2:28 PM, Willy Tarreau wrote:
>> For our product haproxy represents the actual endpoint and just looking at
>> the stats it would be really helpful to see what the latency per server or
>> even the whole backend which then can be plotted.
>
> Well, added to the 1.5 roadmap now ;-)
>
> For extracting data and putting it into an external monitoring system, say
> once a minute, I tend to favor GET
> http://haproxy_host/haproxy?status;csv;norefresh and parse it.
>
> Cheers,
> Willy

thanks for your response,

Dmitri
Is there anyway to get latency from stats interface?
Hi,

I have been using haproxy for a couple of months and the product is very solid, thank you. It is now taking production traffic.

The stats page is very useful with respect to rps, etc. In our case it would be helpful to add an average backend latency over a second/minute, and/or per individual backend server. I know haproxy logs per-request latencies, but parsing logs at these rates is not fun. For our product haproxy represents the actual endpoint, and just by looking at the stats it would be really helpful to see the latency per server, or even for the whole backend, which could then be plotted.

thanks,

Dmitri