TCP retransmits under load

2011-01-06 Thread Dmitri Smirnov

Hi, all,

I have been seeing a peculiar case of TCP connect retransmits that our 
clients experience while using our web service. We are running haproxy 
1.4.8 on CentOS in EC2, though I am planning to migrate to 1.4.10.


At first we saw some SYN flooding messages, which we fixed by raising 
the net.core.somaxconn kernel parameter so it would allow us a bigger 
listen backlog instead of limiting us to the default of 128.
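For reference, the relevant listener settings now look roughly like this 
(a sketch; the values mirror the config further below):

   # Note: on Linux the listen() backlog is silently capped by
   # net.core.somaxconn, so that sysctl had to be raised to at least the
   # same value, e.g. "sysctl -w net.core.somaxconn=3000".
   frontend http-in
      bind *:8080
      backlog 3000
      maxconn 3000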


That did make a visible difference, but retransmits still appear at 
peak loads, enough to affect the quality of service as clients time out 
on connect. Peak loads are around 1,600-1,900 rps per haproxy instance.


We did take some TCP dumps and we are trying to make sense out of them now.

Basically, it seems that retransmits on connect start occurring when the 
number of concurrent sessions on the haproxy side hits around 60. That 
puts a lot of pressure on us.


I have some limits in place based on prior experience, but my 
understanding is that they are all implemented at the HTTP level.


In general I fail to see how haproxy could affect the SYN/ACK sequences, 
which are handled by the TCP stack.


I believe haproxy is not rejecting anything on the frontend and accepts 
all the connections.


My config is below. I have two limits implemented:

- An ACL-based limit that checks the number of concurrent sessions on 
the frontend. If exceeded, it redirects the connection to the overload 
backend, which results in a 503.


- A per-server limit. My understanding is that if one or more servers 
hit the limit, the connection is placed into the queue for 1us and then 
immediately expires. This again should produce a 503/504 response.


Looking forward to any suggestions. Thanks.


global
daemon
maxconn 1

defaults
mode http
balance uri
hash-type consistent

log /dev/log local0
option httplog
backlog 3000

timeout client  1s
timeout server  2s
timeout connect 500ms
# We choose a small timeout for the queue so that excessive connections
# beyond the limit on the individual server side are quickly rejected.
# We choose the smallest supported unit, 1us.
timeout queue  1us

default_backend servers

frontend http-in
   bind *:8080
   monitor-uri /status
   acl site_dead nbsrv(servers) lt 1
   monitor fail if site_dead

   # Maximum connections on the frontend
   maxconn 3000
   acl too_many fe_conn gt 420
   acl too_fast fe_sess_rate gt 3000
   use_backend overload if too_many or too_fast

backend overload

backend servers
stats enable
stats uri /haproxy?status
stats refresh 5s
stats show-legends
stats show-node

acl cloud_req dst_port eq 8080
stats http-request allow if cloud_req

option forceclose
option abortonclose
option forwardfor
option redispatch
retries 1

server svr1 host:port check inter 1000 rise 5 fall 3 maxconn 28
[...]
server svrN host:port check inter 1000 rise 5 fall 3 maxconn 28

--
Dmitri Smirnov




Stats question

2010-12-27 Thread Dmitri Smirnov
For a backend running in http mode what exactly do the following metrics 
include?


10. dreq: denied requests - apparently only for FE. What does this include?
11. dresp: denied responses
12. ereq: request errors - What does this include?

13. econ: connection errors

Does the above include connection failures due to maxconn limits?
Entries that expired from the queue?

14. eresp: response errors

Does the above mean HTTP status codes or some other kinds of errors?

When hovering the mouse over Session Total I can see a tool-tip with an 
HTTP status code breakdown. Is there a way I can get those numbers via 
the HTTP interface or some other means?
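My guess, unverified, is that the same numbers are the hrsp_* columns of 
the CSV stats, which I could also pull through the admin socket, e.g.:

   global
      # sketch: with the admin socket enabled, the CSV can be fetched with
      #   echo "show stat" | socat unix-connect:/apps/haproxy/var/stats stdio
      # and the hrsp_1xx .. hrsp_5xx / hrsp_other columns appear to carry
      # the per-status-code counts shown in that tool-tip
      stats socket /apps/haproxy/var/stats level admin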


thanks,
--
Dmitri Smirnov




monitor-uri does not respond

2010-12-08 Thread Dmitri Smirnov


I set up monitor-uri /status on one of the frontends; however,

GET http://host:port/status returns nothing and the health checks from a 
load balancer fail. Security permissions are verified and OK.


mode is http in the defaults section.

I was expecting a 200 to be returned.

What am I missing?

--
Dmitri Smirnov
Y!IM: ia_slepukhin



Re: monitor-uri does not respond

2010-12-08 Thread Dmitri Smirnov
I think it is a bug. The manual says that monitor requests are responded 
to before ACLs are evaluated. By ACLs I mean rules like this:


   monitor-uri /status
   acl valid_src1 hdr_ip(X-Forwarded-For) xx
   acl valid_src2 hdr_ip(X-Forwarded-For) xx
   tcp-request content reject unless valid_src1 or valid_src2

As soon as I remove the ACL check, monitor-uri starts responding.

I am running 1.4.8
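A possible workaround (untested sketch; the 10.0.0.0/8 source range is 
made up for illustration) would be to explicitly accept the health 
checker's traffic before the reject:

   monitor-uri /status
   acl valid_src1 hdr_ip(X-Forwarded-For) xx
   acl valid_src2 hdr_ip(X-Forwarded-For) xx
   # accept the load balancer's health checks before the reject rule
   acl from_lb src 10.0.0.0/8
   tcp-request content accept if from_lb
   tcp-request content reject unless valid_src1 or valid_src2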

On 12/08/2010 10:22 AM, Dmitri Smirnov wrote:


Setup a monitor-uri /status on one of the frontends, however,

GET http://host:port/status returns nothing and health checks fail for a
load balancer. Security permissions are verified and OK.

mode is http in the defaults section.

I was expecting that 200 is returned.

What am I missing?





Re: Limiting throughput with a cold cache

2010-11-16 Thread Dmitri Smirnov

Thank you all for your help.

On 11/16/2010 03:45 AM, Bedis 9 wrote:

Hi,

By the way, why using Squid?
Have you already tried Varnish? It has a grace function which might
help you :)
During grace period, varnish serves stale (but cacheable) objects
while retriving object from backend.
Of course, it depends if that solution makes sense with your application.


Our version of squid has a patch that revalidates in the background 
while still serving stale data. We used it back at Y!.


We are not persisting the squid caches to disk; doing so might be 
helpful in some circumstances.


--
Dmitri Smirnov
Y!IM: ia_slepukhin



Re: Limiting throughput with a cold cache

2010-11-15 Thread Dmitri Smirnov

Willy,

thank you for taking the time to respond. This is always thought-provoking.

On 11/13/2010 12:18 AM, Willy Tarreau wrote:


Why reject instead of redispatching ? Do you fear that non-cached requests
will have a domino effect on all your squids ?


Yes, this indeed happens. Also, the objective is not to exceed the 
number of connections from the squids to the backend database. In the 
case of a cold cache, a redispatch will bring a cache entry into the 
wrong shard, where it is unlikely to be reused. Thus it would use up a 
valuable connection just to satisfy one request.


However, even this is an optimistic scenario. These cold-cache 
situations happen due to external factors, such as an AWS issue or our 
home-grown DNS misbehaving (AWS does not provide DNS), which cause not 
all of the squids to be reported to the proxies and mess up the 
distribution. This is because haproxy is restarted after the config 
file is regenerated.


I have been thinking about preserving some of the distribution using 
server IDs when the set of squids partially changes but that's another 
story, let's not digress.


Thus even with redispatch enabled, the other squid is unlikely to have 
free connection slots, because when one goes cold, most of them do.


Needless to say, most of the other components in the system are also in 
distress when something happens on a large scale. So I chose the 
stability of the system as the priority, even though some clients will 
be refused service, which happens to be the least of the evils.




I don't see any http-server-close there, which makes me think that your
squids are selected upon the first request of a connection and will still
have to process all subsequent requests, even if their URL don't hash to
them.


Good point. This is not the case, however: forceclose takes care of it, 
and I can see that most of the time the number of concurrently open 
connections to any particular squid changes very quickly in a range of 
0-3, even though each of them handles a fair chunk of requests per second.




Your client timeout at 100ms seems way too low to me. I'm sure you'll
regularly get truncated responses due to packet losses. In my opinion,
both your client and server timeouts must cover at least a TCP retransmit
(3s), so that makes sense to set them to at least 4-5 seconds. Also, do
not forget that sometimes your cache will have to fetch an object before
responding, so it could make sense to have a server timeout which is at
least as big as the squid timeout.


Agreed. Right now the server timeout in prod is 5s, in line with what is 
recommended in the docs. In fact, I will probably revert my timeout 
changes to be in line with your recommendations.
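Concretely, something like this is what I expect to end up with (a 
sketch; the exact values are still to be settled):

   defaults
      # cover at least one TCP retransmit (3s), per your suggestion
      timeout client  5s
      # at least as large as squid's own fetch timeout (5s is what prod uses today)
      timeout server  5s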


Having slept on the problem, I came up with a fairly simple idea which 
is not perfect but, I think, gives most of the bang for such a simple change.


It revolves around adding a maxconn restriction for every individual 
squid in the backend.


The number can easily be calculated and then tuned after a load test.

Let's assume I have one haproxy in front of a single squid.

Furthermore, assume a HIT latency of 5ms and a MISS latency of 200ms for simplicity.

Incoming traffic is 1,000 rps at peak.

From the squids to the backend let's allow 50 connections max, i.e. 250 rps of MISS traffic at most.

So through a single connection a hot cache can process about 200 rps, 
while a cold cache will do only 5 rps.

This means that to support the hot traffic we need at least 5 connections.
At the same time this throttles MISS requests to a max of 25 rps.

Because we have 250 rps max at the backend, we can raise maxconn to 50 
for the squid. This creates a capacity range of 250-10,000 rps.
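In config terms the whole idea is just one knob (sketch; the server name 
and host:port are placeholders as elsewhere in my configs):

   backend servers
      # per-squid maxconn 50:
      #   all MISS (cold): 50 conns * (1s / 200ms) =    250 rps
      #   all HIT  (hot) : 50 conns * (1s / 5ms)   = 10,000 rps
      # i.e. cold traffic stays within the 250 rps the database can take
      server squid1 host:port maxconn 50 check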


As caches warm up, the traffic becomes mixed and drifts towards the hot 
model, so the same number of connections will process more and more 
requests until it reaches the 99.6% hit rate we see in our case.


I chose to leave the queue size unlimited but put a fast expiration time 
on queue entries, so they are rejected with a 503, unless you have other 
recommendations.


I also chose to impose a per-server maxconn rather than a backend 
maxconn. This is to prevent MISS requests from using up the whole 
connection limit, and to allow HITs to be served quickly from hot shards.

I am still pondering over this point though.

The situation would be more complicated if the maxconn was too big for 
MISS and too small for HIT but this is not the case.


The biggest problem remaining: clients stop seeing rejections only when 
at least 5 connections are free for HIT traffic. This means MISS traffic 
should be at most 45 connections x 5 rps = 225 rps, i.e. caches must be 
at least ~77% hot at 1,000 rps incoming.


I will test experimentally whether this takes too long.

thanks,
--
Dmitri Smirnov




Limiting throughput with a cold cache

2010-11-12 Thread Dmitri Smirnov

Hi,

We have been using haproxy for a few months now and the benefits have 
been immense. This list in particular is an indispensable resource.


We use haproxy 1.4.8 in the cloud to distribute requests consistently 
among the squids.


We run N proxies in front of M squids in different availability zones 
with the same configuration.


It also shields the clients from the volatile nature of Amazon instances 
behind the proxies, as the proxies instantly redispatch requests when 
squids go down.


By doing this we, of course, lose a portion of the cache, but that is 
acceptable when only 1 or 2 squids are out.


This brings me to the biggest challenge we currently have: a cold or 
mostly cold cache.


There is a drastic difference in the performance characteristics for the 
system when caches are cold and when they are hot.


While we are hot we serve about 6,000-10,000 rps, queues to the backends 
are zero, and the number of concurrent connections to the backends is 
near zero. This boils down to 600-1,000 rps per haproxy instance across 
10 instances.


With cold caches, or when the distribution is thrown off, latencies 
shoot up by 2-3 orders of magnitude and the number of concurrent 
connections to the squids goes up to hundreds. This leads to a flood of 
client retries (being fixed now), often maxing out the number of 
sockets, which leads haproxy to believe that squids are not reachable 
and to mark them down (flip/flop). That in turn led to redispatches, 
making the picture even worse.


The limiting factors here are latency and the number of concurrent 
persistent connections that can be established from the squids back to 
the database, which I believe is around 500.

Naturally, we have caches here in the first place for a reason.
This problem will require some time to address and is being actively 
worked on.


While there are some fundamental problems here to work on, I was 
wondering if I could quickly tweak the haproxy configuration to 
gracefully support both modes of operation in the short term, since it 
is currently the only place in the chain where powerful scripting can be done.


The objectives are:

1) Allow maximum possible throughput when caches are hot.

2) When caches are cold, sustain a level of throughput that allows the 
caches to warm up without melting the system down.


3) Detect a slowdown by checking one or more of:
- avg_queue size
- queue size
- the number of concurrent connections going up

4) Quickly reject requests that come in beyond the predetermined 
cold-cache capacity. If possible, do it at the individual server level 
rather than at the backend level (for cases when only some caches are cold).


One of the issues here is that if I specify maxconn for an individual 
server, the connection is not rejected but goes into a queue. And if I 
limit the queue size, then when the timeout expires the request is 
redispatched to another server. I want redispatches only when a squid is down.


Below is a version of the config, still under construction and somewhat simplified.

I will work out the exact numbers later. Right now the server maxconn, 
slowstart timeout and queue threshold are pure speculation.


I would appreciate any help as I am trying to wrap my brain around a lot 
of variables here and available tuning knobs.


--
Dmitri Smirnov

# This is a CE haproxy test config boilerplate
global
daemon
stats socket /apps/haproxy/var/stats level admin
maxconn 10000

defaults
mode http
balance uri
hash-type consistent
# local0 needs to be configured at /etc/syslog.conf
log /dev/log local0
option httplog

# Maximum number of concurrent connections on the frontend
# set to be the half of the total max in the global section above
maxconn 5000

# timeout client is the max time of client inactivity,
# when the client is expected to ack or send data;
# we do not want to tie it up for a long time
timeout client  100ms

# This is the max time to wait for a connection to a server to succeed
timeout connect 200ms

# This is the maximum time to wait in a queue at the backend.
# By default it is the same as timeout connect but we set it explicitly.
# Below we do not allow the queue to grow beyond 1, as that indicates
# that servers are slow and overloaded.
timeout queue 200ms

# Maximum inactivity timeout for the server to ack or send data.
# In other words, in situations of meltdown we are not going to wait for
# slow data to come back (not what is currently in prod),
# but this will still hopefully allow squid to refill.
# Max time is usually less than a second.
timeout server 1000ms

frontend http-in
   bind *:8080
   default_backend servers

# Problem: if one squid is cold, this rejects requests for the whole farm
   acl q_too_long avg_queue(servers) gt 0
   use_backend overload if q_too_long

backend overload
# HAproxy will issue a 503 because no servers are available for this backend.
# Here we customize the response.
errorfile 503 /apps/haproxy/etc/fe_503.http

backend servers
stats enable
stats uri /haproxy

Consistent hashing question

2010-10-06 Thread Dmitri Smirnov

Hi all,

While using consistent hashing I observed (as expected) that the order 
of backend servers in the configuration affects the distribution of the load.


Being in the cloud, I am forced to regenerate the configuration file and 
restart, because both the public host names and the addresses change 
most of the time as instances are replaced. To keep the distribution 
consistent I sort the list of servers by host name.


However, I am not sure if this is exactly the right thing to do.

Question: what exactly is inserted into the consistent hashing tree:

1) the server name that I specify,
2) the host name, or
3) the resolved ten-dot address?

I am not looking at the source code for now, in the hope that the 
community will provide more insight faster.


thank you,

--
Dmitri Smirnov





maxconn settings

2010-08-23 Thread Dmitri Smirnov
I would like to configure haproxy in a way that the stats page is always available.

Currently it is served from the same port as the data, so I suspect it is subject to
the maxconn restrictions of both the global and frontend configuration sections.

Is there a way to configure it so that even if haproxy hits maxconn we can still
GET the stats? Otherwise it appears as if the instance is down altogether.

I am not sure if this is a real problem since I could not find anything in the doc.
However, it appears that it is.
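For illustration, what I have in mind is a dedicated listener for the 
stats page with its own maxconn, roughly like this (untested sketch; the 
port and limit are made up), though I suppose the global maxconn would 
still apply:

   listen stats-only
      bind *:8090
      mode http
      maxconn 10
      stats enable
      stats uri /haproxy?status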

thanks,
Dmitri


Re: Is there anyway to get latency from stats interface?

2010-08-12 Thread Dmitri Smirnov
On Aug 11, 2010, at 2:28 PM, Willy Tarreau wrote:

 
 
 For our product haproxy represents the actual endpoint and just looking at 
 the stats it would be really helpful to see what the latency per server or 
 even the whole backend which then can be plotted.
 
 Well, added to the 1.5 roadmap now ;-)

For extracting data and putting it into an external monitoring system, say once 
a minute, I tend to favor 
GET http://haproxy_host/haproxy?status;csv;norefresh and parsing the result.

thanks for your response,
Dmitri

 
 Cheers,
 Willy
 
 
 




Is there any way to get latency from the stats interface?

2010-08-10 Thread Dmitri Smirnov
Hi,

I have been using haproxy for a couple of months and the product is very solid, 
thank you. It is now taking production traffic.
It is now taking production traffic.

The stats page is very useful with respect to rps etc.
In our case it would be helpful to add an average backend latency over a 
second/minute, and/or per individual backend server.

I know haproxy logs per-request latencies, but parsing logs at these rates is 
not fun.

For our product haproxy represents the actual endpoint, and just looking at the 
stats it would be really helpful to see the latency per server, or even for the 
whole backend, which could then be plotted.

thanks,
Dmitri