Hi list, I'm tuning some HAProxy instances in front of a large Kubernetes 
cluster. The config has about 500 hostnames (a la Apache/Nginx virtual hosts), 
3 frontends, 1500 backends and 4000 servers. The first frontend is in tcp mode 
binding :443, inspecting the SNI and doing a triage; the second frontend binds 
a unix socket with ca-file (TLS client authentication); the last frontend binds 
another unix socket, doing SSL offload but without ca-file. This last one 
handles about 80% of the hostnames. There is also an SSL-passthrough config - 
from the triage frontend straight to a tcp backend.
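To make the layout concrete, the triage setup described above looks roughly like this sketch (all names, paths and hostnames here are illustrative, not my real config):

```
frontend fe_triage
    mode tcp
    bind :443
    tcp-request inspect-delay 5s
    tcp-request content accept if { req.ssl_hello_type 1 }
    # passthrough hosts go straight to a tcp backend
    use_backend be_passthrough if { req.ssl_sni -i legacy.example.com }
    # mTLS hosts go to the frontend binding the ca-file socket
    use_backend be_to_mtls_sock if { req.ssl_sni -m end .auth.example.com }
    # everything else: plain SSL offload socket
    default_backend be_to_offload_sock

backend be_to_offload_sock
    mode tcp
    server offload unix@/var/run/haproxy-offload.sock send-proxy-v2
```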

I'm observing some latency under moderate load (200+ rps per instance) - in my 
tests, the p95 was about 25ms spent in the proxy alone, and the major issue is 
that I cannot get throughput above 600 rps. This latency moves easily from 25ms 
at p95 to 1s or more at p50 with 700+ rps. The problem is of course the big 
number of rules in the frontend: HAProxy needs to evaluate every single bit of 
configuration for every single host and every single path. Moving the test 
hostname to a dedicated frontend with only its own rules gives me about 5ms of 
p95 latency and more than 5000 rps.

These are my ideas so far for tuning such a configuration:

* Move all possible rules to the backend. Some txn vars would need to be 
created in the frontend so they can be inspected there. This will of course 
help, but there is still a lot of `use_backend if <host-acl> <path-acl>` that 
cannot be removed, I think, and those are evaluated on every single request 
regardless of the hostname I'm really interested in. Some hostnames have no 
path ACL, but others have 10+ different paths and their 10+ `use_backend` 
lines.
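One way I could sketch this idea (hypothetical names; the map file is an assumption, not something I have today) is to replace the long `use_backend if` chain with a single map lookup on the host, then do the per-host path rules inside the chosen backend, where only that host's traffic pays for them:

```
frontend fe_offload
    mode http
    bind unix@/var/run/haproxy-offload.sock accept-proxy ssl crt /etc/haproxy/certs/
    # store the normalized host once, so backends can inspect it later
    http-request set-var(txn.host) req.hdr(host),lower,field(1,:)
    # single O(log n) map lookup instead of ~400 sequential use_backend rules
    use_backend %[var(txn.host),map(/etc/haproxy/host2backend.map,be_default)]

backend be_app1
    # path-based rules moved here: evaluated only for app1's requests
    http-request deny if { path_beg /admin } !{ src 10.0.0.0/8 }
    server s1 10.1.2.3:8080
```

The map file would contain one `hostname backend` pair per line; hosts with several path-specific backends would still need a small second dispatch, but only among that host's own paths.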

* Create some more frontends and unix sockets with at most 50 hostnames or so 
each. Pros: after the triage, each frontend only carries the `use_backend if` 
rules of 49 other hostnames. Cons: if a client doesn't send the SNI extension, 
the right frontend cannot be found.
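A rough sketch of that sharding, including a fallback for SNI-less clients (shard names and list files are illustrative):

```
frontend fe_triage
    mode tcp
    bind :443
    tcp-request inspect-delay 5s
    tcp-request content accept if { req.ssl_hello_type 1 }
    # ~50 hostnames per list file, one shard frontend each
    use_backend be_shard1 if { req.ssl_sni -f /etc/haproxy/shard1-hosts.lst }
    use_backend be_shard2 if { req.ssl_sni -f /etc/haproxy/shard2-hosts.lst }
    # clients without SNI land in a catch-all shard that still has all rules
    default_backend be_shard_fallback

backend be_shard1
    mode tcp
    server s1 unix@/var/run/haproxy-shard1.sock send-proxy-v2
```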

* Perhaps there is some hidden `if <acl> do <some keywords here> done` 
construct that I'm missing which would improve performance, by letting me tell 
HAProxy to process only the keywords I'm really interested in for that request.

* nbthread has already been tested; 3 gives the best performance on an 8-core 
VM, and 4+ threads don't scale. nbproc will also be used; I'm tuning a 
per-process configuration now.

Is there any other approach I'm missing? Every single millisecond will help.
