Hi list, I'm tuning some HAProxy instances in front of a large Kubernetes cluster. The config has about 500 hostnames (a la apache/nginx virtual hosts), 3 frontends, 1500 backends and 4000 servers. The first frontend is in tcp mode, binds :443, inspects the sni and does a triage; the second frontend binds a unix socket with ca-file (tls authentication); the last frontend binds another unix socket and does ssl-offload, but without ca-file. This last one serves about 80% of the hostnames. There is also an ssl-passthrough config, which goes from the triage frontend straight to a tcp backend.
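
Roughly, the layout described above looks like the sketch below. This is only an illustration: every name, hostname, address, socket path and certificate path is a placeholder, and the `inspect-delay` value and `verify required` are assumptions rather than the real settings.

```haproxy
# Sketch only: all names, hostnames and paths below are placeholders.
frontend fe_triage
    mode tcp
    bind :443
    # Wait for the TLS ClientHello so the sni can be inspected (delay is an assumption)
    tcp-request inspect-delay 5s
    tcp-request content accept if { req.ssl_hello_type 1 }
    # ssl-passthrough: straight to a tcp backend, no TLS termination here
    use_backend be_passthrough if { req.ssl_sni -i passthrough.example.com }
    # Hostnames that need tls authentication go to the ca-file frontend
    use_backend be_to_auth if { req.ssl_sni -i auth.example.com }
    # Everything else (about 80% of the hostnames) goes to the ssl-offload frontend
    default_backend be_to_offload

backend be_passthrough
    mode tcp
    server s1 192.0.2.10:8443

backend be_to_auth
    mode tcp
    server auth unix@/var/run/haproxy-auth.sock

backend be_to_offload
    mode tcp
    server offload unix@/var/run/haproxy-offload.sock

frontend fe_auth
    mode http
    # "verify required" is an assumption about how tls authentication is enforced
    bind unix@/var/run/haproxy-auth.sock ssl crt /etc/haproxy/certs/ ca-file /etc/haproxy/ca.pem verify required
    # ... use_backend rules for the authenticated hostnames ...

frontend fe_offload
    mode http
    bind unix@/var/run/haproxy-offload.sock ssl crt /etc/haproxy/certs/
    # ... use_backend rules for the remaining hostnames ...
```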
I'm observing some latency on moderate loads (200+ rps per instance). In my tests the p95 was about 25ms in the proxy alone, and the major issue is that I cannot get throughput above 600 rps. The latency easily moves from 25ms on p95 to 1s or more on p50 with 700+ rps.

The problem is of course the large number of rules in the frontend: haproxy needs to check every single bit of configuration for every single host and every single path. Moving the hostname under test to a dedicated frontend with only its own rules gives me about 5ms of p95 latency and more than 5000 rps.

These are my ideas so far for tuning such a configuration:

* Move all possible rules to the backend. Some txn vars would have to be created so they can be inspected there. This will of course help, but there are still a lot of `use_backend if <host-acl> <path-acl>` rules that cannot be removed, I think, and they are evaluated on every single request regardless of the hostname I'm actually interested in. Some hostnames have no path acl, but there are also hostnames with 10+ different paths and their 10+ `use_backend` rules.

* Create some more frontends and unix sockets with at most 50 hostnames or so each. Pros: after the triage, each frontend will only have the `use_backend if` rules of another 49 hostnames. Cons: if a client doesn't send the sni extension, the right frontend cannot be found.

* Perhaps there is a hidden `if <acl> do <some keywords here> done` that I'm missing and that would improve performance, since I could help HAProxy process only the keywords that matter for that request.

* `nbthread` was already tested; I'm using 3, which has the best performance on an 8-core VM. 4+ threads doesn't scale. `nbproc` will also be used; I'm tuning a per-process configuration now.

Is there any other approach I'm missing? Every single millisecond will help.
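
To make the rule pattern and the first idea more concrete, here is a minimal sketch. It is not the real config: the hostnames, paths, backend names, addresses and the variable name `txn.host` are all made up for illustration.

```haproxy
# Sketch only: hostnames, paths, backend names and txn.host are hypothetical.
frontend fe_offload
    mode http
    bind unix@/var/run/haproxy-offload.sock ssl crt /etc/haproxy/certs/

    # First idea: keep only routing in the frontend and stash what the
    # backend rules need in a txn var.
    http-request set-var(txn.host) req.hdr(host),lower

    # Current pattern: one use_backend per host/path pair, checked in
    # order until one matches. With ~500 hostnames this list is long.
    use_backend be_app1_api if { hdr(host) -i app1.example.com } { path_beg /api }
    use_backend be_app1_web if { hdr(host) -i app1.example.com }
    use_backend be_app2     if { hdr(host) -i app2.example.com }

backend be_app1_api
    mode http
    # A per-host rule moved out of the frontend; it only runs for requests
    # that were already routed here, and reads the var set in the frontend.
    http-request set-header X-Forwarded-Host %[var(txn.host)]
    server s1 192.0.2.21:8080

backend be_app1_web
    mode http
    server s1 192.0.2.22:8080

backend be_app2
    mode http
    server s1 192.0.2.23:8080
```

Even with the per-host rules moved to the backends as above, the `use_backend` list in the frontend stays as it is, which is the part I don't know how to shrink.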

