Re: Healthchecks with many nbprocs

2016-06-21 Thread Daniel Ylitalo

Thanks!

That helped quite a lot with a 1s cache :)

Best regards
Daniel Ylitalo
System & Network manager

about.mytaste.com 



"Experience is something you earn just right after you screwed up and 
were really in need of it"


On 2016-06-20 at 17:50, CJ Ess wrote:
We have pools of HAProxy talking to pools of Nginx servers with
php-fpm backends. We were seeing 50-60 health checks per second, all
of which had to be serviced by the php-fpm process and which almost
always returned the same result, except for the rare memory or NIC
failure. So we put Nginx's cache feature with a 1-second TTL in front
of our application's health check endpoint, so that the first request
actually hits the backend and the other health check requests queue
up behind it (fastcgi_cache_lock). We set a 250ms timeout on the lock
so that health checks don't queue forever (fastcgi_cache_lock_timeout).


On Mon, Jun 20, 2016 at 7:44 AM, Daniel Ylitalo wrote:


Hi!

I haven't found anything about this topic anywhere, so I was hoping
someone on the mailing list has done this in the past :)

We are at the size where we need to round-robin tcp balance our
incoming web traffic with pf to two haproxy servers, both running
with nbproc 28 for http load balancing; however, this leads to 56
health checks being run against our web nodes each second, which
hammers them quite hard.

How exactly are you guys solving this issue? Because at this size, the
health checks start eating more CPU than they are worth.

-- 
Daniel Ylitalo

System & Network manager

about.mytaste.com 



"Experience is something you earn just right after you screwed up
and were really in need of it"

Re: Healthchecks with many nbprocs

2016-06-20 Thread CJ Ess
We have pools of HAProxy talking to pools of Nginx servers with php-fpm
backends. We were seeing 50-60 health checks per second, all of which had
to be serviced by the php-fpm process and which almost always returned
the same result, except for the rare memory or NIC failure. So we put
Nginx's cache feature with a 1-second TTL in front of our application's
health check endpoint, so that the first request actually hits the
backend and the other health check requests queue up behind it
(fastcgi_cache_lock). We set a 250ms timeout on the lock so that health
checks don't queue forever (fastcgi_cache_lock_timeout).
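
A minimal nginx sketch of that setup, assuming a php-fpm listener on
127.0.0.1:9000 and a /health endpoint (paths, zone name and script name
are illustrative, not from the original post):

    # In the http context: a small cache zone just for health checks
    fastcgi_cache_path /var/cache/nginx/health keys_zone=health:1m;

    server {
        listen 80;

        location = /health {
            fastcgi_cache health;
            fastcgi_cache_key "health";        # one shared entry for every checker
            fastcgi_cache_valid 200 1s;        # the 1-second TTL
            fastcgi_cache_lock on;             # only the first request hits php-fpm
            fastcgi_cache_lock_timeout 250ms;  # don't let checks queue forever
            # Cache the response even if the app sends no-cache headers
            fastcgi_ignore_headers Cache-Control Expires;
            include fastcgi_params;
            fastcgi_param SCRIPT_FILENAME /srv/app/health.php;
            fastcgi_pass 127.0.0.1:9000;
        }
    }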

On Mon, Jun 20, 2016 at 7:44 AM, Daniel Ylitalo wrote:

> Hi!
>
> I haven't found anything about this topic anywhere, so I was hoping
> someone on the mailing list has done this in the past :)
>
> We are at the size where we need to round-robin tcp balance our incoming
> web traffic with pf to two haproxy servers, both running with nbproc 28
> for http load balancing; however, this leads to 56 health checks being
> run against our web nodes each second, which hammers them quite hard.
>
> How exactly are you guys solving this issue? Because at this size, the
> health checks start eating more CPU than they are worth.
>
> --
> Daniel Ylitalo
> System & Network manager
>
> about.mytaste.com
>
>
>
> "Experience is something you earn just right after you screwed up and were
> really in need of it"
>
>


Re: Healthchecks with many nbprocs

2016-06-20 Thread Pavlos Parissis
On 20/06/2016 04:44 AM, Daniel Ylitalo wrote:
> Hi!
> 
> I haven't found anything about this topic anywhere, so I was hoping
> someone on the mailing list has done this in the past :)
> 
> We are at the size where we need to round-robin tcp balance our incoming
> web traffic with pf to two haproxy servers, both running with nbproc 28
> for http load balancing; however, this leads to 56 health checks being
> run against our web nodes each second, which hammers them quite hard.
> 

Shall I assume you are using HTTPS as well? 28 processes would be far
too many for plain HTTP; even with 40GbE links it would be too much.

> How exactly are you guys solving this issue? Because at this size, the
> health checks start eating more CPU than they are worth.
> 

Well, several things can be done:

- Centralize the check state and offload the check execution to the target node

Several companies use ZooKeeper or Consul to store the health check
state. They run a daemon on each backend server which performs the
actual health check and updates the state in ZooKeeper or Consul, and
another daemon on the HAProxy servers which reacts when a server's
state changes and enables or disables that server in HAProxy via the
stats socket, updating the configuration file as well. The latter isn't
needed anymore, as HAProxy preserves server state across reloads with
http://cbonte.github.io/haproxy-dconv/configuration-1.6.html#3.1-server-state-file
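
As a minimal sketch of that state preservation (HAProxy 1.6+; the file
and socket paths are examples):

    global
        # State file that HAProxy reads at startup
        server-state-file /var/lib/haproxy/server-state

    defaults
        # Load the state dumped by the previous processes
        load-server-state-from-file global

    # Before each reload, dump the running state so the new processes
    # pick it up:
    #   echo "show servers state" | socat stdio /var/run/haproxy.sock \
    #       > /var/lib/haproxy/server-state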

In this setup HAProxy only performs a TCP check, to cover the network
partition case where HAProxy can't reach the servers even though the
whole health checking chain still works.
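
In HAProxy terms that is just the default connect check on the server
line, roughly (backend name and address are made up):

    backend be_web
        # No "option httpchk": a bare "check" is a plain TCP connect
        # test, so the application is never touched; the real health
        # logic lives in the daemon on the backend server.
        server web1 10.0.0.11:80 check inter 2s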

Another disadvantage of running health checks on HAProxy with nbproc > 1
is that the processes don't always agree on the status. The problem gets
bigger when you have more than one HAProxy server: all those processes
act as different brains, and at any given moment they never quite agree.

Centralizing the state store and having only one brain (the daemon on
the backend server) do the check avoids this problem.
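
On the HAProxy side, the daemon's work boils down to stats socket
commands like these (socket path and server names are hypothetical; the
socket needs "level admin", and with nbproc > 1 each process has its own
socket, e.g. "stats socket ... process 1", so the command has to be
repeated on each one):

    # Take a server out of / back into rotation via the stats socket
    echo "disable server be_web/web1" | socat stdio /var/run/haproxy.sock
    echo "enable server be_web/web1"  | socat stdio /var/run/haproxy.sock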

- Split the frontends into HTTPS and HTTP

If HTTPS is not mandatory towards the backend, then you can have 20
processes handle the HTTPS traffic and offload to 2 processes which
forward the traffic to the backend servers in clear text.
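
A rough sketch of such a split with the 1.6-era multi-process syntax
(process counts, ports, and names are illustrative, not a drop-in config):

    global
        nbproc 22

    frontend fe_https
        bind-process 3-22              # 20 processes terminate TLS
        bind :443 ssl crt /etc/haproxy/site.pem
        default_backend be_offload

    backend be_offload
        bind-process 3-22
        # Hand the decrypted traffic to the HTTP-only processes
        server clear 127.0.0.1:8080 send-proxy

    frontend fe_http
        bind-process 1-2               # 2 processes forward in clear text
        bind 127.0.0.1:8080 accept-proxy
        default_backend be_web

A nice side effect for this thread: if be_web is also pinned with
"bind-process 1-2", only those two processes run its health checks,
instead of all 28.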

- Disable hyper-threading on Intel

We disabled it on our servers and went from 22 processes to 10, which
dropped capacity by ~8%.

Hope it helps,
Pavlos
