Hello everyone,

I assume this is a somewhat common use case, but I cannot quite wrap my head 
around how to solve this elegantly.

We have several application servers that send their traffic through a local 
haproxy instance. These sidecar instances all know about a set of central load 
balancers (also haproxy) and use resolvers to keep track of any changes to those.
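
For context, the relevant backend on a sidecar looks roughly like this (a 
trimmed-down sketch; the names, addresses, and intervals here are placeholders, 
not our literal config):

```
resolvers internal_dns
    nameserver dns1 10.0.0.2:53
    hold valid 10s

backend central_lbs
    # "ssl" makes both traffic and health checks use TLS; "resolvers"
    # re-resolves the LB hostname so DNS changes are picked up at runtime
    server lb1 central-lb.internal:443 ssl verify none check inter 2s resolvers internal_dns
```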

The central LBs decrypt the traffic, and then dispatch the requests to where 
they actually belong. This allows us to keep the configuration of the sidecars 
very simple.

However, I see lots of 4xx errors counted on the central LBs. I have tracked 
those down to the health checks of all the sidecars, which check in every few 
seconds to see whether their backends are healthy.

The log shows this:

Nov 21 09:58:08 loadbalancer-01 haproxy-internal[4053]: 10.205.100.53:54831 
[21/Nov/2017:09:58:08.277] api_internal~ api_internal/<NOSRV> -1/-1/-1/-1/0 400 
0 - - CR-- 43/1/0/0/3 0/0 {} "<BADREQ>"
Nov 21 09:58:08 loadbalancer-01 haproxy-internal[4053]: 10.205.100.8:53972 
[21/Nov/2017:09:58:08.602] api_internal~ api_internal/<NOSRV> -1/-1/-1/-1/4 400 
0 - - CR-- 43/1/0/0/3 0/0 {} "<BADREQ>"
Nov 21 09:58:09 loadbalancer-01 haproxy-internal[4053]: 10.205.100.23:48776 
[21/Nov/2017:09:58:09.107] api_internal~ api_internal/<NOSRV> -1/-1/-1/-1/3 400 
0 - - CR-- 43/1/0/0/3 0/0 {} "<BADREQ>"
Nov 21 09:58:09 loadbalancer-01 haproxy-internal[4053]: 10.205.100.51:40205 
[21/Nov/2017:09:58:09.765] api_internal~ api_internal/<NOSRV> -1/-1/-1/-1/3 400 
0 - - CR-- 43/1/0/0/3 0/0 {} "<BADREQ>"
Nov 21 09:58:10 loadbalancer-01 haproxy-internal[4053]: 10.205.100.53:54837 
[21/Nov/2017:09:58:10.279] api_internal~ api_internal/<NOSRV> -1/-1/-1/-1/0 400 
0 - - CR-- 43/1/0/0/3 0/0 {} "<BADREQ>"
Nov 21 09:58:10 loadbalancer-01 haproxy-internal[4053]: 10.205.100.8:53978 
[21/Nov/2017:09:58:10.607] api_internal~ api_internal/<NOSRV> -1/-1/-1/-1/5 400 
0 - - CR-- 43/1/0/0/3 0/0 {} "<BADREQ>"

As far as I understand, these are the health checks from the different sidecars, 
which perform an SSL handshake and then reset the connection once they are 
satisfied the backend is present. From the central LB’s perspective, though, 
this must look like a broken client that opens a connection and then aborts 
without ever sending a request.

I can see the reasoning behind that, but it makes the statistics counters 
pretty useless: any monitoring for elevated rates of certain HTTP response 
ranges gets completely thrown off by all these health checks.
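
The only workaround I have come up with so far would be switching the sidecars 
to a real HTTP health check, roughly like this (untested sketch; the path and 
Host header are placeholders), so the central LBs see a proper request instead 
of a bare handshake:

```
backend central_lbs
    # send an actual HTTP request during checks instead of just a TLS handshake
    option httpchk GET /health HTTP/1.1\r\nHost:\ central-lb.internal
    server lb1 central-lb.internal:443 ssl verify none check inter 2s
```

But that would just turn every check into a logged response on the central LBs 
instead, which does not feel much better.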

Is there anything I can do to “make them both happy”?
Any suggestions would be much appreciated.

Thanks,
Daniel



-- 
Daniel Schneller
Principal Cloud Engineer
 
CenterDevice GmbH                  | Hochstraße 11
                                   | 42697 Solingen
tel: +49 1754155711                | Deutschland
[email protected]   | www.centerdevice.de

Geschäftsführung: Dr. Patrick Peschlow, Dr. Lukas Pustina,
Michael Rosbach, Handelsregister-Nr.: HRB 18655,
HR-Gericht: Bonn, USt-IdNr.: DE-815299431

