Re: HAProxy 1.5 incorrectly marks servers as DOWN

Pavlos Parissis Thu, 04 Sep 2014 13:36:31 -0700

On 04/09/2014 08:55 AM, Juho Mäkinen wrote:
> I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious
> problem where haproxy marks some backend servers as being DOWN with a
> message "L4TOUT in 2000ms". Sometimes the message also has a star: "*
> L4TOUT in 2000ms" (I couldn't find what the star means in the docs).
> Also the reported timeout varies between 2000ms and 2003ms.
>


An L4TOUT status while you have httpchk enabled means that HAProxy
failed to establish a TCP connection within 2 seconds.

Are you sure that you haven't reached any sort of limit on your backend
servers, such as the number of open files?
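
To see whether that 2-second connect limit is really being hit from the
HAProxy host, a rough Python sketch like the one below can time plain TCP
connects. It is not how HAProxy implements its checks, only an
approximation of the layer 4 step. The name tcp_connect_time and the
host/port in the usage comment are made up for this example.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=2.0):
    """Return the seconds needed to complete a TCP connect, or None on failure.

    The 2.0 second default mirrors the "L4TOUT in 2000ms" message; adjust it
    to whatever check timeout your configuration actually uses.
    """
    start = time.monotonic()
    try:
        # create_connection raises an OSError subclass on timeout or refusal.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


# Hypothetical usage (host and port are placeholders, not from the thread):
#   for _ in range(100):
#       print(tcp_connect_time("backend.example", 80))
```

Running this in a loop while a server is marked DOWN should show whether
connects from that machine ever take close to 2 seconds or fail outright.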

> This does not happen to every backend and it doesn't happen immediately.
> After a restart every backend is green, and a few backends start to get
> marked DOWN after about 30 minutes or so. I'm also running two instances
> in two different servers and they both suffer the same problem but the
> DOWN servers aren't the same. So server A might be marked DOWN on haproxy-1
> and server B marked down on haproxy-2 (or vice versa).
> 
> This seems to happen regardless of how much traffic I run into the
> haproxies. I can always ssh into the haproxies and run curl against the
> check url and it always works, so this problem seems to be inside haproxy.
> 

Are you sure that the backend servers return a response with HTTP status
200 on health checks?
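
One way to verify this outside of curl is a small Python sketch that
sends a single check-style request and reports the status code. The
function names, the default path and the GET method are assumptions for
illustration; use whatever method and URI your "option httpchk" line
specifies. As far as I know, the default httpchk accepts any 2xx or 3xx
response, which is what passes_httpchk encodes.

```python
import http.client


def health_check_status(host, port, path="/", method="GET", timeout=2.0):
    """Send one HTTP request and return its status code, or None on any error."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request(method, path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response.
        return None
    finally:
        conn.close()


def passes_httpchk(status):
    """True if the status would count as healthy under the default 2xx/3xx rule."""
    return status is not None and 200 <= status < 400


# Hypothetical usage (host, port and path are placeholders):
#   status = health_check_status("backend.example", 80, "/health")
#   print(status, passes_httpchk(status))
```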

> My haproxy config is kind of long so I copied it here:
> http://koti.kapsi.fi/garo/nobackup/haproxy.cfg (I've sanitised it a bit,
> but only hostnames).
> 

You have only 1 stats server while you have 7 processes. You need to
enable a stats socket for each process. Here is an example from a setup
with 24 processes:

    stats socket /var/lib/haproxy/stats1 uid 0 gid 0 mode 0440 level admin process 1
    stats socket /var/lib/haproxy/stats2 uid 0 gid 0 mode 0440 level admin process 2
    stats socket /var/lib/haproxy/stats3 uid 0 gid 0 mode 0440 level admin process 3
    stats socket /var/lib/haproxy/stats4 uid 0 gid 0 mode 0440 level admin process 4
    stats socket /var/lib/haproxy/stats5 uid 0 gid 0 mode 0440 level admin process 5
    stats socket /var/lib/haproxy/stats6 uid 0 gid 0 mode 0440 level admin process 6
    stats socket /var/lib/haproxy/stats7 uid 0 gid 0 mode 0440 level admin process 7
    stats socket /var/lib/haproxy/stats8 uid 0 gid 0 mode 0440 level admin process 8
    stats socket /var/lib/haproxy/stats9 uid 0 gid 0 mode 0440 level admin process 9
    stats socket /var/lib/haproxy/stats10 uid 0 gid 0 mode 0440 level admin process 10
    stats socket /var/lib/haproxy/stats11 uid 0 gid 0 mode 0440 level admin process 11
    stats socket /var/lib/haproxy/stats12 uid 0 gid 0 mode 0440 level admin process 12
    stats socket /var/lib/haproxy/stats13 uid 0 gid 0 mode 0440 level admin process 13
    stats socket /var/lib/haproxy/stats14 uid 0 gid 0 mode 0440 level admin process 14
    stats socket /var/lib/haproxy/stats15 uid 0 gid 0 mode 0440 level admin process 15
    stats socket /var/lib/haproxy/stats16 uid 0 gid 0 mode 0440 level admin process 16
    stats socket /var/lib/haproxy/stats17 uid 0 gid 0 mode 0440 level admin process 17
    stats socket /var/lib/haproxy/stats18 uid 0 gid 0 mode 0440 level admin process 18
    stats socket /var/lib/haproxy/stats19 uid 0 gid 0 mode 0440 level admin process 19
    stats socket /var/lib/haproxy/stats20 uid 0 gid 0 mode 0440 level admin process 20
    stats socket /var/lib/haproxy/stats21 uid 0 gid 0 mode 0440 level admin process 21
    stats socket /var/lib/haproxy/stats22 uid 0 gid 0 mode 0440 level admin process 22
    stats socket /var/lib/haproxy/stats23 uid 0 gid 0 mode 0440 level admin process 23
    stats socket /var/lib/haproxy/stats24 uid 0 gid 0 mode 0440 level admin process 24

    nbproc 24
    cpu-map odd 0-5 12-17
    cpu-map even 6-11 18-23

listen haproxy1
    bind :8081 process 1
    bind :8082 process 2
    bind :8083 process 3
    bind :8084 process 4
    bind :8085 process 5
    bind :8086 process 6
    bind :8087 process 7
    bind :8088 process 8
    bind :8089 process 9
    bind :8090 process 10
    bind :8091 process 11
    bind :8092 process 12
    bind :8093 process 13
    bind :8094 process 14
    bind :8095 process 15
    bind :8096 process 16
    bind :8097 process 17
    bind :8098 process 18
    bind :8099 process 19
    bind :8100 process 20
    bind :8101 process 21
    bind :8102 process 22
    bind :8103 process 23
    bind :8104 process 24
    stats uri /
    stats show-node
    stats refresh 10s
    stats show-legends


Then check all of them to find which process marks the server down.
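
Checking every process by hand gets tedious, so here is a minimal Python
sketch that asks each stats socket for "show stat" and prints the servers
that process reports as DOWN. It assumes the stats1..stats24 socket paths
from the example above and relies on the pxname, svname, status and
check_status columns of the CSV output. The function names are my own.

```python
import csv
import io
import socket


def query_stats_socket(path, timeout=5.0):
    """Send 'show stat' to one HAProxy stats socket and return the CSV text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(b"show stat\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def down_servers(csv_text):
    """Return (proxy, server, status, check_status) for each server row that is DOWN."""
    # The header line starts with "# "; strip it so the column names are clean.
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("# ")))
    found = []
    for row in reader:
        if row.get("svname") in ("FRONTEND", "BACKEND"):
            continue  # only individual servers are of interest here
        status = row.get("status") or ""
        if status.startswith("DOWN"):
            found.append((row.get("pxname"), row.get("svname"),
                          status, row.get("check_status") or ""))
    return found


def main():
    # Assumption: one socket per process, named as in the config example.
    for n in range(1, 25):
        path = f"/var/lib/haproxy/stats{n}"
        try:
            for px, sv, status, check in down_servers(query_stats_socket(path)):
                print(f"process {n}: {px}/{sv} is {status} ({check})")
        except OSError as exc:
            print(f"process {n}: cannot query {path}: {exc}")


if __name__ == "__main__":
    main()
```

With nbproc 7 you would change the range to 1..7. The sockets in the
example are mode 0440 and owned by uid 0, so the script needs to run as
root.
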
> I've run the logging with verbose debugging to check if that gives any
> clues on the health check issue, but the logs did not reveal anything to
> my eye. I can however gather a new log sample on the health checks, but
> the haproxies are now receiving production traffic so the log amount
> would be too much to gather at the current moment.
> 
> I've also gathered some tcpdump traffic to the hosts marked DOWN and
> strangely it seems that the hosts are receiving queries. It could be that
> one (or more) processes (I'm using nbproc 7 on my 8 core aws c3.2xlarge
> instance) haven't marked the host down. Trying to refresh the stats uri
> doesn't seem to indicate this, but it's hard to be sure as the
> probability of going through all seven different processes fast enough is low.
> 
> All clues and debugging ideas are greatly appreciated.
