On 04/09/2014 08:55 AM, Juho Mäkinen wrote:
> I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious
> problem where haproxy marks some backend servers as being DOWN with a
> message "L4TOUT in 2000ms". Sometimes the message also has a star: "*
> L4TOUT in 2000ms" (I didn't find what the star means from the docs).
> Also the reported timeout varies between 2000ms and 2003ms.
>

An L4TOUT status with httpchk enabled means that HAProxy failed to
establish a TCP connection within 2 seconds. Are you sure that you haven't
reached some limit on your backend servers, such as the number of open
files?

> This does not happen to every backend and it doesn't happen immediately.
> After restart every backend is green and a few backends start to get
> marked DOWN after about 30 minutes or so. I'm also running two instances
> on two different servers and they both suffer the same problem but the
> DOWN servers aren't the same. So server A might be marked DOWN on haproxy-1
> and server B marked down on haproxy-2 (or vice versa).
>
> This seems to happen regardless of how much traffic I run into the
> haproxies. I can always ssh into the haproxies and run curl against the
> check url and it always works, so this problem seems to be inside haproxy.
>

Are you sure that the backend servers return a response with HTTP status
200 on health checks?

> My haproxy config is kind of long so I copied it here:
> http://koti.kapsi.fi/garo/nobackup/haproxy.cfg (I've sanitised it a bit,
> but only hostnames).
>

You have only 1 stats server while you have 7 processes.
You need to enable a stats socket for each process. Here is an example
from a 24-process setup:

    stats socket /var/lib/haproxy/stats1 uid 0 gid 0 mode 0440 level admin process 1
    stats socket /var/lib/haproxy/stats2 uid 0 gid 0 mode 0440 level admin process 2
    stats socket /var/lib/haproxy/stats3 uid 0 gid 0 mode 0440 level admin process 3
    stats socket /var/lib/haproxy/stats4 uid 0 gid 0 mode 0440 level admin process 4
    stats socket /var/lib/haproxy/stats5 uid 0 gid 0 mode 0440 level admin process 5
    stats socket /var/lib/haproxy/stats6 uid 0 gid 0 mode 0440 level admin process 6
    stats socket /var/lib/haproxy/stats7 uid 0 gid 0 mode 0440 level admin process 7
    stats socket /var/lib/haproxy/stats8 uid 0 gid 0 mode 0440 level admin process 8
    stats socket /var/lib/haproxy/stats9 uid 0 gid 0 mode 0440 level admin process 9
    stats socket /var/lib/haproxy/stats10 uid 0 gid 0 mode 0440 level admin process 10
    stats socket /var/lib/haproxy/stats11 uid 0 gid 0 mode 0440 level admin process 11
    stats socket /var/lib/haproxy/stats12 uid 0 gid 0 mode 0440 level admin process 12
    stats socket /var/lib/haproxy/stats13 uid 0 gid 0 mode 0440 level admin process 13
    stats socket /var/lib/haproxy/stats14 uid 0 gid 0 mode 0440 level admin process 14
    stats socket /var/lib/haproxy/stats15 uid 0 gid 0 mode 0440 level admin process 15
    stats socket /var/lib/haproxy/stats16 uid 0 gid 0 mode 0440 level admin process 16
    stats socket /var/lib/haproxy/stats17 uid 0 gid 0 mode 0440 level admin process 17
    stats socket /var/lib/haproxy/stats18 uid 0 gid 0 mode 0440 level admin process 18
    stats socket /var/lib/haproxy/stats19 uid 0 gid 0 mode 0440 level admin process 19
    stats socket /var/lib/haproxy/stats20 uid 0 gid 0 mode 0440 level admin process 20
    stats socket /var/lib/haproxy/stats21 uid 0 gid 0 mode 0440 level admin process 21
    stats socket /var/lib/haproxy/stats22 uid 0 gid 0 mode 0440 level admin process 22
    stats socket /var/lib/haproxy/stats23 uid 0 gid 0 mode 0440 level admin process 23
    stats socket /var/lib/haproxy/stats24 uid 0 gid 0 mode 0440 level admin process 24
    nbproc 24
    cpu-map odd 0-5 12-17
    cpu-map even 6-11 18-23

    listen haproxy1
        bind :8081 process 1
        bind :8082 process 2
        bind :8083 process 3
        bind :8084 process 4
        bind :8085 process 5
        bind :8086 process 6
        bind :8087 process 7
        bind :8088 process 8
        bind :8089 process 9
        bind :8090 process 10
        bind :8091 process 11
        bind :8092 process 12
        bind :8093 process 13
        bind :8094 process 14
        bind :8095 process 15
        bind :8096 process 16
        bind :8097 process 17
        bind :8098 process 18
        bind :8099 process 19
        bind :8100 process 20
        bind :8101 process 21
        bind :8102 process 22
        bind :8103 process 23
        bind :8104 process 24
        stats uri /
        stats show-node
        stats refresh 10s
        stats show-legends

and then check all of them to find which process marks the server down.

> I've run the logging with verbose debugging to check if that gives any
> clues on the health check issue, but the logs did not reveal anything to
> my eye. I can however gather a new log sample on the health checks, but
> the haproxies are now receiving production traffic so the log amount
> would be too much to gather at the current moment.
>
> I've also gathered some tcpdump traffic to the hosts marked DOWN and
> strangely it seems that the hosts are receiving queries. It could be that
> one (or more) processes (I'm using nbprocs 7 on my 8 core aws c3.2xlarge
> instance) haven't marked the host down. Trying to refresh the stats uri
> doesn't seem to indicate this, but it's hard to be sure as the
> probability of going through all seven different processes fast enough is low.
>
> All clues and debugging ideas are greatly appreciated.
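
Once each process has its own stats socket, checking them one by one can be scripted. The following is only a rough sketch, not something tested against your setup. It assumes the sockets follow the /var/lib/haproxy/statsN naming from the example above, that it runs as a user allowed to open them, and that the "show stat" CSV output carries the pxname, svname, status and check_status columns. The function names are made up for illustration.

```python
import csv
import io
import socket


def query_stats(socket_path, command="show stat\n"):
    """Send one command to an HAProxy stats socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the connection is closed once the reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def down_servers(stat_csv):
    """Return (proxy, server, status, check_status) for servers reported DOWN."""
    text = stat_csv.lstrip()
    if text.startswith("# "):  # the CSV header line is prefixed with "# "
        text = text[2:]
    result = []
    for row in csv.DictReader(io.StringIO(text)):
        # Skip the aggregate FRONTEND/BACKEND rows, keep only real servers.
        if row.get("svname") in ("FRONTEND", "BACKEND"):
            continue
        status = row.get("status") or ""
        if status.startswith("DOWN"):
            result.append(
                (row.get("pxname"), row.get("svname"), status,
                 row.get("check_status") or "")
            )
    return result


def report(process_numbers, pattern="/var/lib/haproxy/stats{}"):
    """Print the DOWN servers seen by each process (path pattern is assumed)."""
    for number in process_numbers:
        path = pattern.format(number)
        try:
            rows = down_servers(query_stats(path))
        except OSError as exc:
            print("process {}: cannot query {}: {}".format(number, path, exc))
            continue
        for proxy, server, status, check in rows:
            print("process {}: {}/{} {} ({})".format(
                number, proxy, server, status, check))
```

Calling report(range(1, 8)) would cover a 7-process setup. Comparing the output across processes should show whether only some of them consider the server down.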