I'm having some trouble interpreting an HAProxy log file line - it doesn't seem
to make much sense to me, so I'm fairly certain that I'm reading it
incorrectly. My full configuration file is at the end of this email.
Symptomatically, I've been seeing situations in which people are briefly
redirected unexpectedly onto a different 'gui' backend server. It's rare, but
when it happens it often appears at the same point in the GUI use flow. This
may be a red herring, however.
This morning I was able to reproduce it even after restarting our appserver
processes and our haproxy process (which had been up for an extended period of
time - thanks!). When it reproduced, I noticed that the backend appserver was
marked as DOWN in the statistics report even though the logfiles gave no
indication that there were any problems. From the user's experience, the system
"hung" waiting for a response for some 40 seconds and then showed a login
screen, which was served when the "incorrect" backend received a page request.
Simply re-requesting the previous page was successful, as the previously
authenticated backend was shown as UP by that point.
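To catch the moment a server flips to DOWN, one option would be to poll the admin socket declared in the configuration below. The following is only a sketch: it assumes the socket path `/tmp/haproxy` from that configuration, the `show stat` command, and the CSV output with `pxname`, `svname` and `status` columns. The function names are made up for illustration.

```python
import csv
import io
import socket

# Assumption: path taken from the 'stats socket' line in the configuration.
STATS_SOCKET = "/tmp/haproxy"


def query_stats(path=STATS_SOCKET):
    """Send 'show stat' to the HAProxy stats socket and return the raw CSV text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(b"show stat\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the connection after answering
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", "replace")


def server_states(csv_text):
    """Map (proxy name, server name) to the reported status string."""
    # The header line starts with '# ', which is stripped before parsing.
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("# ")))
    return {(row["pxname"], row["svname"]): row["status"] for row in reader}


if __name__ == "__main__":
    for (proxy, server), status in sorted(server_states(query_stats()).items()):
        print(proxy, server, status)
```

Run from cron or a loop with a timestamp added, this could show whether gui2 is reported as DOWN before or after the slow request is logged.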
In the logfile we see the following entry at the appropriate time:
Jul 26 06:21:09 localhost haproxy[3853]: 172.25.200.177:3124 [26/Jul/2012:06:21:01.826] secure gui/gui2 4993/0/0/-1/7992 -1 0 - - CH-- 40/40/1/1/0 0/0 "HEAD /panel/healthcheck.html HTTP/1.1"
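For reference, here is a minimal sketch of how the fields in that entry could be split out. It assumes the standard `option httplog` layout with no captured headers. The field names (Tq, Tw, Tc, Tr, Tt and so on) follow the HAProxy documentation, while the regular expression and the function name are only illustrative.

```python
import re

# The log entry from above, joined onto one line.
LINE = (
    'Jul 26 06:21:09 localhost haproxy[3853]: 172.25.200.177:3124 '
    '[26/Jul/2012:06:21:01.826] secure gui/gui2 4993/0/0/-1/7992 -1 0 - - CH-- '
    '40/40/1/1/0 0/0 "HEAD /panel/healthcheck.html HTTP/1.1"'
)

# Assumption: default httplog format, no 'capture' directives in use.
PATTERN = re.compile(
    r'haproxy\[(?P<pid>\d+)\]: '
    r'(?P<client>\S+) \[(?P<accept_date>[^\]]+)\] '
    r'(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<Tq>-?\d+)/(?P<Tw>-?\d+)/(?P<Tc>-?\d+)/(?P<Tr>-?\d+)/(?P<Tt>\+?\d+) '
    r'(?P<status>-?\d+) (?P<bytes>\+?\d+) '
    r'(?P<req_cookie>\S+) (?P<resp_cookie>\S+) (?P<term_state>\S{4}) '
    r'(?P<actconn>\d+)/(?P<feconn>\d+)/(?P<beconn>\d+)/(?P<srv_conn>\d+)/'
    r'(?P<retries>\+?\d+) '
    r'(?P<srv_queue>\d+)/(?P<backend_queue>\d+) '
    r'"(?P<request>[^"]*)"'
)


def parse_httplog(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for name, value in parse_httplog(LINE).items():
        print(f"{name:14} {value}")
```

Laid out this way, the entry shows Tr and the status code both at -1 and a termination state of CH--.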
This file - while similar to the HAProxy healthcheck URL (and in fact produced
the same way on the backend server) - is actually the one being requested by
our hardware LBs (we use existing hardware LBs to distribute between machines,
and a single haproxy daemon on each box to distribute to appservers within each
machine). It appears that this request hit some sort of timeout, but I'm not sure why
that would have caused haproxy to mark the backend as DOWN. We add backend
server markers to our generated pages, so I'm confident that the hardware LB
kept us on the same physical system the entire time.
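As a rough way of reading the timers and the termination flags in the log line above, the sketch below decodes the first two characters of `CH--` and rebuilds the timeline. The flag descriptions are paraphrased from the HAProxy documentation and cover only a subset of the codes. The arithmetic assumes Tt spans the whole session from accept to close, so treat the result as an interpretation rather than a diagnosis.

```python
from datetime import datetime, timedelta

# Values copied from the log line above.
ACCEPT_DATE = "26/Jul/2012:06:21:01.826"
TQ, TW, TC, TR, TT = 4993, 0, 0, -1, 7992
TERM_STATE = "CH--"

# First flag: which event ended the session (subset of documented codes).
FIRST_EVENT = {
    "C": "client aborted the TCP session unexpectedly",
    "S": "server aborted or refused the TCP session",
    "c": "client-side timeout expired",
    "s": "server-side timeout expired",
    "-": "normal completion",
}

# Second flag: what the proxy was doing at that moment (subset).
SESSION_STATE = {
    "R": "waiting for a complete request from the client",
    "Q": "waiting in the queue for a connection slot",
    "C": "waiting for the connection to the server to establish",
    "H": "waiting for complete response headers from the server",
    "D": "transferring data",
    "-": "normal completion",
}


def describe(term_state):
    """Return (event, state) descriptions for the first two flag characters."""
    return (
        FIRST_EVENT.get(term_state[0], "unknown"),
        SESSION_STATE.get(term_state[1], "unknown"),
    )


accept = datetime.strptime(ACCEPT_DATE, "%d/%b/%Y:%H:%M:%S.%f")
closed = accept + timedelta(milliseconds=TT)
# Tr is -1 (no response headers seen), so derive the wait from the other timers.
waited_on_server_ms = TT - TQ - TW - TC

if __name__ == "__main__":
    print(describe(TERM_STATE))
    print("accepted:", accept.time(), "closed:", closed.time())
    print("ms spent waiting for response headers:", waited_on_server_ms)
```

Under those assumptions, the entry would describe the hardware LB giving up on its own probe while gui2 had not yet answered, which is a separate event from HAProxy's own `check` requests against `/panel/healthcheck.txt`.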
Needless to say I'm a little confused as to what I'm seeing. If anyone can
shed a little more light on the subject it would be very much appreciated!
Other than this recurrent issue, everything's been working incredibly well.
Configuration file below in full in case there's something odd:
global
    maxconn 20000
    tune.bufsize 128000
    tune.maxrewrite 8192
    ulimit-n 60000
    stats socket /tmp/haproxy mode 0777 level admin
    log 127.0.0.1 local0

defaults
    mode http
    timeout connect 5s
    timeout client 1m
    timeout server 10m
    option srvtcpka
    option http-server-close
    option http-pretend-keepalive
    option redispatch
    option abortonclose
    option httpchk HEAD /panel/healthcheck.txt HTTP/1.0

backend gui
    balance source
    appsession JSESSIONID len 64 timeout 3h
    server gui1 172.25.200.53:8080 check maxconn 2000
    server gui2 172.25.200.54:8080 check maxconn 2000

backend platform
    balance roundrobin
    server platform1 172.25.200.50:8080 check maxconn 2000
    server platform2 172.25.200.51:8080 check maxconn 2000
    server platform3 172.25.200.52:8080 check maxconn 2000

frontend secure
    bind 172.25.200.70:8080
    maxconn 20000
    option forwardfor
    option clitcpka
    acl javascript url_beg /js
    acl widgets url_beg /widgets
    acl reports url_beg /datafeed
    acl deny url_beg /server-status /balancer-manager /jmx-console /web-console /invoker
    block if deny
    default_backend gui
    use_backend platform if javascript or widgets or reports
    log global
    option httplog
    option dontlognull
    option dontlog-normal

listen stats :7583
    mode http
    stats uri /