Busy little haproxy beaver today :) The docs under "retries" say that if a connection attempt fails, haproxy waits one second and then tries again. I was wondering how (if at all) that works in conjunction with "timeout connect", which is how long haproxy waits for a connection attempt to a backend server to succeed. Is the one-second delay between retries applied *after* the "timeout connect" number of seconds (after all, until "timeout connect" seconds have passed, the connection attempt hasn't actually failed)?
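To make the question concrete, take the relevant lines from my config below (the values are what I actually have; the arithmetic is just my guess at how it might work):

    retries 3
    timeout connect 5s

If a backend server never completes the connection, does each attempt run the full 5s of "timeout connect" before counting as a failure, and only then does the 1-second turn-around start before the next retry? Counting the initial attempt plus 3 retries, that would be up to roughly 4 x 5s + 3 x 1s before haproxy gives up, if I'm reading it right.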
I stumbled across "timeout check" today. I've noticed my backend servers tend to get flagged as DOWN a lot, especially when I first start or reload haproxy. Then usually, a few inter (or downinter) seconds later, it gets flagged as up. The backend server is definitely not down during that time. I suppose it's really not haproxy itself, but either my own health-check script and/or xinetd (which launches my health-check script) that might be causing a problem. i don't know why it's doing this. I do notice that whenever I do have a backend server flagged as down, and I do a ps to look around, there are a few instances of my health-check script running (or stalled or whatever.) After haproxy connects, it waits "timeout check" or "inter" time for a response before giving up and calling that a failure? But since it's launched from xinetd, even though haproxy might close the connection after "timeout check" (or "inter") amount of time, I think the health check script process continues to stick around until it's done. I was thinking I might try setting "fastinter 1s" and "timeout check 900" (milliseconds, I think, by default), and "fall 4". So if, for some reason, a check fails (my script, xinetd, backend server, etc "stalls"), then it'll only wait 900ms. Then it'll try again 1s later. I figure w/in (900ms + 1s) later, it might be ok and respond back properly (ignoring why it may have failed the first time.) Not the cleanest way, but if anyone has suggestions, I'd welcome them. I tried using 1.4dev5 rather than the stable 1.3.22. I noticed 1.4d5 shows more diagnostics. in my /var/log/messages. This is what I see when I do the -sf option. I also noticed it jumps a PID. 15286 is the old process. I run haproxy with -sf and it starts a new process 21905. The old one pauses the proxy, the new one starts the proxy, and then the old one finally stops. The new one, I guess, tries to bring up or checks the status of the backend servers of one farm, and thinks they're all down because of socket error. But then it changes PID to 21906, and starts checking the backend servers of another farm. From there, it stays running as this new PID. Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPFarm. Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPSFarm. Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPFarm started. Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPSFarm started. Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPFarm in 0 ms. Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPSFarm in 0 ms. Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPFarm stopped. Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPSFarm stopped. Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN, reason: Socket error, check duration: 46ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue. Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN, reason: Socket error, check duration: 41ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue. Jan 6 09:37:41 lbtest1 haproxy[21905]: proxy LDAPFarm has no server available! Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is DOWN, reason: Socket error, check duration: 277ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue. Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is DOWN, reason: Socket error, check duration: 407ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue. 
On a separate note, I tried 1.4dev5 rather than the stable 1.3.22 and noticed 1.4dev5 logs more diagnostics in my /var/log/messages. This is what I see when I reload with the -sf option. I also noticed the PID jumps: 15286 is the old process; I run haproxy with -sf and it starts a new process, 21905. The old one pauses the proxies, the new one starts them, and then the old one finally stops. The new one, I guess, checks the status of the backend servers of one farm and thinks they're all down because of a socket error. But then the PID changes to 21906, which starts checking the backend servers of the other farm, and from there it stays running under this new PID.

Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPFarm.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPSFarm.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPFarm started.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPSFarm started.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPFarm in 0 ms.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPSFarm in 0 ms.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPFarm stopped.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPSFarm stopped.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN, reason: Socket error, check duration: 46ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN, reason: Socket error, check duration: 41ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:41 lbtest1 haproxy[21905]: proxy LDAPFarm has no server available!
Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is DOWN, reason: Socket error, check duration: 277ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is DOWN, reason: Socket error, check duration: 407ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:42 lbtest1 haproxy[21906]: proxy LDAPSFarm has no server available!
Jan 6 09:37:47 lbtest1 haproxy[21906]: Server LDAPFarm/LDAP2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 354ms. 1 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.
Jan 6 09:37:49 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 595ms. 1 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.
Jan 6 09:37:49 lbtest1 haproxy[21906]: Server LDAPFarm/LDAP1 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 572ms. 2 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.
Jan 6 09:37:49 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 595ms. 2 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.

My config file is:

global
    log /dev/log local0
    user daemon
    group daemon
    pidfile /var/run/haproxy.pid
    spread-checks 20

defaults
    mode tcp
    log global
    option tcplog
    option dontlognull
    # option dontlog-normal
    option redispatch
    retries 3
    maxconn 4096

listen LDAPFarm AAAA::389
    mode tcp
    option tcplog
    option httpchk
    balance roundrobin
    timeout connect 5s
    timeout client 5s
    timeout server 5s
    timeout check 900ms
    server LDAP1 BBBB:389 check addr localhost port 9101 inter 5s fastinter 1s downinter 5s fall 2 rise 2
    server LDAP2 CCCC:389 check addr localhost port 9102 inter 5s fastinter 1s downinter 5s fall 2 rise 2

listen LDAPSFarm AAAA:636
    mode tcp
    option tcplog
    option httpchk
    balance roundrobin
    timeout connect 5s
    timeout client 5s
    timeout server 5s
    timeout check 900ms
    server LDAPS1 BBBB:636 check addr localhost port 9201 inter 5s fastinter 1s downinter 5s fall 2 rise 2
    server LDAPS2 BBBB:636 check addr localhost port 9202 inter 5s fastinter 1s downinter 5s fall 2 rise 2

Thank you,
PH