Busy little haproxy beaver today :) The docs under "retries" say that if a connection attempt fails, haproxy waits one second and then tries again. I was wondering how (if at all) that works in conjunction with "timeout connect", which is how long haproxy waits for a connection attempt to a backend server to succeed. Is the one-second delay between retries applied *after* the "timeout connect" number of seconds (after all, until "timeout connect" seconds have passed, the connection attempt hasn't actually failed)?
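To make the question concrete, take the relevant lines from my config below (the values are what I actually have; the arithmetic is just my guess at how it might work):

    retries 3
    timeout connect 5s

If a backend server never completes the connection, does each attempt run the full 5s of "timeout connect" before counting as a failure, and only then does the 1-second turn-around start before the next retry? Counting the initial attempt plus 3 retries, that would be up to roughly 4 x 5s + 3 x 1s before haproxy gives up, if I'm reading it right.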
I stumbled across "timeout check" today. I've noticed my backend servers tend to get flagged as DOWN a lot, especially when I first start or reload haproxy. Then usually, a few inter (or downinter) seconds later, it gets flagged as up. The backend server is definitely not down during that time. I suppose it's really not haproxy itself, but either my own health-check script and/or xinetd (which launches my health-check script) that might be causing a problem. i don't know why it's doing this. I do notice that whenever I do have a backend server flagged as down, and I do a ps to look around, there are a few instances of my health-check script running (or stalled or whatever.) After haproxy connects, it waits "timeout check" or "inter" time for a response before giving up and calling that a failure? But since it's launched from xinetd, even though haproxy might close the connection after "timeout check" (or "inter") amount of time, I think the health check script process continues to stick around until it's done. I was thinking I might try setting "fastinter 1s" and "timeout check 900" (milliseconds, I think, by default), and "fall 4". So if, for some reason, a check fails (my script, xinetd, backend server, etc "stalls"), then it'll only wait 900ms. Then it'll try again 1s later. I figure w/in (900ms + 1s) later, it might be ok and respond back properly (ignoring why it may have failed the first time.) Not the cleanest way, but if anyone has suggestions, I'd welcome them. I tried using 1.4dev5 rather than the stable 1.3.22. I noticed 1.4d5 shows more diagnostics. in my /var/log/messages. This is what I see when I do the -sf option. I also noticed it jumps a PID. 15286 is the old process. I run haproxy with -sf and it starts a new process 21905. The old one pauses the proxy, the new one starts the proxy, and then the old one finally stops. The new one, I guess, tries to bring up or checks the status of the backend servers of one farm, and thinks they're all down because of socket error. But then it changes PID to 21906, and starts checking the backend servers of another farm. From there, it stays running as this new PID. Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPFarm. Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPSFarm. Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPFarm started. Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPSFarm started. Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPFarm in 0 ms. Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPSFarm in 0 ms. Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPFarm stopped. Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPSFarm stopped. Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN, reason: Socket error, check duration: 46ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue. Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN, reason: Socket error, check duration: 41ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue. Jan 6 09:37:41 lbtest1 haproxy[21905]: proxy LDAPFarm has no server available! Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is DOWN, reason: Socket error, check duration: 277ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue. Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is DOWN, reason: Socket error, check duration: 407ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue. 
On a separate note, I tried 1.4dev5 rather than the stable 1.3.22 and noticed 1.4dev5 logs more diagnostics in my /var/log/messages. This is what I see when I reload with the -sf option. I also noticed the PID jumps: 15286 is the old process; I run haproxy with -sf and it starts a new process, 21905. The old one pauses the proxies, the new one starts them, and then the old one finally stops. The new one, I guess, checks the status of the backend servers of one farm and thinks they're all down because of a socket error. But then the PID changes to 21906, which starts checking the backend servers of the other farm, and from there it stays running under this new PID.

Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPFarm.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPSFarm.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPFarm started.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPSFarm started.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPFarm in 0 ms.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPSFarm in 0 ms.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPFarm stopped.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPSFarm stopped.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN, reason: Socket error, check duration: 46ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN, reason: Socket error, check duration: 41ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:41 lbtest1 haproxy[21905]: proxy LDAPFarm has no server available!
Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is DOWN, reason: Socket error, check duration: 277ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is DOWN, reason: Socket error, check duration: 407ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:42 lbtest1 haproxy[21906]: proxy LDAPSFarm has no server available!
Jan 6 09:37:47 lbtest1 haproxy[21906]: Server LDAPFarm/LDAP2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 354ms. 1 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.
Jan 6 09:37:49 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 595ms. 1 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.
Jan 6 09:37:49 lbtest1 haproxy[21906]: Server LDAPFarm/LDAP1 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 572ms. 2 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.
Jan 6 09:37:49 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 595ms. 2 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.

My config file is:

global
    log /dev/log local0
    user daemon
    group daemon
    pidfile /var/run/haproxy.pid
    spread-checks 20

defaults
    mode tcp
    log global
    option tcplog
    option dontlognull
    # option dontlog-normal
    option redispatch
    retries 3
    maxconn 4096

listen LDAPFarm AAAA::389
    mode tcp
    option tcplog
    option httpchk
    balance roundrobin
    timeout connect 5s
    timeout client 5s
    timeout server 5s
    timeout check 900ms
    server LDAP1 BBBB:389 check addr localhost port 9101 inter 5s fastinter 1s downinter 5s fall 2 rise 2
    server LDAP2 CCCC:389 check addr localhost port 9102 inter 5s fastinter 1s downinter 5s fall 2 rise 2

listen LDAPSFarm AAAA:636
    mode tcp
    option tcplog
    option httpchk
    balance roundrobin
    timeout connect 5s
    timeout client 5s
    timeout server 5s
    timeout check 900ms
    server LDAPS1 BBBB:636 check addr localhost port 9201 inter 5s fastinter 1s downinter 5s fall 2 rise 2
    server LDAPS2 BBBB:636 check addr localhost port 9202 inter 5s fastinter 1s downinter 5s fall 2 rise 2

Thank you,
PH