Monsieur Tarreau,

Actually, we are seeing frontend service availability flapping, particularly this morning. Missing from my snippet is the logic for an unplanned-outage landing page, which our customers were seeing this morning. So haproxy truly is "timing out" and marking each backend as down until no backend servers are left available, at which point the unplanned-outage landing page is served.
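(For context, the outage page is wired up roughly like this; this is a simplified sketch with a placeholder error-file path, not our exact config:)

```haproxy
backend webapp_ops_bk
    balance roundrobin
    option httpchk HEAD /app/availability
    server webapp_ops1 opsapp1.ops.example.com:41000 check inter 30000
    # When every checked server is marked DOWN, haproxy answers with a 503;
    # a static outage page can be substituted for the default 503 body
    # (the path below is a placeholder):
    errorfile 503 /etc/haproxy/errors/outage.http
```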
I'll send more logs and details when I analyze them later.

Regards,
Kevin Lange

----
Kevin M Lange
Mission Operations and Services
NASA EOSDIS Evolution and Development
Intelligence and Information Systems
Raytheon Company
+1 (301) 851-8450 (office)
+1 (301) 807-2457 (cell)
kevin.m.la...@nasa.gov
kla...@raytheon.com
5700 Rivertech Court
Riverdale, Maryland 20737

----- Reply message -----
From: "Willy Tarreau" <w...@1wt.eu>
Date: Thu, May 24, 2012 5:18 pm
Subject: Problems with layer7 check timeout
To: "Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]" <kevin.m.la...@nasa.gov>
Cc: "haproxy@formilux.org" <haproxy@formilux.org>

Hi Kevin,

On Thu, May 24, 2012 at 04:04:03PM -0500, Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY] wrote:
> Hi,
> We're having odd behavior (apparently have always but didn't realize it),
> where our backend httpchks "time out":
>
> May 24 04:03:33 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 04:41:55 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
> May 24 08:38:10 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 08:53:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:32:20 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:35:01 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:41:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:56:41 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 10:01:45 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>
> We've been playing with the timeout values, and we don't know what is
> controlling the "Layer7 timeout, check duration: 1002ms". The backend
> service availability check (by hand) typically takes 2-3 seconds on average.
> Here is the relevant haproxy setup.
>
> #---------------------------------------------------------------------
> # Global settings
> #---------------------------------------------------------------------
> global
>     log-send-hostname opsslb1
>     log 127.0.0.1 local1 info
>     # chroot /var/lib/haproxy
>     pidfile /var/run/haproxy.pid
>     maxconn 1024
>     user haproxy
>     group haproxy
>     daemon
>
> #---------------------------------------------------------------------
> # common defaults that all the 'listen' and 'backend' sections will
> # use if not designated in their block
> #---------------------------------------------------------------------
> defaults
>     mode http
>     log global
>     option dontlognull
>     option httpclose
>     option httplog
>     option forwardfor
>     option redispatch
>     timeout connect 500 # default 10 second time out if a backend is not found
>     timeout client 50000
>     timeout server 3600000
>     maxconn 60000
>     retries 3
>
> frontend webapp_ops_ft
>     bind 10.0.40.209:80
>     default_backend webapp_ops_bk
>
> backend webapp_ops_bk
>     balance roundrobin
>     option httpchk HEAD /app/availability
>     reqrep ^Host:.* Host:\ webapp.example.com
>     server webapp_ops1 opsapp1.ops.example.com:41000 check inter 30000
>     server webapp_ops2 opsapp2.ops.example.com:41000 check inter 30000
>     server webapp_ops3 opsapp3.ops.example.com:41000 check inter 30000
>     timeout check 15000
>     timeout connect 15000

This is quite strange. The timeout is defined first by "timeout check" or,
if unset, by "inter". So in your case you should observe a 15-second
timeout, not one second.

What exact version is this? (haproxy -vv)

It looks like a bug, however it could be a bug in the timeout handling as
well as in the reporting. I'd suspect the latter, since you're saying that
the service takes 2-3 sec to respond and you don't seem to see errors
that often.

Regards,
Willy
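For reference, the precedence Willy describes (the check timeout comes from "timeout check" when set, otherwise from the server's "inter") can be sketched in a minimal config with hypothetical names and addresses:

```haproxy
backend sketch_bk
    option httpchk HEAD /app/availability
    # With "timeout check" set, the health-check response must arrive
    # within 15 seconds. Without this line, the 30-second "inter"
    # interval below would bound the check instead.
    timeout check 15000
    server s1 10.0.0.1:41000 check inter 30000
```

Under that rule, a one-second "Layer7 timeout" should be impossible with this configuration, which is why Willy suspects a bug in either the timeout handling or the reporting.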