Monsieur Tarreau,

Actually, we are seeing frontend service availability flapping, particularly 
this morning. Missing from my snippet is the logic for an unplanned outage 
landing page, which our customers were seeing this morning. So haproxy truly 
is "timing out" and marking each backend as down until there are no backend 
servers left, at which point it throws up the unplanned outage landing page.
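
For reference, that fallback is wired up along these lines (a simplified 
sketch, not our exact config; the backend name and error-file path are 
illustrative):

    frontend webapp_ops_ft
        bind 10.0.40.209:80
        # once health checks have marked every server down, route to a
        # static outage page instead of the normal backend
        acl all_down nbsrv(webapp_ops_bk) lt 1
        use_backend outage_bk if all_down
        default_backend webapp_ops_bk

    backend outage_bk
        # no servers defined: haproxy answers with its own 503 page,
        # which we replace with the outage landing page
        errorfile 503 /etc/haproxy/errors/outage.http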

I'll send more logs and details once I've had a chance to analyze further.

Regards,
Kevin Lange

----
Kevin M Lange
Mission Operations and Services
NASA EOSDIS Evolution and Development
Intelligence and Information Systems
Raytheon Company

+1 (301) 851-8450 (office)
+1 (301) 807-2457 (cell)
kevin.m.la...@nasa.gov
kla...@raytheon.com

5700 Rivertech Court
Riverdale, Maryland 20737

----- Reply message -----
From: "Willy Tarreau" <w...@1wt.eu>
Date: Thu, May 24, 2012 5:18 pm
Subject: Problems with layer7 check timeout
To: "Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]" <kevin.m.la...@nasa.gov>
Cc: "haproxy@formilux.org" <haproxy@formilux.org>

Hi Kevin,

On Thu, May 24, 2012 at 04:04:03PM -0500, Lange, Kevin M. (GSFC-423.0)[RAYTHEON 
COMPANY] wrote:
> Hi,
> We're having odd behavior (apparently we always have, but didn't realize 
> it), where our backend httpchks "time out":
>
> May 24 04:03:33 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 04:41:55 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
> May 24 08:38:10 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 08:53:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:32:20 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:35:01 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:41:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:56:41 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 10:01:45 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>
>
> We've been playing with the timeout values, and we don't know what is 
> controlling the "Layer7 timeout, check duration: 1002ms".  The backend 
> service availability check (by hand) typically takes 2-3 seconds on average.
> Here is the relevant haproxy setup.
>
> #---------------------------------------------------------------------
> # Global settings
> #---------------------------------------------------------------------
> global
>     log-send-hostname opsslb1
>     log         127.0.0.1 local1 info
> #    chroot      /var/lib/haproxy
>     pidfile     /var/run/haproxy.pid
>     maxconn     1024
>     user        haproxy
>     group       haproxy
>     daemon
>
> #---------------------------------------------------------------------
> # common defaults that all the 'listen' and 'backend' sections will
> # use if not designated in their block
> #---------------------------------------------------------------------
> defaults
>     mode        http
>     log         global
>     option      dontlognull
>     option      httpclose
>     option      httplog
>     option      forwardfor
>     option      redispatch
>     timeout connect 500 # give up connecting to a backend after 500ms
>     timeout client 50000
>     timeout server 3600000
>     maxconn     60000
>     retries     3
>
> frontend webapp_ops_ft
>
>         bind 10.0.40.209:80
>         default_backend webapp_ops_bk
>
> backend webapp_ops_bk
>         balance roundrobin
>         option httpchk HEAD /app/availability
>         reqrep ^Host:.* Host:\ webapp.example.com
>         server webapp_ops1 opsapp1.ops.example.com:41000 check inter 30000
>         server webapp_ops2 opsapp2.ops.example.com:41000 check inter 30000
>         server webapp_ops3 opsapp3.ops.example.com:41000 check inter 30000
>         timeout check 15000
>         timeout connect 15000

This is quite strange. The timeout is defined first by "timeout check" or if
unset, by "inter". So in your case you should observe a 15sec timeout, not
one second.
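
To illustrate with made-up values, the two cases look like this:

    backend example_bk
        option httpchk HEAD /check
        # explicit check timeout: a check fails after 15s without a response
        timeout check 15000
        # "inter" sets the interval between checks (30s); it would also
        # serve as the check timeout if "timeout check" were unset
        server s1 10.0.0.1:80 check inter 30000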

What exact version is this? (haproxy -vv)

It looks like a bug; however, it could be a bug in the timeout handling as
well as in the reporting. I'd suspect the latter, since you're saying that
the service takes 2-3 seconds to respond and you don't seem to see errors
that often.

Regards,
Willy
