Re: [ANNOUNCE] haproxy 1.4-dev5 with keep-alive :-) memory consumption

2010-01-06 Thread Hank A. Paulson
Definitely haproxy process, nothing else runs on there and the older version 
remains stable for days/weeks:


F S UIDPID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY  TIME CMD
1 S nobody   15547 1 18  80   0 - 1026097 epoll_ 10:54 ?  00:54:30 
/usr/sbin/haproxy14d5 -D -f /etc/haproxy/haproxyka.cfg -p /var/run/haproxy.pid 
-sf 15536
1 S nobody   20631 1 29  80   0 - 17843 epoll_ 13:48 ?00:33:37 
/usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 15547


On 1/5/10 11:10 PM, Willy Tarreau wrote:

On Tue, Jan 05, 2010 at 11:00:30PM -0800, Hank A. Paulson wrote:

Using git 034550b7420c24625a975f023797d30a14b80830
[BUG] stats: show UP/DOWN status also in tracking servers 6 hours ago...

I am still seeing continuous memory consumption (about 1+ GB/hr) at 50-60
Mbps even after the number of connections has stabilized:


OK. Is this memory used by the haproxy process itself ?
If so, could you please send me your exact configuration so that
I may have a chance to spot something in the code related to what
you use ? A memory leak is something very unlikely in haproxy, though
it's not impossible. Everything works with pools which are released
when the session closes. But maybe something in this area escaped
from my radar (eg: header captures in keep-alive, etc...).


  69 CLOSE_WAIT
   9 CLOSING
4807 ESTABLISHED
  35 FIN_WAIT1
   4 FIN_WAIT2
 255 LAST_ACK
  10 LISTEN
3410 SYN_RECV


This one is really impressive. 3410 SYN_RECV basically means you're
under a SYN flood, or your network stack is not correctly tuned and
you're slowing down your users a lot because they need to wait 3s
before retransmitting.

Regards,
Willy


Thanks, we pride ourselves on our huge SYN queue...   :)
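
For reference, the listen queue on the haproxy side can be enlarged with the
backlog keyword; this is only a minimal sketch (assuming the backlog keyword is
available in this version), with placeholder address and values, and kernel-side
settings such as tcp_max_syn_backlog still have to be tuned separately:

listen public 0.0.0.0:80
  mode http
  maxconn 20000     # per-proxy connection limit (placeholder value)
  backlog 10000     # hint for the listen socket's SYN backlog (placeholder value)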



Re: [PATCH 4/5] [MEDIUM] default-server support

2010-01-06 Thread Krzysztof Olędzki

On 2010-01-06 00:44, Willy Tarreau wrote:

Hi Krzysztof,

Hi Willy,


I've merged all of your patches.


Thanks.


However I have a minor concern
about something in this one :


-The currently supported settings are the following ones.
+The currently supported settings are the following ones, the ones marked with
+[D] are also supported for default-server.


(...)


-error-limit count
+[D] error-limit count


(...)


-fall count
+[D] fall count


etc...

While I understand the reason you have added this tag, it breaks the
ability to search for a keyword at the beginning of a line using "^error-limit".
Maybe we should put the tag at the end of the line (77th to 79th chars),
or maybe we should simply state that everything is supported in default-server
except where explicitly stated otherwise, and just add a one-line note
for those keywords not supported in default-server?


How about adding "Supported in default-server: Yes/No" to each keyword?


If you get a nice idea on this subject, feel free to send a patch for it.
If you do so, please also add the following hunk that I wanted to fix but
forgot at the last minute :

diff --git a/doc/configuration.txt b/doc/configuration.txt
index eaec73b..c6cfa98 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -724,7 +724,7 @@ capture response header -  X X -
 clitimeout  X  X X -  (deprecated)
 contimeout  X  - X X  (deprecated)
 cookie  X  - X X
-default-server  X  - X -
+default-server  X  - X X
 default_backend -  X X -
 description -  X X X
 disabledX  X X X


Sure. I have no idea why I did it wrong. :|

Best regards,

Krzysztof Olędzki




[PATCH] [BUG] stats: cookie should be reported under backend not under proxy

2010-01-06 Thread Krzysztof Piotr Oledzki
From 7046fe6b5245deb06f760896fa7e10c2163eda60 Mon Sep 17 00:00:00 2001
From: Krzysztof Piotr Oledzki o...@ans.pl
Date: Wed, 6 Jan 2010 15:03:18 +0100
Subject: [BUG] stats: cookie should be reported under backend not under proxy

---
 src/dumpstats.c |   29 -
 1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/src/dumpstats.c b/src/dumpstats.c
index a9422e2..495e280 100644
--- a/src/dumpstats.c
+++ b/src/dumpstats.c
@@ -1239,17 +1239,6 @@ int stats_dump_proxy(struct session *s, struct proxy *px, struct uri_auth *uri)
 		     proxy_cap_str(px->cap), proxy_mode_str(px->mode),
 		     px->uuid);
 
-		/* cookie */
-		if (px->cookie_name) {
-			struct chunk src;
-
-			chunk_printf(&msg, ", cookie: '");
-			chunk_initlen(&src, px->cookie_name, 0, strlen(px->cookie_name));
-			chunk_htmlencode(&msg, &src);
-
-			chunk_printf(&msg, "'");
-		}
-
 		chunk_printf(&msg, "\"");
 	}
 
@@ -1897,9 +1886,23 @@ int stats_dump_proxy(struct session *s, struct proxy *px, struct uri_auth *uri)
 
 		if (uri->flags & ST_SHLGNDS) {
 			/* balancing */
-
-			chunk_printf(&msg, " title=\"balancing: %s\"",
+			chunk_printf(&msg, " title=\"balancing: %s",
 				     backend_lb_algo_str(px->lbprm.algo & BE_LB_ALGO));
+
+			/* cookie */
+			if (px->cookie_name) {
+				struct chunk src;
+
+				chunk_printf(&msg, ", cookie: '");
+
+				chunk_initlen(&src, px->cookie_name, 0, strlen(px->cookie_name));
+				chunk_htmlencode(&msg, &src);
+
+				chunk_printf(&msg, "'");
+			}
+
+			chunk_printf(&msg, "\"");
+
 		}
 
 		chunk_printf(&msg,
-- 
1.6.4.2




[PATCH] [BUG] cfgparser/stats: fix error message

2010-01-06 Thread Krzysztof Piotr Oledzki
From 0d95cf9f607de59487c644af3077be1a84eb4b81 Mon Sep 17 00:00:00 2001
From: Krzysztof Piotr Oledzki o...@ans.pl
Date: Wed, 6 Jan 2010 16:25:05 +0100
Subject: [BUG] cfgparser/stats: fix error message

Fix the error message by unification and a goto; previously we had
two independent lists of supported keywords and were reporting 'stats'
instead of the wrong keyword.

Code:
 stats wrong-keyword
 stats

Before:
 [ALERT] 005/163032 (27175) : parsing [haproxy.cfg:248] : unknown stats 
parameter 'stats' (expects 'hide-version', 'uri', 'realm', 'auth' or 'enable').
 [ALERT] 005/163032 (27175) : parsing [haproxy.cfg:249] : 'stats' expects 
'uri', 'realm', 'auth', 'scope' or 'enable', 'hide-version', 'show-node', 
'show-desc', 'show-legends'.

After:
 [ALERT] 005/162841 (22710) : parsing [haproxy.cfg:248]: unknown stats 
parameter 'wrong-keyword', expects 'uri', 'realm', 'auth', 'scope', 'enable', 
'hide-version', 'show-node', 'show-desc' or 'show-legends'.
 [ALERT] 005/162841 (22710) : parsing [haproxy.cfg:249]: missing keyword in 
'stats', expects 'uri', 'realm', 'auth', 'scope', 'enable', 'hide-version', 
'show-node', 'show-desc' or 'show-legends'.
---
 src/cfgparse.c |   11 +--
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/src/cfgparse.c b/src/cfgparse.c
index c0d6dfe..f3cfc61 100644
--- a/src/cfgparse.c
+++ b/src/cfgparse.c
@@ -1978,10 +1978,8 @@ int cfg_parse_listen(const char *file, int linenum, char **args, int kwm)
 		if (curproxy != &defproxy && curproxy->uri_auth == defproxy.uri_auth)
 			curproxy->uri_auth = NULL; /* we must detach from the default config */
 
-		if (*(args[1]) == 0) {
-			Alert("parsing [%s:%d] : '%s' expects 'uri', 'realm', 'auth', 'scope' or 'enable', 'hide-version', 'show-node', 'show-desc', 'show-legends'.\n", file, linenum, args[0]);
-			err_code |= ERR_ALERT | ERR_FATAL;
-			goto out;
+		if (!*args[1]) {
+			goto stats_error_parsing;
 		} else if (!strcmp(args[1], "uri")) {
 			if (*(args[2]) == 0) {
 				Alert("parsing [%s:%d] : 'uri' needs an URI prefix.\n", file, linenum);
@@ -2110,8 +2108,9 @@ int cfg_parse_listen(const char *file, int linenum, char **args, int kwm)
 				free(desc);
 			}
 		} else {
-			Alert("parsing [%s:%d] : unknown stats parameter '%s' (expects 'hide-version', 'uri', 'realm', 'auth' or 'enable').\n",
-			      file, linenum, args[0]);
+stats_error_parsing:
+			Alert("parsing [%s:%d]: %s '%s', expects 'uri', 'realm', 'auth', 'scope', 'enable', 'hide-version', 'show-node', 'show-desc' or 'show-legends'.\n",
+			      file, linenum, *args[1]?"unknown stats parameter":"missing keyword in", args[*args[1]?1:0]);
 			err_code |= ERR_ALERT | ERR_FATAL;
 			goto out;
 		}
-- 
1.6.4.2




Fwd: Per-Server arguments for httpchk health-checking

2010-01-06 Thread Paul Hirose
I sent this out yesterday (1/5/2010) but didn't see it come back to me
on the list, nor do I see it in the mailing list archives at
http://www.formilux.org/archives/haproxy/1001/date.html so I'm trying
again.  The other msg I sent yesterday (about Retries/Option Dispatch)
did make it to the list (it's in the archive), although I didn't see
that come to me in my email (sigh).

I do occasionally get notices from the list manager about how an email
to me bounced.  If the list manager could send me a copy of what I
bounced back, that'd be great, so I can track it down.

Thank you,
PH


-- Forwarded message --
From: Paul Hirose paulhir...@gmail.com
Date: Tue, Jan 5, 2010 at 10:42 AM
Subject: Per-Server arguments for httpchk health-checking
To: haproxy@formilux.org


Is there a way to have haproxy send a per-server argument during a
health-check?  Right now, I do "server A check addr localhost 9000"
and "server B check addr localhost 9001" and so on, and have xinetd
monitor 9000/TCP and 9001/TCP.  When haproxy connects to those, xinetd
in turn runs a health-check script I wrote that actually does the
checking.  When it is done checking, the script returns either HTTP 200
or HTTP 500 depending on whether the check passed.

This means I end up using one port per back-end server.  If I have 10
servers that haproxy spreads the load across, I have "server ... 9002"
and "server ... 9003" and so on through 9009.  I then have 10
different copies of my little script, each of which just connects to a
different server (A, B, C...) and does the check.

If haproxy could do something like "server A check addr localhost 9000
argument A", I could have only one port watched by xinetd, and only
one copy of the script, which would simply accept A as an argument
naming the server it should check.

I use "option httpchk" with the above (all in the same listen group),
but I wasn't sure if I could change "option httpchk" for every server.
Could I do:
listen farm address:port
  balance roundrobin
  mode tcp
  option httpchk GET 1
  server A 1.1.1.1:389 check addr localhost port 9000 inter 5s
fastinter 1s downinter 120s
  option httpchk GET 2
  server B 1.1.1.2:389 check addr localhost port 9000 inter 5s
fastinter 1s downinter 120s

Would that do a "GET 1" when it runs an HTTP health check for server A
and a "GET 2" when doing a health check for server B?  These are
LDAP servers on the back-end, btw.
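
As far as I can tell, "option httpchk" is a per-proxy setting, so the second
"option httpchk" line would most likely just override the first one for the
whole listen section; roughly, the effective configuration would be the
following sketch (addresses are placeholders):

listen farm address:port
  balance roundrobin
  mode tcp
  option httpchk GET 2        # only one httpchk request applies to the whole proxy
  server A 1.1.1.1:389 check addr localhost port 9000 inter 5s fastinter 1s downinter 120s
  server B 1.1.1.2:389 check addr localhost port 9000 inter 5s fastinter 1s downinter 120s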

I can't put the health-check script on the back-end server itself.
And they wanted to make sure the health-check passed from the actual
haproxy load-balancer talking to the back-end server, rather than the
back-end server talking to itself.

Thank you,
PH



Backend servers flagged as DOWN a lot & timeout check/connect

2010-01-06 Thread Paul Hirose
Busy little haproxy beaver today :)

The docs under "retries" say that if a connection attempt fails, haproxy
waits one second and then tries again.  I was wondering how (if at all)
that works in conjunction with "timeout connect", which is how long
haproxy waits while trying to connect to a backend server.  Does the one-
second delay between retries start *after* the "timeout connect" number of
seconds (after all, until that much time has passed, the connection
attempt hasn't failed)?

I stumbled across timeout check today.  I've noticed my backend
servers tend to get flagged as DOWN a lot, especially when I first
start or reload haproxy.  Then usually, a few inter (or downinter)
seconds later, it gets flagged as up.  The backend server is
definitely not down during that time.  I suppose it's really not
haproxy itself, but either my own health-check script and/or xinetd
(which launches my health-check script) that might be causing a
problem.

I don't know why it's doing this.  I do notice that whenever I do have
a backend server flagged as down, and I do a ps to look around, there
are a few instances of my health-check script running (or stalled or
whatever).  After haproxy connects, does it wait "timeout check" or
"inter" time for a response before giving up and calling that a
failure?  But since it's launched from xinetd, even though haproxy
might close the connection after "timeout check" (or "inter") amount
of time, I think the health-check script process continues to stick
around until it's done.

I was thinking I might try setting "fastinter 1s" and "timeout check
900" (milliseconds, I think, by default), and "fall 4".  So if, for
some reason, a check fails (my script, xinetd, backend server, etc.
stalls), then it'll only wait 900ms.  Then it'll try again 1s later.
I figure within (900ms + 1s) it might be OK and respond back
properly (ignoring why it may have failed the first time).  Not the
cleanest way, but if anyone has suggestions, I'd welcome them.
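
Roughly, the settings described above would look like this sketch (the server
line is shortened and the address is a placeholder):

listen farm address:port
  mode tcp
  timeout check 900           # in milliseconds by default
  server A 1.1.1.1:389 check addr localhost port 9000 inter 5s fastinter 1s fall 4 rise 2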

I tried using 1.4-dev5 rather than the stable 1.3.22.  I noticed 1.4-dev5
shows more diagnostics in my /var/log/messages.  This is what I see
when I do the -sf option.  I also noticed it jumps a PID.  15286 is
the old process.  I run haproxy with -sf and it starts a new process,
21905.  The old one pauses the proxy, the new one starts the proxy,
and then the old one finally stops.  The new one, I guess, tries to
bring up or check the status of the backend servers of one farm, and
thinks they're all down because of a socket error.  But then it changes
PID to 21906 and starts checking the backend servers of another farm.
From there, it stays running as this new PID.

Jan  6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPFarm.
Jan  6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPSFarm.
Jan  6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPFarm started.
Jan  6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPSFarm started.
Jan  6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPFarm in 0 ms.
Jan  6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPSFarm in 0 ms.
Jan  6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPFarm stopped.
Jan  6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPSFarm stopped.
Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN,
reason: Socket error, check duration: 46ms. 1 active and 0 backup
servers online. 0 sessions requeued, 0 total in queue.
Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN,
reason: Socket error, check duration: 41ms. 0 active and 0 backup
servers online. 0 sessions requeued, 0 total in queue.
Jan  6 09:37:41 lbtest1 haproxy[21905]: proxy LDAPFarm has no server available!
Jan  6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is
DOWN, reason: Socket error, check duration: 277ms. 1 active and 0
backup servers online. 0 sessions requeued, 0 total in queue.
Jan  6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is
DOWN, reason: Socket error, check duration: 407ms. 0 active and 0
backup servers online. 0 sessions requeued, 0 total in queue.
Jan  6 09:37:42 lbtest1 haproxy[21906]: proxy LDAPSFarm has no server
available!
Jan  6 09:37:47 lbtest1 haproxy[21906]: Server LDAPFarm/LDAP2 is UP,
reason: Layer7 check passed, code: 200, info: OK, check duration:
354ms. 1 active and 0 backup servers left. 0 sessions active, -1
requeued, 0 remaining in queue.
Jan  6 09:37:49 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is UP,
reason: Layer7 check passed, code: 200, info: OK, check duration:
595ms. 1 active and 0 backup servers left. 0 sessions active, -1
requeued, 0 remaining in queue.
Jan  6 09:37:49 lbtest1 haproxy[21906]: Server LDAPFarm/LDAP1 is UP,
reason: Layer7 check passed, code: 200, info: OK, check duration:
572ms. 2 active and 0 backup servers left. 0 sessions active, -1
requeued, 0 remaining in queue.
Jan  6 09:37:49 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is UP,
reason: Layer7 check passed, code: 200, info: OK, check duration:
595ms. 2 active and 0 backup servers left. 0 

Re: [ANNOUNCE] haproxy 1.4-dev5 with keep-alive :-)

2010-01-06 Thread Cyril Bonté
On Tuesday, January 5, 2010 at 23:42:46, Willy Tarreau wrote:
 On Tue, Jan 05, 2010 at 11:14:32PM +0100, Cyril Bonté wrote:
  Well, eventually after several different tests, that's OK for me.
  A short http-request timeout (a few seconds max) will prevent the 
  accumulation of connections ESTABLISHED on the haproxy side 
  (which use sessions in haproxy that will never read anything) but 
  nonexistent on the client side.
 
 indeed, and against this you can also use option abortonclose which
 will simply abort the requests if their input channel is empty before
 a connection is established.

Sadly not, for this specific case. After 20 hours (no timeouts were set), the 
connection remained established.

-- 
Cyril Bonté
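
For reference, the two settings discussed above are declared roughly as in the
following sketch (the 5s value is arbitrary):

defaults
  mode http
  timeout http-request 5s     # give up if a complete request isn't received in time
  option abortonclose         # abort requests whose client closed before the server connection was established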



Re: [ANNOUNCE] haproxy 1.4-dev5 with keep-alive :-)

2010-01-06 Thread Willy Tarreau
Hi Cyril,

On Wed, Jan 06, 2010 at 08:58:17PM +0100, Cyril Bonté wrote:
 On Tuesday, January 5, 2010 at 23:42:46, Willy Tarreau wrote:
  On Tue, Jan 05, 2010 at 11:14:32PM +0100, Cyril Bonté wrote:
   Well, eventually after several different tests, that's OK for me.
   A short http-request timeout (a few seconds max) will prevent the 
   accumulation of connections ESTABLISHED on the haproxy side 
   (which use sessions in haproxy that will never read anything) but 
   nonexistent on the client side.
  
  indeed, and against this you can also use option abortonclose which
  will simply abort the requests if their input channel is empty before
  a connection is established.
 
 Sadly not, for this specific case. After 20 hours (no timeouts were set), the 
 connection remained established.

This morning I found one good reason for those issues. I cannot
reproduce them in the lab but they slowly accumulate on one of
the two prod servers (about 10-20 per day).

The cause lies in the way the analysers are re-enabled to parse
a second request. My assumption was that once one analyser was enabled
from another one, it would automatically be called, but that's not
the case if the target one was already called in the same round.
And unfortunately, by the time it is enabled, there is no more I/O
on the socket so it never gets woken up again.

So I have started to see how to correctly call an analyser once,
then again only if it is re-enabled by another analyser. Now I think I
have found the right logic for this, I just need to run it by
hand first to ensure it's OK, then implement it.

Additionally, I have noticed some dangerous changes to the
BF_DONT_READ flag under some circumstances, which sometimes
disable any further reading on a socket. That, combined with
the issue above, can theoretically freeze a socket for good.

I have looked at the session states by connecting to the
stats socket, here's what I got:

 show sess
0x816f068: proto=tcpv4 src=XX.XX.XXX.XXX:44578 fe=public be=public srv=none 
ts=08 age=11h21m calls=12 rq[f=1501000h,l=783,an=0eh,rx=,wx=,ax=] 
rp[f=2001000h,l=0,an=00h,rx=,wx=,ax=] s0=[7,18h,fd=2,ex=] s1=[0,0h,fd=-1,ex=] 
exp=

rq.f = 1501000h = 1=BF_DONT_READ
rq.an = 0eh = 3 analysers including http_wait_request(), which
would have cleared BF_DONT_READ if it were called again.

So I have also made the rules for using BF_DONT_READ stricter so
that we can't leave it set when leaving an analyser. I already
have the patch for that, but without the former fix it will not
bring anything, so I want to fix the other one first, and will
keep you updated.

I will not release -dev6 until it can run for one day on both
prod servers without leaving *any* stuck session, otherwise
that's plain unacceptable. And I want to fix the issues at
their root, not just the symptoms.

Thanks for your tests and feedback,
Willy




Re: [PATCH 2/5] [MINOR] stats: add a link & a href for sockets

2010-01-06 Thread Krzysztof Olędzki

On 2010-01-06 20:18, Cyril Bonté wrote:

Hi Krzysztof and Willy,

On Tuesday, January 5, 2010 at 17:08:23, Krzysztof Piotr Oledzki wrote: 

This patch adds a link (<a href> html tags) for sockets.
As sockets may have the same names as servers, I decided to
add a '+' char (forbidden in names assigned to servers) as a prefix.


After reading this commit, I've tested the socket-stats option introduced with 
that one :
http://haproxy.1wt.eu/git?p=haproxy.git;a=commit;h=aeebf9ba6574ca5b8c352685546c0799ecd5e259

I think it can be very helpful for diagnostics if the auto-calculated listener 
name displays the listener address and port instead of its id.
See the attached patch if that's OK for you (done with the 20100106 
snapshot) ;)


Good Idea!

However, the same functionality is provided by the very recently introduced 
"stats show-legends" option, but you need to enable it explicitly. I 
didn't enable it by default as I don't want to provide such 
information to everyone who is merely able to access the stats.


If you enable it, all you need to do is to move your mouse cursor over 
the listener's td. Does this satisfy your needs?
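
For reference, enabling it looks roughly like this sketch (the address, name
and uri are placeholders):

listen stats-page 0.0.0.0:8080
  mode http
  stats enable
  stats uri /stats
  stats show-legends          # extra details shown as tooltips on the stats page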



Best regards,

Krzysztof Olędzki



Re: Backend servers flagged as DOWN a lot & timeout check/connect

2010-01-06 Thread Paul Hirose
Thank you for your help. :)

2010/1/6 Krzysztof Olędzki o...@ans.pl:
 On 2010-01-06 18:45, Paul Hirose wrote:
 The docs under retries says if a connection attempt fails, it waits
 one second, and then tries again.

 This 1s timeout is only used in case of an immediate error (like a TCP RST),
 not in case of timeouts.

 I was wondering how (if at all)
 that works in conjunction with timeout connect, which is how long
 haproxy waits to try to connect to a backend server.  Is the one
 second delay between retries *after* the timeout connect number of
 seconds (after all, until timeout connect number of seconds has
 passed, the connection attempt hasn't failed)?

 - above two timeouts are independent,
 - there is no 1s turnaround after a timeout.


So to summarize the timeout issues when connecting to the backend server: a
client request comes into haproxy, which is then sent to one of the
backend servers.  If the connection fails immediately, then haproxy
waits 1s and then tries again with the same backend server.  It repeats
this up to "retries" times, or up to "retries - 1" times if
"option redispatch" is set (the last retry being sent to some
other backend server).

For a non-immediate error (just trying to connect and hanging
there) where no connection is actually made, haproxy will wait
up to "timeout connect".  If after that much time a
connection still isn't established, will haproxy immediately try to
connect again, rather than waiting 1s before connecting again to
the same backend server?
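
For reference, the keywords involved in that summary are declared roughly like
this sketch (the values are placeholders):

defaults
  retries 3                   # connection attempts per request
  option redispatch           # allow the last retry to go to another server
  timeout connect 5s          # how long to wait for a backend connection to be established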

 This is what I see when I do the -sf option.

 CUT

 Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN,
 reason: Socket error, check duration: 46ms. 1 active and 0 backup
 servers online. 0 sessions requeued, 0 total in queue.

 Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN,
 reason: Socket error, check duration: 41ms. 0 active and 0 backup
 servers online. 0 sessions requeued, 0 total in queue.

 OK. Obviously you are getting a HCHK_STATUS_SOCKERR condition here, which is
 still rather ambiguous. There are three calls to set_server_check_status()
 with NULL as the additional info:

 $ egrep -R set_s.*HCHK_STATUS_SOCKERR.*NULL src
 src/checks.c:                   set_server_check_status(s,
 HCHK_STATUS_SOCKERR, NULL);
 src/checks.c:
 set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
 src/checks.c:
 set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);

 Could you please try to change the second NULL to strerror(errno)?

I've made that patch.  I am using 1.4-dev5.tar.gz, not the snapshot
from 1.4-ss-20100106.tar.gz.  With your three strerror(errno) patches
in, I am now seeing a bit more info in my /var/log/messages:

Jan  6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP1 is DOWN,
reason: Socket error, info: Resource temporarily unavailable, check
duration: 51ms. 1 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

Jan  6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP2 is DOWN,
reason: Socket error, info: Resource temporarily unavailable, check
duration: 41ms. 0 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

I also get a slightly different error message on occasion:
Jan  6 12:21:49 lbtest1 haproxy[15709]: Server LDAPSFarm/LDAPS1 is
DOWN, reason: Socket error, info: Operation now in progress, check
duration: 277ms. 1 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

I don't notice a pattern as to which backend server's health-check gets
either the "Operation now in progress" or the "Resource temporarily
unavailable" error.  It seems random.

 I noticed you are using addr to use localhost as the source address of
 your health-checks:

        server LDAP1 :389 check addr localhost port 9101 inter 5s
 fastinter 1s downinter 5s fall 2 rise 2
        server LDAP2 :389 check addr localhost port 9102 inter 5s
 fastinter 1s downinter 5s fall 2 rise 2

 I think that could be the source of your problems.

 I'll try to reproduce a similar condition in my environment, but before I
 am able to do this - would you please try to drop "addr localhost" for
 now and check if it makes any difference?

I need to do the "check addr localhost port 9101", for example.  My
health-check scripts actually run on the same computer as haproxy
(and not on the backend server).  I don't have access to the actual
backend server(s) and thus cannot put a health-check script on them.
I changed localhost to 127.0.0.1 just on the off chance there might be
something there.

My xinetd.conf has instances 50, per_source 10, so I figure xinetd
should be able to run multiple copies of my health-check scripts at
one time, if it came to that.  I do have "spread-checks 20" in my
haproxy.cfg file, just to try and spread the checks around.  But I figure a
reload/start of haproxy won't spread the checks around.
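
For reference, spread-checks is a global keyword taking a percentage, roughly
as in this sketch:

global
  spread-checks 20            # add +/- 20% random jitter to health-check intervals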

Thank you,
PH



# of haproxy processes after a -sf

2010-01-06 Thread Paul Hirose
I have an haproxy process running, processing one very long request.  I run
another haproxy -sf, which does the whole reload.  The long
request (an LDAP database query) continues on its way. :)  So that's good
:)

But I thought the old process would stick around until that entire
request was done, even though the new haproxy process is now running
and processing all new incoming requests and doing the balancing, etc.

But a quick ps -auxw | grep haproxy shows only one process.  I just
expected two.  I guess it's ok, since the big long request continues
(doesn't get aborted), and the new config file parameters are working
fine, and new incoming connections are being processed :)

Anyway, just thought I'd ask on that :)
Thank you,
PH



Re: [PATCH 2/5] [MINOR] stats: add a link & a href for sockets

2010-01-06 Thread Cyril Bonté
On Wednesday, January 6, 2010 at 21:28:35, Krzysztof Olędzki wrote:
 On 2010-01-06 20:18, Cyril Bonté wrote:
  Hi Krzysztof and Willy,
  
  On Tuesday, January 5, 2010 at 17:08:23, Krzysztof Piotr Oledzki wrote: 
  This patch adds a link (<a href> html tags) for sockets.
  As sockets may have the same names as servers, I decided to
  add a '+' char (forbidden in names assigned to servers) as a prefix.
  
  After reading this commit, I've tested the socket-stats option introduced 
  with that one :
  http://haproxy.1wt.eu/git?p=haproxy.git;a=commit;h=aeebf9ba6574ca5b8c352685546c0799ecd5e259
  
  I think it can be very helpful for diagnostics if the auto-calculated 
  listener name displays the listener address and port instead of its id.
  See the attached patch if that's OK for you (done with the 20100106 
  snapshot) ;)
 
 Good Idea!
 
 However, the same functionality is provided by the very recently introduced 
 "stats show-legends" option, but you need to enable it explicitly. I 
 didn't enable it by default as I don't want to provide such 
 information to everyone who is merely able to access the stats.

Ah yes, I missed it.


 If you enable it, all you need to do is to move your mouse cursor over 
 the listener's td. Does this satisfy your needs?

Yes, but not completely, as the information can't be found in the CSV export, 
which could be useful for monitoring tools.
That said, if I need it, I can manually set the name on the bind lines, so it's 
probably OK.

Thanks.

-- 
Cyril Bonté



Re: Backend servers flagged as DOWN a lot & timeout check/connect

2010-01-06 Thread Krzysztof Olędzki

On 2010-01-06 21:31, Paul Hirose wrote:

2010/1/6 Krzysztof Olędzki o...@ans.pl:

On 2010-01-06 18:45, Paul Hirose wrote:

The docs under retries says if a connection attempt fails, it waits
one second, and then tries again.

This 1s timeout is only used in case of an immediate error (like a TCP RST),
not in case of timeouts.


I was wondering how (if at all)
that works in conjunction with timeout connect, which is how long
haproxy waits to try to connect to a backend server.  Is the one
second delay between retries *after* the timeout connect number of
seconds (after all, until timeout connect number of seconds has
passed, the connection attempt hasn't failed)?

- above two timeouts are independent,
- there is no 1s turnaround after a timeout.



So to summarize the timeout issues when connecting to the backend server: a
client request comes into haproxy, which is then sent to one of the
backend servers.  If the connection fails immediately, then haproxy
waits 1s and then tries again with the same backend server.  It repeats
this up to "retries" times, or up to "retries - 1" times if
"option redispatch" is set (the last retry being sent to some
other backend server).

For a non-immediate error (just trying to connect and hanging
there) where no connection is actually made, haproxy will wait
up to "timeout connect".  If after that much time a
connection still isn't established, will haproxy immediately try to
connect again, rather than waiting 1s before connecting again to
the same backend server?


Not yet. Such an enhancement was recently suggested, even with a patch, 
but hasn't been implemented yet, as I would like to skip the 1s turnaround only 
if there is a high chance of selecting a different server. However, it is 
near the top of my short-term TODO list.



This is what I see when I do the -sf option.

CUT


Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN,
reason: Socket error, check duration: 46ms. 1 active and 0 backup
servers online. 0 sessions requeued, 0 total in queue.
Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN,
reason: Socket error, check duration: 41ms. 0 active and 0 backup
servers online. 0 sessions requeued, 0 total in queue.

OK. Obviously you are getting a HCHK_STATUS_SOCKERR condition here, which is
still rather ambiguous. There are three calls to set_server_check_status()
with NULL as the additional info:


$ egrep -R set_s.*HCHK_STATUS_SOCKERR.*NULL src
src/checks.c:   set_server_check_status(s,
HCHK_STATUS_SOCKERR, NULL);
src/checks.c:
set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
src/checks.c:
set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);

Could you please try to change the second NULL to strerror(errno)?


I've made that patch.  I am using 1.4-dev5.tar.gz, not the snapshot
from 1.4-ss-20100106.tar.gz.  With your three strerror(errno) patches
in, I am now seeing a bit more info in my /var/log/messages:

Jan  6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP1 is DOWN,
reason: Socket error, info: Resource temporarily unavailable, check
duration: 51ms. 1 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

Jan  6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP2 is DOWN,
reason: Socket error, info: Resource temporarily unavailable, check
duration: 41ms. 0 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.


Like I thought - EAGAIN. It doesn't tell us too much. :(


I also get a slightly different error message on occasion:
Jan  6 12:21:49 lbtest1 haproxy[15709]: Server LDAPSFarm/LDAPS1 is
DOWN, reason: Socket error, info: Operation now in progress, check
duration: 277ms. 1 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.


EINPROGRESS, the same. :(


I don't notice a pattern as to which backend server's health-check gets
either the "Operation now in progress" or the "Resource temporarily
unavailable" error.  It seems random.


For now, the only suggestion I have for you is to try running haproxy under 
strace and checking which syscalls fail shortly before a "Socket error" 
message is written. But I'm afraid we would end up needing to add 
something like: 
http://haproxy.1wt.eu/git?p=haproxy.git;a=commitdiff;h=6492db5453a3d398f096e9f7d6e84ea3984a1f04

in more places.


I noticed you are using addr to use localhost as the source address of
your health-checks:


   server LDAP1 :389 check addr localhost port 9101 inter 5s
fastinter 1s downinter 5s fall 2 rise 2
   server LDAP2 :389 check addr localhost port 9102 inter 5s
fastinter 1s downinter 5s fall 2 rise 2

I think that could be the source of your problems.

I'll try to reproduce a similar condition in my environment, but before I
am able to do this - would you please try to drop "addr localhost" for
now and check if it makes any difference?


I need to do the check addr localhost port 9101 for example.  My
health-check scripts

Can HAProxy's Balancing Mechanism Be Called NAT?

2010-01-06 Thread Joe P.H. Chiang
Hi All
I was wondering if HAProxy's balancing mechanism can be called a NAT
mechanism, because it masks the servers' IP addresses and then routes
the traffic to its destination.

I was just discussing this with my colleague, and my argument is that
it's only a proxy: in between the two PCs there is a surrogate PC that
tells where the traffic's destination is.

-- 
Thanks,
Joe