Re: [ANNOUNCE] haproxy 1.4-dev5 with keep-alive :-) memory consumption
Definitely the haproxy process; nothing else runs on there, and the older version remains stable for days/weeks:

F S UID      PID   PPID C PRI NI ADDR SZ      WCHAN  STIME TTY TIME     CMD
1 S nobody   15547 1   18  80  0    - 1026097 epoll_ 10:54 ?   00:54:30 /usr/sbin/haproxy14d5 -D -f /etc/haproxy/haproxyka.cfg -p /var/run/haproxy.pid -sf 15536
1 S nobody   20631 1   29  80  0    - 17843   epoll_ 13:48 ?   00:33:37 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 15547

On 1/5/10 11:10 PM, Willy Tarreau wrote:
> On Tue, Jan 05, 2010 at 11:00:30PM -0800, Hank A. Paulson wrote:
>> Using git 034550b7420c24625a975f023797d30a14b80830 "[BUG] stats: show
>> UP/DOWN status also in tracking servers", 6 hours ago... I am still
>> seeing continuous memory consumption (about 1+ GB/hr) at 50-60 Mbps
>> even after the number of connections has stabilized:
>
> OK. Is this memory used by the haproxy process itself? If so, could you
> please send me your exact configuration so that I may have a chance to
> spot something in the code related to what you use? A memory leak is
> something very unlikely in haproxy, though it's not impossible.
> Everything works with pools which are released when the session closes.
> But maybe something in this area escaped from my radar (eg: header
> captures in keep-alive, etc...).
>
>>   69 CLOSE_WAIT
>>    9 CLOSING
>> 4807 ESTABLISHED
>>   35 FIN_WAIT1
>>    4 FIN_WAIT2
>>  255 LAST_ACK
>>   10 LISTEN
>> 3410 SYN_RECV
>
> This one is really impressive. 3410 SYN_RECV basically means you're
> under a SYN flood, or your network stack is not correctly tuned and
> you're slowing down your users a lot because they need to wait 3s
> before retransmitting.
>
> Regards,
> Willy

Thanks, we pride ourselves on our huge SYN queue... :)
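Willy's remark about stack tuning usually points at the SYN backlog and syncookies. A sketch of the Linux sysctls commonly inspected in this situation (values here are illustrative examples, not recommendations from this thread):

```
# /etc/sysctl.conf (illustrative values - tune to your workload)
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 16384
net.core.somaxconn = 16384
```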
Re: [PATCH 4/5] [MEDIUM] default-server support
On 2010-01-06 00:44, Willy Tarreau wrote:
> Hi Krzysztof,

Hi Willy,

> I've merged all of your patches.

Thanks.

> However I have a minor concern about something in this one :
>
>  -The currently supported settings are the following ones.
>  +The currently supported settings are the following ones, the ones marked with
>  +[D] are also supported for default-server.
>  (...)
>  -error-limit <count>
>  +[D] error-limit <count>
>  (...)
>  -fall <count>
>  +[D] fall <count>
>  etc...
>
> While I understand the reason you have added this tag, it breaks the
> ability to search for a keyword at the beginning of a line using
> ^error-limit. Maybe we should put the tag at the end of the line (77th
> to 79th chars), or maybe we should simply suggest that everything is
> supported in defaults except those explicitly stated otherwise, and
> just add a one line comment for those keywords not supported in
> default-server?

How about adding "Supported in default-server: Yes/No" to each keyword?

> If you get a nice idea on this subject, feel free to send a patch for
> it. If you do so, please also add the following hunk that I wanted to
> fix but forgot at the last minute :
>
> diff --git a/doc/configuration.txt b/doc/configuration.txt
> index eaec73b..c6cfa98 100644
> --- a/doc/configuration.txt
> +++ b/doc/configuration.txt
> @@ -724,7 +724,7 @@
>  capture response header        -          X         X         -
>  clitimeout      (deprecated)   X          X         X         -
>  contimeout      (deprecated)   X          -         X         X
>  cookie                         X          -         X         X
> -default-server                 X          -         X         -
> +default-server                 X          -         X         X
>  default_backend                -          X         X         -
>  description                    -          X         X         X
>  disabled                       X          X         X         X

Sure. I have no idea why I did it wrong. :|

Best regards,
Krzysztof Olędzki
[PATCH] [BUG] stats: cookie should be reported under backend not under proxy
From 7046fe6b5245deb06f760896fa7e10c2163eda60 Mon Sep 17 00:00:00 2001
From: Krzysztof Piotr Oledzki <o...@ans.pl>
Date: Wed, 6 Jan 2010 15:03:18 +0100
Subject: [BUG] stats: cookie should be reported under backend, not under proxy

---
 src/dumpstats.c |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/src/dumpstats.c b/src/dumpstats.c
index a9422e2..495e280 100644
--- a/src/dumpstats.c
+++ b/src/dumpstats.c
@@ -1239,17 +1239,6 @@ int stats_dump_proxy(struct session *s, struct proxy *px, struct uri_auth *uri)
 		             proxy_cap_str(px->cap), proxy_mode_str(px->mode),
 		             px->uuid);
 
-		/* cookie */
-		if (px->cookie_name) {
-			struct chunk src;
-
-			chunk_printf(&msg, ", cookie: '");
-			chunk_initlen(&src, px->cookie_name, 0, strlen(px->cookie_name));
-			chunk_htmlencode(&msg, &src);
-
-			chunk_printf(&msg, "'");
-		}
-
 		chunk_printf(&msg, "\"");
 	}
@@ -1897,9 +1886,23 @@ int stats_dump_proxy(struct session *s, struct proxy *px, struct uri_auth *uri)
 		if (uri->flags & ST_SHLGNDS) {
 			/* balancing */
-
-			chunk_printf(&msg, " title=\"balancing: %s\"",
+			chunk_printf(&msg, " title=\"balancing: %s",
 			             backend_lb_algo_str(px->lbprm.algo & BE_LB_ALGO));
+
+			/* cookie */
+			if (px->cookie_name) {
+				struct chunk src;
+
+				chunk_printf(&msg, ", cookie: '");
+
+				chunk_initlen(&src, px->cookie_name, 0, strlen(px->cookie_name));
+				chunk_htmlencode(&msg, &src);
+
+				chunk_printf(&msg, "'");
+			}
+
+			chunk_printf(&msg, "\"");
+		}
 
 		chunk_printf(&msg,
-- 
1.6.4.2
[PATCH] [BUG] cfgparser/stats: fix error message
From 0d95cf9f607de59487c644af3077be1a84eb4b81 Mon Sep 17 00:00:00 2001
From: Krzysztof Piotr Oledzki <o...@ans.pl>
Date: Wed, 6 Jan 2010 16:25:05 +0100
Subject: [BUG] cfgparser/stats: fix error message

Fix the error message by unification and a goto; previously we had two
independent lists of supported keywords and were reporting 'stats'
instead of the wrong keyword.

Config:
 stats wrong-keyword
 stats

Before:
 [ALERT] 005/163032 (27175) : parsing [haproxy.cfg:248] : unknown stats parameter 'stats' (expects 'hide-version', 'uri', 'realm', 'auth' or 'enable').
 [ALERT] 005/163032 (27175) : parsing [haproxy.cfg:249] : 'stats' expects 'uri', 'realm', 'auth', 'scope' or 'enable', 'hide-version', 'show-node', 'show-desc', 'show-legends'.

After:
 [ALERT] 005/162841 (22710) : parsing [haproxy.cfg:248]: unknown stats parameter 'wrong-keyword', expects 'uri', 'realm', 'auth', 'scope', 'enable', 'hide-version', 'show-node', 'show-desc' or 'show-legends'.
 [ALERT] 005/162841 (22710) : parsing [haproxy.cfg:249]: missing keyword in 'stats', expects 'uri', 'realm', 'auth', 'scope', 'enable', 'hide-version', 'show-node', 'show-desc' or 'show-legends'.
---
 src/cfgparse.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/src/cfgparse.c b/src/cfgparse.c
index c0d6dfe..f3cfc61 100644
--- a/src/cfgparse.c
+++ b/src/cfgparse.c
@@ -1978,10 +1978,8 @@ int cfg_parse_listen(const char *file, int linenum, char **args, int kwm)
 		if (curproxy != &defproxy && curproxy->uri_auth == defproxy.uri_auth)
 			curproxy->uri_auth = NULL; /* we must detach from the default config */
 
-		if (*(args[1]) == 0) {
-			Alert("parsing [%s:%d] : '%s' expects 'uri', 'realm', 'auth', 'scope' or 'enable', 'hide-version', 'show-node', 'show-desc', 'show-legends'.\n", file, linenum, args[0]);
-			err_code |= ERR_ALERT | ERR_FATAL;
-			goto out;
+		if (!*args[1]) {
+			goto stats_error_parsing;
 		} else if (!strcmp(args[1], "uri")) {
 			if (*(args[2]) == 0) {
 				Alert("parsing [%s:%d] : 'uri' needs an URI prefix.\n", file, linenum);
@@ -2110,8 +2108,9 @@ int cfg_parse_listen(const char *file, int linenum, char **args, int kwm)
 				free(desc);
 			}
 		} else {
-			Alert("parsing [%s:%d] : unknown stats parameter '%s' (expects 'hide-version', 'uri', 'realm', 'auth' or 'enable').\n",
-			      file, linenum, args[0]);
+stats_error_parsing:
+			Alert("parsing [%s:%d]: %s '%s', expects 'uri', 'realm', 'auth', 'scope', 'enable', 'hide-version', 'show-node', 'show-desc' or 'show-legends'.\n",
+			      file, linenum, *args[1] ? "unknown stats parameter" : "missing keyword in", args[*args[1] ? 1 : 0]);
 			err_code |= ERR_ALERT | ERR_FATAL;
 			goto out;
 		}
-- 
1.6.4.2
Fwd: Per-Server arguments for httpchk health-checking
I sent this out yesterday (1/5/2010) but didn't see it come back to me on the list, nor do I see it in the mailing list archives at http://www.formilux.org/archives/haproxy/1001/date.html, so I'm trying again. The other msg I sent yesterday (about Retries/Option Dispatch) did make it to the list (it's in the archive), although I didn't see it come to me in my email (sigh). I do occasionally get notices from the list manager about how an email to me bounced. If the list manager could send me a copy of what I bounced back, that'd be great, so I can track it down.

Thank you, PH

---------- Forwarded message ----------
From: Paul Hirose <paulhir...@gmail.com>
Date: Tue, Jan 5, 2010 at 10:42 AM
Subject: Per-Server arguments for httpchk health-checking
To: haproxy@formilux.org

Is there a way to have haproxy send a per-server argument during a health check? Right now, I do "server A check addr localhost 9000" and "server B check addr localhost 9001" and so on, and have xinetd monitor 9000/TCP and 9001/TCP. When haproxy connects to those, xinetd in turn runs a health-check script I wrote that actually does the checking. When done checking, the script returns either HTTP 200 or HTTP 500 depending on whether it's good or not.

This means I use one port per backend server. If I have 10 servers that haproxy spreads the load across, I have "server ... 9002" and "server ... 9003" and so on through 9009. I then have 10 different copies of my little script, each of which just connects to a different server (A, B, C...) and does the check. If haproxy could do something like "server A check addr localhost 9000 argument A", I could have only one port watched by xinetd, and only one copy of the script, which would simply accept "A" as an argument naming the server it should check. I use "option httpchk" with the above (all in the same listen group), but I wasn't sure if I could change "option httpchk" for every server.
Could I do a:

listen farm address:port
    balance roundrobin
    mode tcp
    option httpchk GET 1
    server A 1.1.1.1:389 check addr localhost port 9000 inter 5s fastinter 1s downinter 120s
    option httpchk GET 2
    server B 1.1.1.2:389 check addr localhost port 9000 inter 5s fastinter 1s downinter 120s

Would that do a "GET 1" when it tries an http health check for server A, and a "GET 2" when doing a health check for server B? These are LDAP servers on the back end, btw. I can't put the health-check script on the backend server itself. And they wanted to make sure the health check passed from the actual haproxy load balancer talking to the backend server, rather than the backend server talking to itself.

Thank you, PH
Backend servers flagged as DOWN a lot timeout check/connect
Busy little haproxy beaver today. :)

The docs under retries say that if a connection attempt fails, it waits one second, and then tries again. I was wondering how (if at all) that works in conjunction with "timeout connect", which is how long haproxy waits when trying to connect to a backend server. Is the one-second delay between retries *after* the "timeout connect" number of seconds (after all, until "timeout connect" seconds have passed, the connection attempt hasn't failed)?

I stumbled across "timeout check" today. I've noticed my backend servers tend to get flagged as DOWN a lot, especially when I first start or reload haproxy. Then usually, a few inter (or downinter) seconds later, they get flagged as up. The backend server is definitely not down during that time. I suppose it's really not haproxy itself, but either my own health-check script and/or xinetd (which launches my health-check script) that might be causing a problem. I don't know why it's doing this. I do notice that whenever I do have a backend server flagged as down, and I do a ps to look around, there are a few instances of my health-check script running (or stalled or whatever).

After haproxy connects, it waits "timeout check" or inter time for a response before giving up and calling that a failure? But since it's launched from xinetd, even though haproxy might close the connection after "timeout check" (or inter) amount of time, I think the health-check script process continues to stick around until it's done.

I was thinking I might try setting fastinter 1s and timeout check 900 (milliseconds, I think, by default), and fall 4. So if, for some reason, a check fails (my script, xinetd, the backend server, etc. stalls), then it'll only wait 900ms. Then it'll try again 1s later. I figure within (900ms + 1s) it might be OK and respond back properly (ignoring why it may have failed the first time). Not the cleanest way, but if anyone has suggestions, I'd welcome them.
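The combination being proposed might look like this in a config. This is a sketch only: the farm/server names and addresses are placeholders, and "timeout check" takes milliseconds when no unit is given.

```
defaults
    timeout check 900

listen LDAPFarm 0.0.0.0:389
    balance roundrobin
    mode tcp
    option httpchk
    server LDAP1 1.1.1.1:389 check addr 127.0.0.1 port 9101 inter 5s fastinter 1s fall 4
```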
I tried using 1.4-dev5 rather than the stable 1.3.22. I noticed 1.4-dev5 shows more diagnostics in my /var/log/messages. This is what I see when I do the -sf option. I also noticed it jumps a PID: 15286 is the old process. I run haproxy with -sf and it starts a new process, 21905. The old one pauses the proxy, the new one starts the proxy, and then the old one finally stops. The new one, I guess, tries to bring up or checks the status of the backend servers of one farm, and thinks they're all down because of a socket error. But then it changes PID to 21906 and starts checking the backend servers of another farm. From there, it stays running as this new PID.

Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPFarm.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Pausing proxy LDAPSFarm.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPFarm started.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Proxy LDAPSFarm started.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPFarm in 0 ms.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Stopping proxy LDAPSFarm in 0 ms.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPFarm stopped.
Jan 6 09:37:41 lbtest1 haproxy[15286]: Proxy LDAPSFarm stopped.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN, reason: Socket error, check duration: 46ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN, reason: Socket error, check duration: 41ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:41 lbtest1 haproxy[21905]: proxy LDAPFarm has no server available!
Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is DOWN, reason: Socket error, check duration: 277ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:42 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is DOWN, reason: Socket error, check duration: 407ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 09:37:42 lbtest1 haproxy[21906]: proxy LDAPSFarm has no server available!
Jan 6 09:37:47 lbtest1 haproxy[21906]: Server LDAPFarm/LDAP2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 354ms. 1 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.
Jan 6 09:37:49 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS1 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 595ms. 1 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.
Jan 6 09:37:49 lbtest1 haproxy[21906]: Server LDAPFarm/LDAP1 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 572ms. 2 active and 0 backup servers left. 0 sessions active, -1 requeued, 0 remaining in queue.
Jan 6 09:37:49 lbtest1 haproxy[21906]: Server LDAPSFarm/LDAPS2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 595ms. 2 active and 0 backup servers left. 0
Re: [ANNOUNCE] haproxy 1.4-dev5 with keep-alive :-)
On Tuesday, January 5, 2010 at 23:42:46, Willy Tarreau wrote:
> On Tue, Jan 05, 2010 at 11:14:32PM +0100, Cyril Bonté wrote:
>> Well, eventually after several different tests, that's OK for me. A
>> short http-request timeout (a few seconds max) will prevent the
>> accumulation of ESTABLISHED connections on the haproxy-client side
>> (which use sessions in haproxy that will never read anything) but
>> nonexistent on the client-haproxy side.
>
> Indeed, and against this you can also use "option abortonclose" which
> will simply abort the requests if their input channel is empty before a
> connection is established.

Sadly not, for this specific case. After 20 hours (no timeouts were set), the connection remained established.

-- 
Cyril Bonté
Re: [ANNOUNCE] haproxy 1.4-dev5 with keep-alive :-)
Hi Cyril,

On Wed, Jan 06, 2010 at 08:58:17PM +0100, Cyril Bonté wrote:
> On Tuesday, January 5, 2010 at 23:42:46, Willy Tarreau wrote:
>> On Tue, Jan 05, 2010 at 11:14:32PM +0100, Cyril Bonté wrote:
>>> Well, eventually after several different tests, that's OK for me. A
>>> short http-request timeout (a few seconds max) will prevent the
>>> accumulation of ESTABLISHED connections on the haproxy-client side
>>> (which use sessions in haproxy that will never read anything) but
>>> nonexistent on the client-haproxy side.
>>
>> Indeed, and against this you can also use "option abortonclose" which
>> will simply abort the requests if their input channel is empty before
>> a connection is established.
>
> Sadly not, for this specific case. After 20 hours (no timeouts were
> set), the connection remained established.

This morning I found one good reason for those issues. I cannot reproduce them in the lab, but they slowly accumulate on one of the two prod servers (about 10-20 per day).

The cause lies in the way the analysers are re-enabled to parse a second request. My assumption was that once an analyser is enabled by another one, it would automatically be called, but that's not the case if the target one was already called in the same round. And unfortunately, by the time it is enabled, there is no more I/O on the socket, so it never gets woken up again. So I have started to look at how to correctly call an analyser once, then again only if it is re-enabled by another analyser. Now I think I have found the right logic for this; I just need to run it by hand first to ensure it's OK, then implement it.

Additionally, I have noticed some dangerous changes to the BF_DONT_READ flag under some circumstances, which sometimes disable any further reading on a socket. That, combined with the issue above, can theoretically definitely freeze a socket.
I have looked at the session status by connecting to the stats socket; here's what I got:

show sess
0x816f068: proto=tcpv4 src=XX.XX.XXX.XXX:44578 fe=public be=public srv=none ts=08 age=11h21m calls=12 rq[f=1501000h,l=783,an=0eh,rx=,wx=,ax=] rp[f=2001000h,l=0,an=00h,rx=,wx=,ax=] s0=[7,18h,fd=2,ex=] s1=[0,0h,fd=-1,ex=] exp=

rq.f = 1501000h includes BF_DONT_READ, and rq.an = 0eh means 3 analysers are still enabled, including http_wait_request(), which would have cleared BF_DONT_READ if it were called again. So I have also made the rules for using BF_DONT_READ stricter, so that we can't leave it set when leaving an analyser. I already have the patch for that, but without the former fix it will not bring anything, so I want to fix the other one first, and will keep you updated.

I will not release -dev6 until it can run for one day on both prod servers without leaving *any* stuck session, otherwise that's plain unacceptable. And I want to fix the issues at their root, not just the symptoms.

Thanks for your tests and feedback,
Willy
Re: [PATCH 2/5] [MINOR] stats: add a link <a href> for sockets
On 2010-01-06 20:18, Cyril Bonté wrote:
> Hi Krzysztof and Willy,
>
> On Tuesday, January 5, 2010 at 17:08:23, Krzysztof Piotr Oledzki wrote:
>> This patch adds <a href> html links for sockets. As sockets may have
>> the same name as servers, I decided to add the '+' char (forbidden in
>> names assigned to servers) as a prefix.
>
> After reading this commit, I've tested the socket-stats option
> introduced with this one:
> http://haproxy.1wt.eu/git?p=haproxy.git;a=commit;h=aeebf9ba6574ca5b8c352685546c0799ecd5e259
>
> I think it can be very helpful for diagnostics if the auto-calculated
> listener name displays the listener address and port instead of its id.
> See the patch in attachment if it's OK for you (done with the 20100106
> snapshot) ;)

Good idea! However, the same functionality is provided by the very recently introduced "stats show-legends" option, but you need to enable it explicitly. I didn't make it enabled by default as I don't want to provide such information to everyone who is merely able to access the stats.

If you enable it, all you need to do is to move your mouse cursor over the listener's td. Does this satisfy your needs?

Best regards,
Krzysztof Olędzki
Re: Backend servers flagged as DOWN a lot timeout check/connect
Thank you for your help. :)

2010/1/6 Krzysztof Olędzki <o...@ans.pl>:
> On 2010-01-06 18:45, Paul Hirose wrote:
>> The docs under retries say that if a connection attempt fails, it
>> waits one second, and then tries again.
>
> This 1s timeout is only used in case of an immediate error (like a TCP
> RST), not in case of timeouts.
>
>> I was wondering how (if at all) that works in conjunction with timeout
>> connect, which is how long haproxy waits to try to connect to a
>> backend server. Is the one second delay between retries *after* the
>> timeout connect number of seconds (after all, until timeout connect
>> seconds have passed, the connection attempt hasn't failed)?
>
> - the above two timeouts are independent,
> - there is no 1s turnaround after a timeout.

So to summarize the timeout issues connecting to the backend server: a client request comes into haproxy, which is then sent to one of the backend servers. If the connection fails immediately, then haproxy waits 1s and then tries again with the same backend server. It repeats this up to "retries" times, or up to "retries" - 1 times if "option redispatch" is set (the last retry being sent to some other backend server).

For a non-immediate error (as in just trying to connect and hanging there) but still not actually making a connection, haproxy will wait up to "timeout connect" amount of time. If after that much time a connection still isn't established, haproxy will immediately try to connect again, rather than waiting 1s before trying the same backend server again?

>> This is what I see when I do the -sf option.
> CUT
>> Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN, reason: Socket error, check duration: 46ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
>> Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN, reason: Socket error, check duration: 41ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
>
> OK.
> Obviously you are getting a HCHK_STATUS_SOCKERR condition here, which
> is still quite ambiguous. There are three calls to
> set_server_check_status() with NULL as the additional info:
>
> $ egrep -R 'set_s.*HCHK_STATUS_SOCKERR.*NULL' src
> src/checks.c:	set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
> src/checks.c:	set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
> src/checks.c:	set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
>
> Could you please try to change the second NULL to strerror(errno)?

I've made that patch. I am using 1.4-dev5.tar.gz, not the snapshot from 1.4-ss-20100106.tar.gz. With your three strerror(errno) patches in, I am now seeing a bit more info in my /var/log/messages:

Jan 6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP1 is DOWN, reason: Socket error, info: "Resource temporarily unavailable", check duration: 51ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jan 6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP2 is DOWN, reason: Socket error, info: "Resource temporarily unavailable", check duration: 41ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

I also sometimes get a slightly different error message on occasion:

Jan 6 12:21:49 lbtest1 haproxy[15709]: Server LDAPSFarm/LDAPS1 is DOWN, reason: Socket error, info: "Operation now in progress", check duration: 277ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

I don't notice a pattern to which backend server health-check gets the "Operation now in progress" or the "Resource temporarily unavailable" error. It seems random.

> I noticed you are using "addr" to use localhost as the address of your
> health-checks:
>
> server LDAP1 :389 check addr localhost port 9101 inter 5s fastinter 1s downinter 5s fall 2 rise 2
> server LDAP2 :389 check addr localhost port 9102 inter 5s fastinter 1s downinter 5s fall 2 rise 2
>
> I think that could be the source of your problems.
> I'll try to reproduce a similar condition in my environment, but before
> I am able to do this - would you please try to drop "addr localhost"
> for now and check if it makes any difference?

I need to do the "check addr localhost port 9101", for example. My health-check scripts actually run on the same computer as haproxy (and not on the backend server). I don't have access to the actual backend server(s) and thus cannot put a health-check script on them.

I changed localhost to 127.0.0.1 just on the off chance there might be something there. My xinetd.conf has instances 50, per_source 10, so I figure xinetd should be able to run multiple copies of my health-check script at one time, if it came to that. I do have "spread-checks 20" in my haproxy.cfg file, just to try and spread it around. But I figure a reload/start of haproxy won't spread the checks around.

Thank you, PH
# of haproxy processes after a -sf
I had a haproxy process running, processing one very long request. I ran another haproxy -sf, which does the whole reload, etc. The long request (an LDAP database query) continues on its way. :) So that's good. :)

But I thought the old process would stick around until that entire request was done, even though the new haproxy process is now running, processing all new incoming requests, and doing the balancing. Yet a quick ps -auxw | grep haproxy shows only one process; I just expected two. I guess it's OK, since the big long request continues (doesn't get aborted), the new config file parameters are working fine, and new incoming connections are being processed. :)

Anyway, just thought I'd ask about that. :)

Thank you, PH
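For context, the soft-reload sequence under discussion uses the same flags shown in the ps output earlier in this digest (the paths here are examples):

```
# initial start
haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid

# soft reload: the new process binds the listeners, then tells the old
# process (via -sf) to release them and finish its existing sessions
haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -sf $(cat /var/run/haproxy.pid)
```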
Re: [PATCH 2/5] [MINOR] stats: add a link <a href> for sockets
On Wednesday, January 6, 2010 at 21:28:35, Krzysztof Olędzki wrote:
> On 2010-01-06 20:18, Cyril Bonté wrote:
>> Hi Krzysztof and Willy,
>>
>> On Tuesday, January 5, 2010 at 17:08:23, Krzysztof Piotr Oledzki wrote:
>>> This patch adds <a href> html links for sockets. As sockets may have
>>> the same name as servers, I decided to add the '+' char (forbidden in
>>> names assigned to servers) as a prefix.
>>
>> After reading this commit, I've tested the socket-stats option
>> introduced with this one:
>> http://haproxy.1wt.eu/git?p=haproxy.git;a=commit;h=aeebf9ba6574ca5b8c352685546c0799ecd5e259
>>
>> I think it can be very helpful for diagnostics if the auto-calculated
>> listener name displays the listener address and port instead of its id.
>> See the patch in attachment if it's OK for you (done with the 20100106
>> snapshot) ;)
>
> Good idea! However, the same functionality is provided by the very
> recently introduced "stats show-legends" option, but you need to enable
> it explicitly. I didn't make it enabled by default as I don't want to
> provide such information to everyone who is merely able to access the
> stats.

Ah yes, I missed it.

> If you enable it, all you need to do is to move your mouse cursor over
> the listener's td. Does this satisfy your needs?

Yes, but not completely, as the information can't be found in the CSV export, which could be useful for monitoring tools. That said, if I need it, I can manually set the name on the bind lines, so it's maybe OK. Thanks.

-- 
Cyril Bonté
Re: Backend servers flagged as DOWN a lot timeout check/connect
On 2010-01-06 21:31, Paul Hirose wrote:
> 2010/1/6 Krzysztof Olędzki <o...@ans.pl>:
>> On 2010-01-06 18:45, Paul Hirose wrote:
>>> The docs under retries say that if a connection attempt fails, it
>>> waits one second, and then tries again.
>>
>> This 1s timeout is only used in case of an immediate error (like a TCP
>> RST), not in case of timeouts.
>>
>>> I was wondering how (if at all) that works in conjunction with
>>> timeout connect, which is how long haproxy waits to try to connect to
>>> a backend server. Is the one second delay between retries *after* the
>>> timeout connect number of seconds (after all, until timeout connect
>>> seconds have passed, the connection attempt hasn't failed)?
>>
>> - the above two timeouts are independent,
>> - there is no 1s turnaround after a timeout.
>
> So to summarize the timeout issues connecting to the backend server: a
> client request comes into haproxy, which is then sent to one of the
> backend servers. If the connection fails immediately, then haproxy
> waits 1s and then tries again with the same backend server. It repeats
> this up to "retries" times, or up to "retries" - 1 times if "option
> redispatch" is set (the last retry being sent to some other backend
> server).
>
> For a non-immediate error (as in just trying to connect and hanging
> there) but still not actually making a connection, haproxy will wait up
> to "timeout connect" amount of time. If after that much time a
> connection still isn't established, haproxy will immediately try to
> connect again, rather than waiting 1s before trying the same backend
> server again?

Not yet. Such an enhancement has been suggested recently, even with a patch, but it hasn't been implemented yet, as I would like to skip the 1s turnaround only if there is a high chance of selecting a different server. However, it is near the top of my short-term TODO list.

>>> This is what I see when I do the -sf option.
>> CUT
>>> Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN, reason: Socket error, check duration: 46ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
>>> Jan 6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN, reason: Socket error, check duration: 41ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
>>
>> OK. Obviously you are getting a HCHK_STATUS_SOCKERR condition here.
>> There are three calls to set_server_check_status() with NULL as the
>> additional info:
>>
>> $ egrep -R 'set_s.*HCHK_STATUS_SOCKERR.*NULL' src
>> src/checks.c:	set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
>> src/checks.c:	set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
>> src/checks.c:	set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
>>
>> Could you please try to change the second NULL to strerror(errno)?
>
> I've made that patch. I am using 1.4-dev5.tar.gz, not the snapshot from
> 1.4-ss-20100106.tar.gz. With your three strerror(errno) patches in, I
> am now seeing a bit more info in my /var/log/messages:
>
> Jan 6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP1 is DOWN, reason: Socket error, info: "Resource temporarily unavailable", check duration: 51ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
> Jan 6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP2 is DOWN, reason: Socket error, info: "Resource temporarily unavailable", check duration: 41ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

Like I thought - EAGAIN. It doesn't tell us too much. :(

> I also sometimes get a slightly different error message on occasion:
>
> Jan 6 12:21:49 lbtest1 haproxy[15709]: Server LDAPSFarm/LDAPS1 is DOWN, reason: Socket error, info: "Operation now in progress", check duration: 277ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

EINPROGRESS, the same. :(

> I don't notice a pattern to which backend server health-check gets the
> "Operation now in progress" or the "Resource temporarily unavailable"
> error. It seems random.
For now, the only suggestion I have for you is to try running haproxy under strace and check which syscalls fail shortly before a "Socket error" message is written. But I'm afraid we would end up needing to add something like:

http://haproxy.1wt.eu/git?p=haproxy.git;a=commitdiff;h=6492db5453a3d398f096e9f7d6e84ea3984a1f04

in more places.

>> I noticed you are using "addr" to use localhost as the address of your
>> health-checks:
>>
>> server LDAP1 :389 check addr localhost port 9101 inter 5s fastinter 1s downinter 5s fall 2 rise 2
>> server LDAP2 :389 check addr localhost port 9102 inter 5s fastinter 1s downinter 5s fall 2 rise 2
>>
>> I think that could be the source of your problems.
>>
>> I'll try to reproduce a similar condition in my environment, but
>> before I am able to do this - would you please try to drop "addr
>> localhost" for now and check if it makes any difference?
>
> I need to do the "check addr localhost port 9101", for example. My
> health-check scripts
Can HAProxy's Balancing Mechanism Be Called NAT?
Hi All,

I was wondering if HAProxy's balancing mechanism can be called a NAT mechanism, because it masks the servers' IP addresses and then routes the traffic to its destination. I was just discussing this with my colleague, and my argument is that it's only a proxy: between the two PCs there is a surrogate PC that tells where the traffic's destination is.

-- 
Thanks, Joe