Re: Truncated health check response from real servers
Hi Willy, On Saturday 10 July 2010, Willy Tarreau wrote: Actually, I am not that much interested in the patch per-se, I would just like to check the body of the check-response and compare it to a string for more reliable checks. It's what the patch does. Hmmm, I thought it was bigger... I have included my original mail (and reasoning) on the subject below. It's not on top of my priority list right now either, still... sooner or later we might get bitten by it. I'd love to help out coding, but sockets and C++ are not my strong points (though I am comfortable in C). So you're welcome, haproxy is plain C, and I don't understand C++ either ;-) How did I miss that? :-D Ok, I'll take a look at it then. Thanks for updating the patch and for pointers on where to look for problematic sections. I can't promise it will be soon though, we have a lot of work right now... but I'll do my best. Thanks, Anze
Re: Truncated health check response from real servers
Hi Anze, On Mon, Jul 12, 2010 at 09:28:09AM +0200, Anze wrote: Hi Willy, On Saturday 10 July 2010, Willy Tarreau wrote: Actually, I am not that much interested in the patch per-se, I would just like to check the body of the check-response and compare it to a string for more reliable checks. It's what the patch does. Hmmm, I thought it was bigger... It has been quite simplified with Nick's work on buffering server checks. I'd say that I too recalled a bigger one ;-) Ok, I'll take a look at it then. Thanks for updating the patch and for pointers on where to look for problematic sections. I can't promise it will be soon though, we have a lot of work right now... but I'll do my best. You're welcome. Don't hurry anyway, this patch has been lying around for more than 1 year now, there's no emergency anymore ! Cheers, Willy
Re: Truncated health check response from real servers
Hi Anze, On Fri, Jul 09, 2010 at 01:48:12AM +0200, Anze wrote: Indeed it never made it in. Each time I look at it, I realize that cleaning it up and fixing it will take more time than what I have available to work on it :-/ I can understand that completely. :) Actually, I am not that much interested in the patch per-se, I would just like to check the body of the check-response and compare it to a string for more reliable checks. It's what the patch does. I have included my original mail (and reasoning) on the subject below. It's not on top of my priority list right now either, still... sooner or later we might get bitten by it. I'd love to help out coding, but sockets and C++ are not my strong points (though I am comfortable in C). So you're welcome, haproxy is plain C, and I don't understand C++ either ;-) I can help with testing though if needed. Let me know if it helps and if/when you find time for this. :) Or, if you wish, I can take Nick's patch and extract just the parts with checking the response body? Can you point out some of the errors so I can avoid them? I've rediffed the patch against 1.4.8, you can find it here : http://haproxy.1wt.eu/download/patches/http-ecv-checks-1.4.8.diff Most of the things I'm particularly concerned are in httpchk_expect() : - multiple calls to strstr(), strncmp() or regexec() on a non-null terminated string. It only works if the buffer contained a null from a previous use, otherwise it will happily overflow the buffer ! Either we find how to properly bound the functions, though I'm not sure we can for regexec(), or we force a '\0' at the end of the data received from the server, possibly truncating the response by one byte if the buffer is full. - a calloc() call inside the check code ! We don't want to perform dynamic memory allocation outside of the pools once the code is running. Also, the return is not checked. The proper way to replace this is to use trash[] as a temporary variable, it's designed for this. - working mode checked with strcmp() ! This is particularly slow and complicated, I'd rather have enums that are correctly set when the config is parsed, and make use of them normally with a swich/case of even an if statement. - one or a few inline variable declarations which will break build on older compilers still in use in some embedded systems (easily fixed though). As you see, there's not even any socket work on this, so if you feel comfortable with this, feel free to contribute, I think this is the patch that received the most different contributors' work. Once it's OK, I'm willing to merge it into next 1.4 stable because I know that several people would like to use it, and it doesn't affect other parts of code, so there's no risk of regression if it breaks. Thanks, Willy
Re: Truncated health check response from real servers
Indeed it never made it in. Each time I look at it, I realize that cleaning it up and fixing it will take more time than what I have available to work on it :-/ I can understand that completely. :) Actually, I am not that much interested in the patch per-se, I would just like to check the body of the check-response and compare it to a string for more reliable checks. I have included my original mail (and reasoning) on the subject below. It's not on top of my priority list right now either, still... sooner or later we might get bitten by it. I'd love to help out coding, but sockets and C++ are not my strong points (though I am comfortable in C). I can help with testing though if needed. Let me know if it helps and if/when you find time for this. :) Or, if you wish, I can take Nick's patch and extract just the parts with checking the response body? Can you point out some of the errors so I can avoid them? Thanks, Anze On Friday 12 March 2010, Anze wrote: Hi! First of all: Willy Tarreau others, thank you for a great piece of software! It just works. :) PHP has a bug when errors in PHP code do not always issue HTTP error status code to be sent, but instead return status 200 (http://bugs.php.net/bug.php?id=50921) The problem is that when haproxy calls check.php, the scrpt itself is not run if there is an error in it (or in the includes it needs for the checks). So this check is not really reliable. I know this is not really haproxy's problem (the bug is on the PHP side), but reliability would be much higher if haproxy could compare the response body to some string (for instance, server must return really ok in request body). That way the bug in interpreter could never cause a false positive. Example PHP script: ? // this fails: call_some_undefined_function(); // this is not executed: echo really ok ? This fails silently and return an empty document (or document with error description, depending on PHP config) with status 200. As a side note, that would also allow us to catch PHP configuration errors, such as short_tags set to off - in which case the previous code would be just listed (with status 200!), not executed. When compared to really ok, it would fail. Any chance this can be done with haproxy? Thanks, Anze On Thursday 08 July 2010, Willy Tarreau wrote: On Wed, Jul 07, 2010 at 11:37:48AM +0200, Anze wrote: Hi all, We have tried upgrading our haproxy version to 1.4.8 and found than this configuration setting fails: http-check expect string check_is_ok with message: 'http-check' only supports 'disable-on-404' I guess the patch never made it to stable (yet?). Are there any plans to add it to the trunk? Indeed it never made it in. Each time I look at it, I realize that cleaning it up and fixing it will take more time than what I have available to work on it :-/ It's still available online though, but I would really not run it if I were you, as it still lacks some memory bounds checkings. Regards, Willy
Re: Truncated health check response from real servers
Hi all, We have tried upgrading our haproxy version to 1.4.8 and found than this configuration setting fails: http-check expect string check_is_ok with message: 'http-check' only supports 'disable-on-404' I guess the patch never made it to stable (yet?). Are there any plans to add it to the trunk? Thanks, Anze On Wednesday 17 March 2010, Cyril Bonté wrote: Hi all, Le Mardi 16 Mars 2010 21:35:10, Willy Tarreau a écrit : I'm now gathering my changes and committing your patch with the small fixes above. That way we can concentrate on ECV. I've played a little with this commits. I added some traces to see what happens when a timeout occurs and noticed that the response is concatenated after the response of the previous timed out check, until the buffer is full. This can occur when the server started to send a response but was too long to finish or if a check uses HTTP/1.1 on a server with a HTTP Keep-Alive longer than the check timeout (adding Connection: close could help prevent this, maybe it should be added in the option httpchk documentation). The expect option from the ECV patch can be affected by this. The patch provided in attachment resets the buffer after a timeout bur I wonder if it can't be done in set_server_check_status(). Also, to save a little memory, the check_data buffer is only allocated for the servers that are checked.
Re: Truncated health check response from real servers
Hi all, Le Mardi 16 Mars 2010 21:35:10, Willy Tarreau a écrit : I'm now gathering my changes and committing your patch with the small fixes above. That way we can concentrate on ECV. I've played a little with this commits. I added some traces to see what happens when a timeout occurs and noticed that the response is concatenated after the response of the previous timed out check, until the buffer is full. This can occur when the server started to send a response but was too long to finish or if a check uses HTTP/1.1 on a server with a HTTP Keep-Alive longer than the check timeout (adding Connection: close could help prevent this, maybe it should be added in the option httpchk documentation). The expect option from the ECV patch can be affected by this. The patch provided in attachment resets the buffer after a timeout bur I wonder if it can't be done in set_server_check_status(). Also, to save a little memory, the check_data buffer is only allocated for the servers that are checked. -- Cyril Bonté diff -Naur haproxy-ss-20100317/src/cfgparse.c haproxy-ss-20100317-checks/src/cfgparse.c --- haproxy-ss-20100317/src/cfgparse.c 2010-03-16 22:57:27.0 +0100 +++ haproxy-ss-20100317-checks/src/cfgparse.c 2010-03-17 18:37:21.0 +0100 @@ -3116,12 +3116,6 @@ newsrv-curfd = -1; /* no health-check in progress */ newsrv-health = newsrv-rise; /* up, but will fall down at first failure */ - /* Allocate buffer for partial check results... */ - if ((newsrv-check_data = calloc(BUFSIZE, sizeof(char))) == NULL) { -Alert(parsing [%s:%d] : out of memory while allocating check buffer.\n, file, linenum); -err_code |= ERR_ALERT | ERR_ABORT; -goto out; - } cur_arg = 3; } else { newsrv = curproxy-defsrv; @@ -3563,6 +3557,13 @@ goto out; } + /* Allocate buffer for partial check results... */ + if ((newsrv-check_data = calloc(BUFSIZE, sizeof(char))) == NULL) { +Alert(parsing [%s:%d] : out of memory while allocating check buffer.\n, file, linenum); +err_code |= ERR_ALERT | ERR_ABORT; +goto out; + } + newsrv-check_status = HCHK_STATUS_INI; newsrv-state |= SRV_CHECKED; } diff -Naur haproxy-ss-20100317/src/checks.c haproxy-ss-20100317-checks/src/checks.c --- haproxy-ss-20100317/src/checks.c 2010-03-16 22:57:27.0 +0100 +++ haproxy-ss-20100317-checks/src/checks.c 2010-03-17 18:43:12.0 +0100 @@ -1345,6 +1345,9 @@ else /* HTTP, SMTP */ set_server_check_status(s, HCHK_STATUS_L7TOUT, NULL); } +/* Reset the check buffer... */ +*s-check_data = '\0'; +s-check_data_len = 0; } //fprintf(stderr, process_chk: 10\n); diff -Naur haproxy-ss-20100317/src/haproxy.c haproxy-ss-20100317-checks/src/haproxy.c --- haproxy-ss-20100317/src/haproxy.c 2010-03-16 22:57:27.0 +0100 +++ haproxy-ss-20100317-checks/src/haproxy.c 2010-03-17 18:38:08.0 +0100 @@ -859,7 +859,9 @@ free(s-id); free(s-cookie); - free(s-check_data); + if (s-check_data) { +free(s-check_data); + } free(s); s = s_next; }/* end while(s) */
Re: Truncated health check response from real servers
Le Mercredi 17 Mars 2010 21:12:45, Willy Tarreau a écrit : (...) I don't agree with restting the buffer or even considering we have an error when a session does not close, because it is a regression. For instance, all people using HTTP/1.1 checks will see a problem here. hey right, I suddenly realize I forgot to recheck that ! Rather, I think we should start the parsing as soon as the response is large enough to be parsed, and only declare a parse error if we reach the end of response on a closed response. That way we can still parse status codes as soon as we get the first packet for most cases, but we can also wait for more data. It means that for some rare cases, we can parse the response multiple times waiting for it to be complete, but that's rather rare and right now those cases don't work at all. I'll see what can be done. That looks much better, no need to wait for a complete response if only the status is checked :) And the resource will be released faster in most of the cases. -- Cyril Bonté
Re: Truncated health check response from real servers
diff -ur haproxy-1.4.1/include/types/proxy.h haproxy-1.4.1-ecv-test/include/types/proxy.h --- haproxy-1.4.1/include/types/proxy.h 2010-03-04 22:39:19.0 + +++ haproxy-1.4.1-ecv-test/include/types/proxy.h 2010-03-15 10:15:40.0 + @@ -137,6 +137,8 @@ #define PR_O2_MYSQL_CHK 0x0002 /* use MYSQL check for server health */ #define PR_O2_USE_PXHDR 0x0004 /* use Proxy-Connection for proxy requests */ #define PR_O2_CHK_SNDST 0x0008 /* send the state of each server along with HTTP health checks */ +#define PR_O2_EXPECT 0x0010 /* http-check expect sth */ +#define PR_O2_NOEXPECT 0x0020 /* http-check expect ! sth */ /* end of proxy-options2 */ /* bits for sticking rules */ @@ -274,6 +276,9 @@ int grace;/* grace time after stop request */ char *check_req; /* HTTP or SSL request to use for PR_O_HTTP_CHK|PR_O_SSL3_CHK */ int check_len;/* Length of the HTTP or SSL3 request */ + char *expect_str; /* http-check expected content */ + regex_t *expect_regex; /* http-check expected content */ + char *expect_type; /* type of http-check, such as status, string */ struct chunk errmsg[HTTP_ERR_SIZE]; /* default or customized error messages for known errors */ int uuid;/* universally unique proxy ID, used for SNMP */ unsigned int backlog; /* force the frontend's listen backlog */ diff -ur haproxy-1.4.1/include/types/server.h haproxy-1.4.1-ecv-test/include/types/server.h --- haproxy-1.4.1/include/types/server.h 2010-03-04 22:39:19.0 + +++ haproxy-1.4.1-ecv-test/include/types/server.h 2010-03-15 10:15:40.0 + @@ -147,6 +147,9 @@ struct freq_ctr sess_per_sec; /* sessions per second on this server */ int puid;/* proxy-unique server ID, used for SNMP */ + char *check_data; /* storage of partial check results */ + int check_data_len; /* length of partial check results stored in check_data */ + struct { const char *file; /* file where the section appears */ int line; /* line where the section appears */ diff -ur haproxy-1.4.1/src/cfgparse.c haproxy-1.4.1-ecv-test/src/cfgparse.c --- haproxy-1.4.1/src/cfgparse.c 2010-03-04 22:39:19.0 + +++ haproxy-1.4.1-ecv-test/src/cfgparse.c 2010-03-15 10:15:40.0 + @@ -2874,8 +2874,65 @@ /* enable emission of the apparent state of a server in HTTP checks */ curproxy-options2 |= PR_O2_CHK_SNDST; } + else if (strcmp(args[1], expect) == 0) { + if (strcmp(args[2], status) == 0 || strcmp(args[2], string) == 0) { +curproxy-options2 |= PR_O2_EXPECT; +if (*(args[3]) == 0) { + Alert(parsing [%s:%d] : '%s %s %s' expects regex as an argument.\n, + file, linenum, args[0], args[1], args[2]); + return -1; +} +curproxy-expect_type = strdup(args[2]); +curproxy-expect_str = strdup(args[3]); + } +else if (strcmp(args[2], rstatus) == 0 || strcmp(args[2], rstring) == 0) { +curproxy-options2 |= PR_O2_EXPECT; +if (*(args[3]) == 0) { +Alert(parsing [%s:%d] : '%s %s %s' expects regex as an argument.\n, +file, linenum, args[0], args[1], args[2]); +return -1; +} +curproxy-expect_regex = calloc(1, sizeof(regex_t)); +if (regcomp(curproxy-expect_regex, args[3], REG_EXTENDED) != 0) { +Alert(parsing [%s:%d] : bad regular expression '%s'.\n, file, linenum, args[0]); +return -1; +} +curproxy-expect_type = strdup(args[2]); +} +else if (strcmp(args[2], !) == 0 ) { +curproxy-options2 |= PR_O2_NOEXPECT; +if (strcmp(args[3], status) == 0 || strcmp(args[3], string) == 0) { + if (*(args[4]) == 0) { + Alert(parsing [%s:%d] : '%s %s %s %s' expects regex as an argument.\n, + file, linenum, args[0], args[1], args[2], args[3]); + return -1; + } + curproxy-expect_type = strdup(args[3]); + curproxy-expect_str = strdup(args[4]); +} +else if (strcmp(args[3], rstatus) == 0 || strcmp(args[3], rstring) == 0) { + if (*(args[4]) == 0) { + Alert(parsing [%s:%d] : '%s %s %s %s' expects regex as an argument.\n, + file, linenum, args[0], args[1], args[2], args[3]); + return -1; + } + +
Re: Truncated health check response from real servers
Hi Nick, Thanks for the update. I've quickly reviewed it and noticed some of the issues of the initial ECV patch (though I don't remember them all, I'll have to dig into my mailbox). I'm putting a few examples below. What I can propose you is to proceed in 3 phases : - I will try to extract the two features from your patch (response reassembly and ECV), and apply the first one to next 1.4. - I'll try to fix the remaining issues of the ECV code and post it for review. - then once everyone is OK and we get satisfying results, we merge it into another 1.4, so that we'll finally get it into 1.4 stable. There's also something we lose with the ECV patch : if ECV is in use, then we can't enable the disable-on-404 feature anymore. I think we could still combine them both by checking for the 404 when the response does not match the criterion. I'm not exactly sure how but we'll see that later. Some issues below : On Tue, Mar 16, 2010 at 03:50:46PM +, n...@loadbalancer.org wrote: + else if (strcmp(args[1], expect) == 0) { + if (strcmp(args[2], status) == 0 || strcmp(args[2], string) == 0) { + curproxy-options2 |= PR_O2_EXPECT; + if (*(args[3]) == 0) { + Alert(parsing [%s:%d] : '%s %s %s' expects regex as an argument.\n, warning, it's not a regex but a string in the error message. + char *contentptr = strstr(s-check_data, \r\n\r\n); here we can go past the end of the buffer. = segfault if pattern missing. + /* Check the response content against the supplied string + * or regex... */ + if (strcmp(s-proxy-expect_type, string) == 0) it's a waste of CPU cycles to compare strings for just a type which fits in an enum. Regards, Willy
Re: Truncated health check response from real servers
On Tue, Mar 16, 2010 at 06:22:09PM +0100, Willy Tarreau wrote: What I can propose you is to proceed in 3 phases : - I will try to extract the two features from your patch (response reassembly and ECV), and apply the first one to next 1.4. OK, your code was clean and the two parts were really distinct. It was a child's game to split them. Doing so, I discovered one minor and one medium issues which I fixed. The minor one is that detecting the end of a response now always requires another poll() call, which can be expensive on LBs with hundreds of servers, especially since in HTTP the close is almost always there pending after the response : 20:20:03.958207 recv(7, HTTP/1.1 200\r\nConnection: close\r\n..., 8030, 0) = 145 20:20:03.958365 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 20:20:03.958543 gettimeofday({1268767203, 958626}, NULL) = 0 20:20:03.958694 recv(7, ..., 7885, 0) = 0 20:20:03.958833 shutdown(7, 2 /* send and receive */) = 0 I've arranged the recv() call to be able to read up to EAGAIN or error or end, and now we get both the response and the close in the same call : 20:29:58.797019 recv(7, HTTP/1.1 200\r\nConnection: close\r\n..., 8030, 0) = 145 20:29:58.797182 recv(7, ..., 7885, 0) = 0 20:29:58.797356 shutdown(7, 2 /* send and receive */) = 0 The medium issue was that by supporting multiple calls to recv(), we woke up the bad old guy who sends us POLLERR status causing aborts before recv() has a chance to read a full response. This happens on the second recv() if the server has closed a bit quickly and sent an RST : 21:11:21.036600 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 993) = 1 21:11:21.054361 gettimeofday({1268770281, 54467}, NULL) = 0 21:11:21.054540 recv(7, H..., 8030, 0) = 1 21:11:21.054694 recv(7, 0x967e759, 8029, 0) = -1 EAGAIN (Resource temporarily unavailable) 21:11:21.054843 epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}}, 8, 975) = 1 21:11:21.060274 gettimeofday({1268770281, 60386}, NULL) = 0 21:11:21.060454 close(7)= 0 The fix simply consists in removing this old obsolete test, and now it's OK : 21:11:59.402207 recv(7, H..., 8030, 0) = 1 21:11:59.402362 recv(7, 0x8b5c759, 8029, 0) = -1 EAGAIN (Resource temporarily unavailable) 21:11:59.402511 epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}}, 8, 974) = 1 21:11:59.407242 gettimeofday({1268770319, 407353}, NULL) = 0 21:11:59.407425 recv(7, TTP/1.0 200 OK\r\n..., 8029, 0) = 16 21:11:59.407606 recv(7, 0x8b5c769, 8013, 0) = -1 ECONNRESET (Connection reset by peer) 21:11:59.407753 shutdown(7, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected) The trained eye would have noticed that I got a two-packet HTTP response with an RST detected before the second packet and that the check code still managed to get it right. This is really good news ! I'm now gathering my changes and committing your patch with the small fixes above. That way we can concentrate on ECV. Cheers, Willy
Re: Truncated health check response from real servers
Hello Willy. On 8 March 2010 21:25, Willy Tarreau w...@1wt.eu wrote: On Mon, Mar 08, 2010 at 04:32:54PM +, Nick Chalk wrote: Attached is a patch against v1.4.1. It contains the updated ECV patch, and the hacks to work around check responses that span multiple packets. At first glance, it seems fine. I'll have to find my notes about the original patch to see if anything I spot back then is still present, but in my opinion it looks OK. Thanks. I don't remember if there were any doc in the original patch either, I'll have to dig. No, I don't think we have any documentation. I'll try to find some time to add that to the patch. Nick. -- Nick Chalk. Loadbalancer.org Ltd. Phone: +44 (0)870 443 8779 http://www.loadbalancer.org/
Re: Truncated health check response from real servers
Hello Willy. On 3 March 2010 20:31, Willy Tarreau w...@1wt.eu wrote: OK that's perfect then. If you don't manage to sort out your issue with small packets, do not hesitate to post your work in progress to the list, it often helps a lot to work iteratively. The small-packet problem turned out to be a testing fault - the apache servers I'm using for testing were not sending the whole page when I had set a small MTU. Attached is a patch against v1.4.1. It contains the updated ECV patch, and the hacks to work around check responses that span multiple packets. It seems to be working fine in my tests, and handles check responses that are bigger than the buffer. I'd be grateful for anyone's comments and suggestions. Nick. -- Nick Chalk. Loadbalancer.org Ltd. Phone: +44 (0)870 443 8779 http://www.loadbalancer.org/ diff -ur haproxy-1.4.1/include/types/proxy.h haproxy-1.4.1-ecv/include/types/proxy.h --- haproxy-1.4.1/include/types/proxy.h 2010-03-04 22:39:19.0 + +++ haproxy-1.4.1-ecv/include/types/proxy.h 2010-03-08 10:38:22.0 + @@ -137,6 +137,8 @@ #define PR_O2_MYSQL_CHK 0x0002 /* use MYSQL check for server health */ #define PR_O2_USE_PXHDR 0x0004 /* use Proxy-Connection for proxy requests */ #define PR_O2_CHK_SNDST 0x0008 /* send the state of each server along with HTTP health checks */ +#define PR_O2_EXPECT 0x0010 /* http-check expect sth */ +#define PR_O2_NOEXPECT 0x0020 /* http-check expect ! sth */ /* end of proxy-options2 */ /* bits for sticking rules */ @@ -274,6 +276,9 @@ int grace;/* grace time after stop request */ char *check_req; /* HTTP or SSL request to use for PR_O_HTTP_CHK|PR_O_SSL3_CHK */ int check_len;/* Length of the HTTP or SSL3 request */ + char *expect_str; /* http-check expected content */ + regex_t *expect_regex; /* http-check expected content */ + char *expect_type; /* type of http-check, such as status, string */ struct chunk errmsg[HTTP_ERR_SIZE]; /* default or customized error messages for known errors */ int uuid;/* universally unique proxy ID, used for SNMP */ unsigned int backlog; /* force the frontend's listen backlog */ diff -ur haproxy-1.4.1/include/types/server.h haproxy-1.4.1-ecv/include/types/server.h --- haproxy-1.4.1/include/types/server.h 2010-03-04 22:39:19.0 + +++ haproxy-1.4.1-ecv/include/types/server.h 2010-03-08 10:38:22.0 + @@ -147,6 +147,9 @@ struct freq_ctr sess_per_sec; /* sessions per second on this server */ int puid;/* proxy-unique server ID, used for SNMP */ + char *check_data; /* storage of partial check results */ + int check_data_len; /* length of partial check results stored in check_data */ + struct { const char *file; /* file where the section appears */ int line; /* line where the section appears */ diff -ur haproxy-1.4.1/src/cfgparse.c haproxy-1.4.1-ecv/src/cfgparse.c --- haproxy-1.4.1/src/cfgparse.c 2010-03-04 22:39:19.0 + +++ haproxy-1.4.1-ecv/src/cfgparse.c 2010-03-08 10:38:22.0 + @@ -2874,8 +2874,65 @@ /* enable emission of the apparent state of a server in HTTP checks */ curproxy-options2 |= PR_O2_CHK_SNDST; } + else if (strcmp(args[1], expect) == 0) { + if (strcmp(args[2], status) == 0 || strcmp(args[2], string) == 0) { +curproxy-options2 |= PR_O2_EXPECT; +if (*(args[3]) == 0) { + Alert(parsing [%s:%d] : '%s %s %s' expects regex as an argument.\n, + file, linenum, args[0], args[1], args[2]); + return -1; +} +curproxy-expect_type = strdup(args[2]); +curproxy-expect_str = strdup(args[3]); + } +else if (strcmp(args[2], rstatus) == 0 || strcmp(args[2], rstring) == 0) { +curproxy-options2 |= PR_O2_EXPECT; +if (*(args[3]) == 0) { +Alert(parsing [%s:%d] : '%s %s %s' expects regex as an argument.\n, +file, linenum, args[0], args[1], args[2]); +return -1; +} +curproxy-expect_regex = calloc(1, sizeof(regex_t)); +if (regcomp(curproxy-expect_regex, args[3], REG_EXTENDED) != 0) { +Alert(parsing [%s:%d] : bad regular expression '%s'.\n, file, linenum, args[0]); +return -1; +} +curproxy-expect_type = strdup(args[2]); +} +else if (strcmp(args[2], !) == 0 ) { +curproxy-options2 |= PR_O2_NOEXPECT; +if (strcmp(args[3], status) == 0 || strcmp(args[3], string) == 0) { + if (*(args[4]) == 0) { + Alert(parsing [%s:%d] : '%s %s %s %s' expects regex as an
Re: Truncated health check response from real servers
Hello Willy. On 2 March 2010 21:45, Willy Tarreau w...@1wt.eu wrote: If your quick ack already works for one single check, then simply allocate a buffer for each server in cfgparse.c, and have the checks functions use that server-specific buffer instead of trash. Thanks for the pointer. I've added buffer and length variables to struct server, and allocated space with the rest of the server initialisation. Looks to be stable so far, although there's still a problem with very small response packets - something I'll work on tomorrow. If you come up with something working well, I'm willing to merge it into 1.4, so please do not hesitate to submit your work in progress. You'll also get more testers. When I have something working and presentable, I'll submit the patch. Thanks, Nick. -- Nick Chalk. Loadbalancer.org Ltd. Phone: +44 (0)870 443 8779 http://www.loadbalancer.org/
Re: Truncated health check response from real servers
On Wed, Mar 03, 2010 at 08:12:09PM +, Nick Chalk wrote: Hello Willy. On 2 March 2010 21:45, Willy Tarreau w...@1wt.eu wrote: If your quick ack already works for one single check, then simply allocate a buffer for each server in cfgparse.c, and have the checks functions use that server-specific buffer instead of trash. Thanks for the pointer. I've added buffer and length variables to struct server, and allocated space with the rest of the server initialisation. Looks to be stable so far, although there's still a problem with very small response packets - something I'll work on tomorrow. If you come up with something working well, I'm willing to merge it into 1.4, so please do not hesitate to submit your work in progress. You'll also get more testers. When I have something working and presentable, I'll submit the patch. OK that's perfect then. If you don't manage to sort out your issue with small packets, do not hesitate to post your work in progress to the list, it often helps a lot to work iteratively. Cheers, Willy
Re: Truncated health check response from real servers
Hello Willy. On 25 February 2010 20:25, Willy Tarreau w...@1wt.eu wrote: On Mon, Feb 15, 2010 at 10:05:57AM +, Nick Chalk wrote: On 13 February 2010 10:40, Willy Tarreau w...@1wt.eu wrote: Indeed, with MSG_PEEK we have no way to tell the connection was closed. For the time being, I've hacked together a patch to get our customer up and running. I've allocated a new static character buffer, to store the intermediate results from the real server. I'm relying on recv() returning a length of 0 to indicate the server has closed the connection - not sure if that's a reliable method, but it seems to be repeatable. It will not always work with MSG_PEEK. Sorry, I should have made it clear that I'd removed the MSG_PEEK flag. The other temporary solution will be to allocate a receive buffer to every server to store health check responses. Hmm, I hadn't considered that multiple checks could be running in parallel. I'll have a look at whether that can break my quick hack. Thanks, Nick. -- Nick Chalk. Loadbalancer.org Ltd. Phone: +44 (0)870 443 8779 http://www.loadbalancer.org/
Re: Truncated health check response from real servers
Hi Nick, On Tue, Mar 02, 2010 at 12:25:06PM +, Nick Chalk wrote: Hello Willy. On 25 February 2010 20:25, Willy Tarreau w...@1wt.eu wrote: On Mon, Feb 15, 2010 at 10:05:57AM +, Nick Chalk wrote: On 13 February 2010 10:40, Willy Tarreau w...@1wt.eu wrote: Indeed, with MSG_PEEK we have no way to tell the connection was closed. For the time being, I've hacked together a patch to get our customer up and running. I've allocated a new static character buffer, to store the intermediate results from the real server. I'm relying on recv() returning a length of 0 to indicate the server has closed the connection - not sure if that's a reliable method, but it seems to be repeatable. It will not always work with MSG_PEEK. Sorry, I should have made it clear that I'd removed the MSG_PEEK flag. The other temporary solution will be to allocate a receive buffer to every server to store health check responses. Hmm, I hadn't considered that multiple checks could be running in parallel. I'll have a look at whether that can break my quick hack. If your quick ack already works for one single check, then simply allocate a buffer for each server in cfgparse.c, and have the checks functions use that server-specific buffer instead of trash. You can even allocate smaller buffers than trash, because we don't necessarily need to receive and process large check responses. If you come up with something working well, I'm willing to merge it into 1.4, so please do not hesitate to submit your work in progress. You'll also get more testers. Thanks, Willy
Re: Truncated health check response from real servers
Hi Nick, On Mon, Feb 15, 2010 at 10:05:57AM +, Nick Chalk wrote: Hello Willy, Krzysztof. On 13 February 2010 10:40, Willy Tarreau w...@1wt.eu wrote: On Fri, Feb 12, 2010 at 05:47:41PM +0100, Krzysztof Ol??dzki wrote: There are several issues with the fix: - we need to check if connection is not closed, as it is pointless to use MSG_PEEK and restarting such check if there is no more data we are able to read Indeed, with MSG_PEEK we have no way to tell the connection was closed. For the time being, I've hacked together a patch to get our customer up and running. I've allocated a new static character buffer, to store the intermediate results from the real server. I'm relying on recv() returning a length of 0 to indicate the server has closed the connection - not sure if that's a reliable method, but it seems to be repeatable. It will not always work with MSG_PEEK. It will only return that if the server closes without sending any data. However if it sends some data first, every subsequent recv(MSG_PEEK) will return those data, so you'll never detect the close. That's what Krzysztof explained when he said that we'd rely on timeouts even when servers return short responses. Maybe we could try to use some non-portable methods to try to detect a close (eg: getsockopt(fd, SOL_SOCKET, SO_RCVBUF) to see if the input buffer is closed, etc...). But I don't even think it would work, and even if it did it would be terribly tricky. The other temporary solution will be to allocate a receive buffer to every server to store health check responses. This might not be that hard in fact, and maybe we can even do that during 1.4 maintenance as it would be a nice improvement with very little risk of regression. Regards, Willy
Re: Truncated health check response from real servers
Hello Willy, Krzysztof. On 13 February 2010 10:40, Willy Tarreau w...@1wt.eu wrote: On Fri, Feb 12, 2010 at 05:47:41PM +0100, Krzysztof Olędzki wrote: There are several issues with the fix: - we need to check if connection is not closed, as it is pointless to use MSG_PEEK and restarting such check if there is no more data we are able to read Indeed, with MSG_PEEK we have no way to tell the connection was closed. For the time being, I've hacked together a patch to get our customer up and running. I've allocated a new static character buffer, to store the intermediate results from the real server. I'm relying on recv() returning a length of 0 to indicate the server has closed the connection - not sure if that's a reliable method, but it seems to be repeatable. - some servers return empty description so increasing minimum response length prevents haproxy from accepting such checks. Of course, if you are not using such server, it should be safe to do it in your locally patched version, but we mustn't do it on a public version. In fact, we should re-parse the response each time we call recv(). As long as we don't find a complete response, we can wait. This still implies a non-trivial change to current code. I decided not to run the content check for every received packet, as I couldn't see an easy way to deal with the case where the string to match is split between two packets. Nick. -- Nick Chalk. Loadbalancer.org Ltd. Phone: +44 (0)870 443 8779 http://www.loadbalancer.org/
Re: Truncated health check response from real servers
On Fri, Feb 12, 2010 at 05:47:41PM +0100, Krzysztof Ol??dzki wrote: There are several issues with the fix: - we need to check if connection is not closed, as it is pointless to use MSG_PEEK and restarting such check if there is no more data we are able to read Indeed, with MSG_PEEK we have no way to tell the connection was closed. - some servers return empty description so increasing minimum response length prevents haproxy from accepting such checks. Of course, if you are not using such server, it should be safe to do it in your locally patched version, but we mustn't do it on a public version. In fact, we should re-parse the response each time we call recv(). As long as we don't find a complete response, we can wait. This still implies a non-trivial change to current code. - it may interfere with other, non http checks. Long term we should implement the Willy's idea and merge http session processing and http checks, but for now I'll try to fix it in 1.4 with respect to mentioned above problems. This part is even more important with ECV checks BTW. Regards, Willy
Re: Truncated health check response from real servers
On Wed, Feb 10, 2010 at 10:56:14PM +, Nick Chalk wrote: Thanks Willy, Krzysztof. 2010/2/10 Willy Tarreau w...@1wt.eu: Did you managed to fix the several remaining issues which could cause it to crash the process ? I believe so, following Cyril Bonté's suggestions last week. I'm still testing it, though. OK, we talked with Cyril about all the issues in this patch, so it's possible that you both finally got it right ! 2010/2/10 Krzysztof Ol??dzki o...@ans.pl: On 2010-02-10 23:02, Willy Tarreau wrote: Krzysztof Oledsky proposed a patch to make use of recv(MSG_PEEK) in order to leave incomplete data in kernel buffers instead of consuming it. I don't think it could have any side effect, you may want to try it. It was about 3-4 weeks ago on the list. Yep, it is worth to try it, but it is still a dirty fix. I have idea how to make it right, but haven't been able to find time to do it, yet. Thanks - I'll try that tomorrow, and report my findings. Thanks! What was confusing me was that the test would succeed a fair proportion of the time! That's typical of such packet-based parsing. In general, a good way to check if this is what you're seeing is to slow down the machine you run haproxy on (run a few while :; do :; done in parallel) then see if it starts working again, due to the fact that when haproxy gets the CPU, most of the time all packets have reached the kernel buffers. Regards, Willy