Hi Simon, I'm finally done with the review. That was quite long (spread over two weeks), and obviously less than what it might have taken you to do this work. Next time if you can send smaller chunks it might help a lot !
Due to the difficulty to assign enough continuous time on it, I used "git notes" to add notes to the commits, which allowed me to go back and forth in the review. So I have not commented inline the code but I have so few comments about the code by itself that it should not be an issue at all, I hope you don't mind. Otherwise ask me for details and I'll happily go into depth. I found one concern at the end about the external check that's probably worth a separate discussion thread. I'm suspecting that we should probably get the 20 first patches merged ASAP and discuss the last two after that. I'm appending the whole review here. This is a git log --reverse which shows my comments below the Notes: tags. Thanks very much for this work, I hope we can sort out all remaining points to get it merged into 1.5-dev20. Best regards, Willy --- commit 08eff21897f3f5d62f1360f309c5a71da6146afa Author: Simon Horman <[email protected]> Date: Sat Feb 23 08:34:13 2013 +0900 DOC: Clarify documentation of option lb-agent-chk Avoid referring to check-port as this is not a configuration parameter. Signed-off-by: Simon Horman <[email protected]> Notes: OK commit bff65d578c2e4212dfcbfa14ba549581ca42bc1d Author: Simon Horman <[email protected]> Date: Sun Feb 24 07:45:56 2013 +0900 CLEANUP: Make parameters of srv_downtime and srv_getinter const The parameters of srv_downtime and srv_getinter are not modified and thus may be const. Signed-off-by: Simon Horman <[email protected]> Notes: OK commit 9386a28b61cb7fb011ef2e710616b972f7bfe4e3 Author: Simon Horman <[email protected]> Date: Sat Feb 23 10:16:43 2013 +0900 MEDIUM: Split up struct server's check element This is in preparation for associating a agent check with a server which runs as well as the server's existing check. The split has been made by: * Moving elements of struct server's check element that will be shared by both checks into a new check_common element of struct server. * Moving the remaining elements to a new struct check and making struct server's check element a struct check. * Adding a server element to struct check, a back-pointer to the server element it is a member of. - At this time the server could be obtained using container_of, however, this will not be so easy once a second struct check element is added to struct server to accommodate an agent health check. Signed-off-by: Simon Horman <[email protected]> --- v6 * Correct use of srv->check_common.xprt ssl_sock_prepare_srv_ctx(). Previously srv->check.xprt was used which causes a build failure when compiled with make USE_OPENSSL=1 ... Thanks to Scott McKeown for discovering this. v2 - v5 * No change Notes: OK. commit ed27bc9796c5d0c06e1dc3dad4774510b3f1277a Author: Simon Horman <[email protected]> Date: Sun Feb 24 09:09:03 2013 +0900 MEDIUM: Move {,fast,down}inter to struct check Move {,fast,down}inter elements from struct server to struct check. This allows those elements of a check to be independent of the check's server. This is in preparation for associating a agent check with a server which runs as well as the server's existing check. Signed-off-by: Simon Horman <[email protected]> Notes: OK though could probably be merged with previous one commit 368f15d3ad805edbbc9070e708ce5b99624a8607 Author: Simon Horman <[email protected]> Date: Sun Feb 24 07:25:29 2013 +0900 MEDIUM: Move result element to struct check Move result element from struct server to struct check This allows check results to be independent of the check's server. This is in preparation for associating a agent check with a server which runs as well as the server's existing check. Signed-off-by: Simon Horman <[email protected]> Notes: OK. commit f0d5a3c54696b4abd284b363adfb616cad0837ff Author: Simon Horman <[email protected]> Date: Sat Feb 23 15:35:38 2013 +0900 MEDIUM: Paramatise functions over the check of a server Paramatise the following functions over the check of a server * set_server_down * set_server_up * srv_getinter * server_status_printf * set_server_check_status * set_server_disabled Generally the server parameter of these functions has been removed. Where it is still needed it is obtained using check->server. This is in preparation for associating a agent check with a server which runs as well as the server's existing check. By paramatising these functions they may act on each of the checks without further significant modification. Signed-off-by: Simon Horman <[email protected]> --- This is a rather large patch and I would be happy to split it up on request. v4 * Use check->type in process_check() Use check->type instead of s->proxy->options2 & PR_O2_CHK_ANY as the former is specific to the check being processed whereas the latter relates only to the primary check of a server. * Add type to struct check This is used to indicate the type of a check independent of its server's proxy's check type. v2 - v3 * No Change Notes: - still need to understand the SSP_O_HCHK -> check change - set_server_disabled() : takes a check arg, converts it to server then uses &s->check, seems suspicious or at least confusing (why not simply check ?) - also : for(srv = s->tracknext; srv; srv = srv->tracknext) - set_server_disabled(srv); + set_server_disabled(&s->check); isn't it &srv->check instead ? - it's nice to have a check being the owner of a connection instead of the server, looks much cleaner than what we used to have. - it should probably be made it more obvious what case is covered by !check->type (if relevant at this point) commit 8c7409c8a6ddb4f3629fd877935355e0902ebfa5 Author: Simon Horman <[email protected]> Date: Sat Feb 23 15:14:19 2013 +0900 MEDIUM: cfgparse: Factor out check initialisation This is in preparation for struct server having two elements of type struct check. Signed-off-by: Simon Horman <[email protected]> --- v5 * Remove server argument from init_check. It is not used. * Set type in init_check This allows a zero type to be used to indicate that a task for a secondary agent check should not be stated. Notes: OK commit 73f8958f7c982aa0d376ab5b6d4baeb0892a61c2 Author: Simon Horman <[email protected]> Date: Sun Feb 24 17:23:38 2013 +0900 MEDIUM: Add state to struct check Add state to struct check. This is currently used to store one bit, CHK_RUNNING, which is set if a check is running and clear otherwise. This bit was previously SRV_CHK_RUNNING of the state element of struct server. This is in preparation for associating a agent check with a server which runs as well as the server's existing check. Signed-off-by: Simon Horman <[email protected]> Notes: Please use CHK_STATE_RUNNING instead of CHK_RUNNING so that it's easier to know it is related to the state attribute of the check. Also, we have many HCHK_* and now CHK_* as well, so I wonder how/when this will become confusing. commit 8fe10dc6d602286db0270a0d1d27a3fa2907792e Author: Simon Horman <[email protected]> Date: Sun Feb 24 21:50:46 2013 +0900 MEDIUM: Add name element to struct check This is in preparation for associating a agent check with a server which runs as well as the server's existing check. Signed-off-by: Simon Horman <[email protected]> --- v5 * Rebase Notes: Is it worth adding a name field just to report it in the stats ? Don't we have everything required to display the proper name when dumping the stats ? (maybe not). commit 59d4d5ae53c2225a6001be9d526de6623dba0abd Author: Simon Horman <[email protected]> Date: Sun Feb 24 17:23:38 2013 +0900 MEDIUM: Move health element to struct check This is in preparation for associating a agent check with a server which runs as well as the server's existing check. Signed-off-by: Simon Horman <[email protected]> Notes: OK commit 4f7700d5da3c22444f16716d105e013ef69ad0b0 Author: Simon Horman <[email protected]> Date: Sun Feb 24 20:50:33 2013 +0900 MEDIUM: Add helper for task creation for checks This helper is in preparation for adding a second struct check element to struct server. Signed-off-by: Simon Horman <[email protected]> --- v5 * Rebase Notes: Cosmetic, but the cfgparse change just changes the previous patch, so it could have been merged there. Please change the "polarity" of the start_check_task() function or make it very clear that it returns an error code, or change the check below : - if (start_check_task(&s->check, mininter, nbcheck, srvpos++)) + if (start_check_task(&s->check, mininter, nbcheck, srvpos++) < 0) return -1 The problem is I'm reading it as "if we could start a check task, return -1". And yes, I know that all the kernel's networking code you're dealing with every day works this way so this can be confusing. Last cosmetic point, please keep srvpos++ out of the condition (I thought we lost it). I tend to discourage people from doing variable assignments in conditions because these are the places that are commented out the first during debugging sessions, causing the accidental loss of the assignment. Otherwise OK. commit cf63f97ef6fa2ef584caab595c06738262f3ac80 Author: Simon Horman <[email protected]> Date: Sat Feb 23 15:10:54 2013 +0900 MEDIUM: checks: Add supplementary agent checks Allow an agent check to be run in conjunction with one other server health check. If the backend for a server check is not lb-agent-chk then an agent check may also be run using the agent-check parameter to a server, which sets the TCP port to be used for the agent check. e.g. server web1_1 127.0.0.1:80 check agent-port 10000 The agent-inter parameter may also be used to specify the interval and timeout for agent checks. If either the health or agent check determines that a server is down then it is marked as being down, otherwise it is marked as being up. Signed-off-by: Simon Horman <[email protected]> --- v5 * Rebase for removal of server argument from init_check * Rebase for setting of type in init_check v4 * Increment global.maxsock for agent-port. If agent-port is configured then an extra socket is required. * Do not send requests to secondary agent checks The request configuration of a proxy relates to the primary health check and should not be sent to the secondary health check if it is in operation. * Correct usage of PR_O2_LB_AGENT_CHK The correct way to check for PR_O2_LB_AGENT_CHK is not to use x & PR_O2_LB_AGENT_CHK, but rather to use (x & PR_O2_CHK_ANY ) == PR_O2_LB_AGENT_CHK. v2 - v3 * No change Notes: The commit message confused me but fortunately the doc fixed me :-) I'm having an issue with using agent-port to automatically enable the check. We've done this mistake with the stats in the past. Specifying any stats setting enables the stats. The end result is that nobody sets the stats URL or passwords in a defaults section. Here we'll get the same issue. On large deployments, it's very likely that users will want to have "default-server agent-port 10000" in their defaults section, but then they will have no way to disable it for some servers. Thus I'd rather stick to the same logic we have for current health checks which consists in having the port being an independant and harmless setting, and have an "agent-check" keyword on the server lines to enable it where relevant. Please do not forget to mention the compatibility between agent-port and default-server in the doc. Also, just a stupid question, we've introduced option lb-agent-chk very recently and we're still in development. Don't you think it would be useful to remove it now before a release if agent-check can cover all its needs ? It would avoid having to deal with complex setups in the future. CCing the lb.org guys here who most likely use it already (and might probably plan a config migration). commit 8abb1b04fcb258a6004d37fe2dd2436cb121007a Author: Simon Horman <[email protected]> Date: Sun Mar 3 16:46:40 2013 +0900 MEDIUM: Add helper for agent check events. Break agent check handling out of event_srv_chk_r(). This is in preparation for supporting agent check results returned in an HTTP header. Signed-off-by: Simon Horman <[email protected]> Notes: OK commit 77950728605ced6093a829b08ffc5b7d5ecc0b6d Author: Simon Horman <[email protected]> Date: Sun Mar 3 16:22:45 2013 +0900 MEDIUM: Parser to allow matching of HTTP header Replace the current header parser, which simply skips the headers, with a version that allows matching of a key in a header. This is in preparation for supporting agent check results returned in an HTTP header. Signed-off-by: Simon Horman <[email protected]> Notes: I don't understand the parser :-( Some comments on the inputs/outputs of the function are probably needed since they're clearly not obvious. Also I fear that it does not support partial responses anymore, but I may be wrong (ie: wait for more data to come when the response is not complete). Anyway, this change must be tagged as MAJOR since it will definitely cause regressions in obscure places : I'm seeing the parser is stricter than it was, and people who implement script-based health checks tend to emit awful HTTP-like responses which certainly don't qualify as valid but which were accepted. So whatever we can do to optimise compatibility with previous checks should be considered (especially since we're speaking about checks and not production traffic). commit 0b829e506657e0d18cc8ffd64808f8cb0e379168 Author: Simon Horman <[email protected]> Date: Sun Mar 3 17:50:39 2013 +0900 MEDIUM: Add http-check agent-hdr option Allow agent checks to obtain information in an HTTP header of the response to an http check. Signed-off-by: Simon Horman <[email protected]> --- v4 * Do not duplicate code that is in process_result() v2 - v3 * No change Notes: OK in principle, but what does it do when agent-port is set ? Is there any conflict or do we face a race between the two competing checks ? In the doc : http-check agent-hdr Use the value of the given http header as the result of an agent check May be used in sections : defaults | frontend | listen | backend yes | no | yes | yes Arguments : <header> The header string to use to send the server name Please mention "<header>" on the syntax line itself : http-check agent-hdr <header> commit 410090bf97c2553da33f799c6dc313fb91b13ead Author: Simon Horman <[email protected]> Date: Wed Jun 5 18:06:07 2013 +0900 MEDIUM: Add DRAIN state and report it on the stats page Add a DRAIN sub-state for a server which will be shown on the stats page instead of UP if an agent check is in use and the agent has most recently returned "drain". Signed-off-by: Simon Horman <[email protected]> Notes: In stats_dump_sv_stats() we use 'state' to display a new state, but the CSS is not updated, so we'll end up with any formating or colorset. I think it would be good to use : - the same color as "UP 2/3" (yellow) for "DRAIN 2/3" - a light brown such as #CC9900 when the DRAIN is in effect Also please update the comments at the top of the function to reflect the new input range. commit 5a53bc5b4aebbfb499b96a5d3dadde514d885aaf Author: Simon Horman <[email protected]> Date: Wed Jun 26 23:38:56 2013 +0900 MEDIUM: Log agent fail, stopped or down as info In the case where an agent check returns fail, stopped or down, log this as info when logging the server status along with any trailing message returned by the agent after fail, stopped or down. Previously only the trailing message was logged as info and if omitted no info was logged. Signed-off-by: Simon Horman <[email protected]> Notes: OK commit 5ab56ee1510abfc79ec1f46657a49abb47da6824 Author: Simon Horman <[email protected]> Date: Wed Jun 5 18:05:05 2013 +0900 MEDIUM: Do not mark a server as down if the agent is unavailable In the case where agent-port is used and the agent check is a secondary check to not mark a server as down if the agent becomes unavailable. In this configuration the agent should only cause a server to be marked as down if the agent returns "fail", "stopped" or "down". Signed-off-by: Simon Horman <[email protected]> --- v4 * Never update health of agent on failure to connect Previously the health was recharged on such cases. But this causes the health to be recharged even if the most recent result from the agent was "fail", "stopped", or "down". The result was unnecessary logging of "fail", "stopped", or "down" results from the agent if they were interleaved with timeouts. Or in other words, if the agent responded only occasionally but consistently with "fail", "stopped", or "down". v3 * No change v2 * Only log agent server status if the change in status will affect the server's state Notes: While I absolutely agree with the goal described in the commit message, I fail to connect it to what I'm seeing in the code. I don't know if it is a result of the rework of the patches but for me all it does it apply a second DRAIN state whose purpose I don't understand. I'm seeing that we first set WILLDRAIN, then update the status, then add DRAIN on top of that, that remains unclear to me. I'm not sure what situation the 4 combinations of these flags correspond to. commit 3980cce19837da833cf0c5ced030060257b9df0c Author: Simon Horman <[email protected]> Date: Thu Jun 27 21:15:04 2013 +0900 MEDIUM: Set rise and fall of agent checks to 1 This is achieved by moving rise and fall from struct server to struct check. After this move the behaviour of the primary check, server->check is unchanged. However, the secondary agent check, server->agent now has independent rise and fall values each of which are set to 1. The result is that receiving "fail", "stopped" or "down" just once from the agent will mark the server as down. And receiving a weight just once will allow the server to be marked up if its primary check is in good health. This opens up the scope to allow the rise and fall values of the agent check to be configurable, however this has not been implemented at this stage. Signed-off-by: Simon Horman <[email protected]> --- v7 When checking health against rise in set_server_up the health of a check should be checked against the rise of the same check. Otherwise the condition for the s->agent may be false when it should be true as s->agent's rise and thus healthy health value is 1 whereas the default rise for check is likely to be 3, as check may be s->check and the the default value for s->check.rise is 3. v5 * Rebase Notes: One remaining "server" below : - int health; /* 0 to server->rise-1 = bad; - * rise to server->rise+server->fall-1 = good */ + int health; /* 0 to rise-1 = bad; + * rise to rise+server->fall-1 = good */ + int rise, fall; /* time in iterations */ Otherwise OK. commit fa2d91c61ce78d24143a74508d1dc8c85fca53f1 Author: Simon Horman <[email protected]> Date: Sat Jun 22 14:36:39 2013 +0900 MEDIUM: Add set agent pause|unpause unix socket command The syntax of this new command is: set agent <backend>/<server> pause|unpause This command changes the behaviour of agent checks as follows: In the case where an agent check is being run as a secondary check, due to the agent-port parameter of a server directive, new checks are only initialised when the agent is in the unpaused state. Thus, setting the agent to pause will prevent any new agent checks from begin initiated until the agent is set to unpaused. When set pause is in effect the processing of an agent check run as the primary check, due to either option lb-agent-check or http-check agent-hdr being set, or processing of a secondary agent check that was initiated while the agent was set as unpaused is as follows: All results that would alter the weight, specifically "drain" or a weight returned by the agent, are ignored. The processing of agent check is otherwise unchanged. The motivation for this option is to allow weight setting is to allow the weight changing effects of the agent checks to be paused to allow the weight of a server to be configured using set weight without being overridden by the agent. The default state is unpaused. Signed-off-by: Simon Horman <[email protected]> Notes: I think that instead of setting the agent status to pause/unpaused, we'd rather reuse the same principle as we currently have with servers : enable agent <backend>/<server> disable agent <backend>/<server> OK for the rest of course. commit e4cbb6e2d82ad7d037316ab0dc4932e16a3b4e14 Author: Simon Horman <[email protected]> Date: Tue Mar 13 14:02:40 2012 +0900 MEDIUM: Break out check establishment into establish_chk() This is in preparation for adding a new type of check that uses a process rather than a socket. Signed-off-by: Simon Horman <[email protected]> --- v4 * Use check->type in establish_conn_chk() Use check->type instead of s->proxy->options2 & PR_O2_CHK_ANY as the former is specific to the check being processed whereas the latter relates only to the primary check of a server. v2 - v3 * No change Notes: I'd rather call it connect_chk() since we already use the verb "connect" at other places for the same purpose. commit 23bc14cb0389d6ecb525c76766181ef01938a6cf Author: Simon Horman <[email protected]> Date: Tue Mar 13 14:03:22 2012 +0900 MEDIUM: Add external check Add an external check which makes use of an external process to check the status of a server. --- v6 * Correct implementation and documentation of arguments to external-check command so that they are consistent with both each other and ldirectord's external check. The motivation being to allow the same scripts to be used with both haproxy and ldirectord. v5 * Rebase v4 * Remove stray use of s->check in process_chk() The check parameter should be used throughout process_chk() * Layer 7 timeouts of agent checks should be ignored * Ensure that argc is never used uninitialised in prepare_external_check() v3 * Rebase: basically a rewrite of large sections of the code * Merge with the following patches + "external-check: Actually execute command" + "Allow selection of of external-check in configuration file" v2 * If the external command exits normally (WIFEXITED()) is true) then set the check's code to the exit status (WEXITSTATUS()) of the process. * Treat a timeout is a failure case rather than the test having passed * Remove duplicate getnameinfo() call in start_checks() * Remove duplicate assignment of sockaddr argument to getnameinfo(9 which caused the check port and check addr configuration of a server to be ignored. Notes: Problems : - don't use "option xxx <arg>" despite the old poor choice introduced with "option httpchk". Options are meant to be booleans. - we need some way to ensure that we can globally disable this feature or better, that it must be explicitly enabled, as it introduces a major security risk (APIs, ...) - probably that we need to have a global prefix path setting for all commands. - anonymous union does not build with gcc-2.95 (still supported) - please don't use stdbool (again), it is not portable and has already broke twice in the past. - getnameinfo is not portable either, better use str2sa() or equivalent (and remove include netdb) - I'm quite worried about the changes with has_conn in process_chk, they're touching a very sensible place, so I think it would be much better to have a distinct process_chk function for external checks (we can even rename process_chk to process_chk_conn) --------

