Re: [PATCH v6 00/22] Agent Check Enhancements and External Check

Willy Tarreau Fri, 06 Sep 2013 06:06:45 -0700

Hi Simon,

I'm finally done with the review. That was quite long (spread over two
weeks), and obviously less than what it might have taken you to do this
work. Next time if you can send smaller chunks it might help a lot !


Due to the difficulty to assign enough continuous time on it, I used
"git notes" to add notes to the commits, which allowed me to go back
and forth in the review. So I have not commented inline the code but
I have so few comments about the code by itself that it should not be
an issue at all, I hope you don't mind. Otherwise ask me for details
and I'll happily go into depth.

I found one concern at the end about the external check that's probably
worth a separate discussion thread. I'm suspecting that we should probably
get the 20 first patches merged ASAP and discuss the last two after that.

I'm appending the whole review here. This is a git log --reverse which
shows my comments below the Notes: tags.

Thanks very much for this work, I hope we can sort out all remaining
points to get it merged into 1.5-dev20.

Best regards,
Willy

---

commit 08eff21897f3f5d62f1360f309c5a71da6146afa
Author: Simon Horman <[email protected]>
Date:   Sat Feb 23 08:34:13 2013 +0900

    DOC: Clarify documentation of option lb-agent-chk
    
    Avoid referring to check-port as this is not a configuration parameter.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    OK

commit bff65d578c2e4212dfcbfa14ba549581ca42bc1d
Author: Simon Horman <[email protected]>
Date:   Sun Feb 24 07:45:56 2013 +0900

    CLEANUP: Make parameters of srv_downtime and srv_getinter const
    
    The parameters of srv_downtime and srv_getinter are not modified
    and thus may be const.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    OK

commit 9386a28b61cb7fb011ef2e710616b972f7bfe4e3
Author: Simon Horman <[email protected]>
Date:   Sat Feb 23 10:16:43 2013 +0900

    MEDIUM: Split up struct server's check element
    
    This is in preparation for associating a agent check
    with a server which runs as well as the server's existing check.
    
    The split has been made by:
    * Moving elements of struct server's check element that will
      be shared by both checks into a new check_common element
      of struct server.
    * Moving the remaining elements to a new struct check and
      making struct server's check element a struct check.
    * Adding a server element to struct check, a back-pointer
      to the server element it is a member of.
      - At this time the server could be obtained using
        container_of, however, this will not be so easy
        once a second struct check element is added to struct server
        to accommodate an agent health check.
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    v6
    * Correct use of srv->check_common.xprt ssl_sock_prepare_srv_ctx().
      Previously srv->check.xprt was used which causes a build failure
      when compiled with make USE_OPENSSL=1 ...
      Thanks to Scott McKeown for discovering this.
    
    v2 - v5
    * No change

Notes:
    OK.

commit ed27bc9796c5d0c06e1dc3dad4774510b3f1277a
Author: Simon Horman <[email protected]>
Date:   Sun Feb 24 09:09:03 2013 +0900

    MEDIUM: Move {,fast,down}inter to struct check
    
    Move {,fast,down}inter elements from struct server to struct check.
    This allows those elements of a check to be independent of the check's 
server.
    
    This is in preparation for associating a agent check
    with a server which runs as well as the server's existing check.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    OK though could probably be merged with previous one

commit 368f15d3ad805edbbc9070e708ce5b99624a8607
Author: Simon Horman <[email protected]>
Date:   Sun Feb 24 07:25:29 2013 +0900

    MEDIUM: Move result element to struct check
    
    Move result element from struct server to struct check
    This allows check results to be independent of the check's server.
    
    This is in preparation for associating a agent check
    with a server which runs as well as the server's existing check.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    OK.

commit f0d5a3c54696b4abd284b363adfb616cad0837ff
Author: Simon Horman <[email protected]>
Date:   Sat Feb 23 15:35:38 2013 +0900

    MEDIUM: Paramatise functions over the check of a server
    
    Paramatise the following functions over the check of a server
    
    * set_server_down
    * set_server_up
    * srv_getinter
    * server_status_printf
    * set_server_check_status
    * set_server_disabled
    
    Generally the server parameter of these functions has been removed.
    Where it is still needed it is obtained using check->server.
    
    This is in preparation for associating a agent check
    with a server which runs as well as the server's existing check.
    By paramatising these functions they may act on each of the checks
    without further significant modification.
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    This is a rather large patch and I would be happy to split it up
    on request.
    
    v4
    * Use check->type in process_check()
    
      Use check->type instead of s->proxy->options2 & PR_O2_CHK_ANY
      as the former is specific to the check being processed whereas
      the latter relates only to the primary check of a server.
    
    * Add type to struct check
    
      This is used to indicate the type of a check independent of
      its server's proxy's check type.
    
    v2 - v3
    * No Change

Notes:
    - still need to understand the SSP_O_HCHK -> check change
    - set_server_disabled() : takes a check arg, converts it to server then
      uses &s->check, seems suspicious or at least confusing (why not simply 
check ?)
    
    - also :
                    for(srv = s->tracknext; srv; srv = srv->tracknext)
    -                       set_server_disabled(srv);
    +                       set_server_disabled(&s->check);
    
           isn't it &srv->check instead ?
    
    - it's nice to have a check being the owner of a connection instead of the
      server, looks much cleaner than what we used to have.
    - it should probably be made it more obvious what case is covered by 
!check->type
      (if relevant at this point)

commit 8c7409c8a6ddb4f3629fd877935355e0902ebfa5
Author: Simon Horman <[email protected]>
Date:   Sat Feb 23 15:14:19 2013 +0900

    MEDIUM: cfgparse: Factor out check initialisation
    
    This is in preparation for struct server having two elements
    of type struct check.
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    v5
    * Remove server argument from init_check. It is not used.
    * Set type in init_check
    
      This allows a zero type to be used to indicate that
      a task for a secondary agent check should not be stated.

Notes:
    OK

commit 73f8958f7c982aa0d376ab5b6d4baeb0892a61c2
Author: Simon Horman <[email protected]>
Date:   Sun Feb 24 17:23:38 2013 +0900

    MEDIUM: Add state to struct check
    
    Add state to struct check. This is currently used to store one bit,
    CHK_RUNNING, which is set if a check is running and clear otherwise.
    This bit was previously SRV_CHK_RUNNING of the state element of struct
    server.
    
    This is in preparation for associating a agent check
    with a server which runs as well as the server's existing check.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    Please use CHK_STATE_RUNNING instead of CHK_RUNNING so that it's
    easier to know it is related to the state attribute of the check.
    
    Also, we have many HCHK_* and now CHK_* as well, so I wonder how/when
    this will become confusing.

commit 8fe10dc6d602286db0270a0d1d27a3fa2907792e
Author: Simon Horman <[email protected]>
Date:   Sun Feb 24 21:50:46 2013 +0900

    MEDIUM: Add name element to struct check
    
    This is in preparation for associating a agent check
    with a server which runs as well as the server's existing check.
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    v5
    * Rebase

Notes:
    Is it worth adding a name field just to report it in the stats ? Don't
    we have everything required to display the proper name when dumping
    the stats ? (maybe not).

commit 59d4d5ae53c2225a6001be9d526de6623dba0abd
Author: Simon Horman <[email protected]>
Date:   Sun Feb 24 17:23:38 2013 +0900

    MEDIUM: Move health element to struct check
    
    This is in preparation for associating a agent check
    with a server which runs as well as the server's existing check.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    OK

commit 4f7700d5da3c22444f16716d105e013ef69ad0b0
Author: Simon Horman <[email protected]>
Date:   Sun Feb 24 20:50:33 2013 +0900

    MEDIUM: Add helper for task creation for checks
    
    This helper is in preparation for adding a second struct check element
    to struct server.
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    v5
    * Rebase

Notes:
    Cosmetic, but the cfgparse change just changes the previous patch, so it 
could
    have been merged there.
    
    Please change the "polarity" of the start_check_task() function or make it
    very clear that it returns an error code, or change the check below :
    
    -          if (start_check_task(&s->check, mininter, nbcheck, srvpos++))
    +          if (start_check_task(&s->check, mininter, nbcheck, srvpos++) < 0)
                       return -1
    
    The problem is I'm reading it as "if we could start a check task, return 
-1".
    And yes, I know that all the kernel's networking code you're dealing with
    every day works this way so this can be confusing.
    
    Last cosmetic point, please keep srvpos++ out of the condition (I thought
    we lost it). I tend to discourage people from doing variable assignments
    in conditions because these are the places that are commented out the first
    during debugging sessions, causing the accidental loss of the assignment.
    
    Otherwise OK.

commit cf63f97ef6fa2ef584caab595c06738262f3ac80
Author: Simon Horman <[email protected]>
Date:   Sat Feb 23 15:10:54 2013 +0900

    MEDIUM: checks: Add supplementary agent checks
    
    Allow an agent check to be run in conjunction with one other server
    health check.
    
    If the backend for a server check is not lb-agent-chk then an agent
    check may also be run using the agent-check parameter to a server,
    which sets the TCP port to be used for the agent check.
    
    e.g.
    server      web1_1 127.0.0.1:80 check agent-port 10000
    
    The agent-inter parameter may also be used to specify the interval
    and timeout for agent checks.
    
    If either the health or agent check determines that a server is down
    then it is marked as being down, otherwise it is marked as being up.
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    v5
    * Rebase for removal of server argument from init_check
    * Rebase for setting of type in init_check
    
    v4
    *  Increment global.maxsock for agent-port.
    
       If agent-port is configured then an extra socket is required.
    
    * Do not send requests to secondary agent checks
    
      The request configuration of a proxy relates to the primary health
      check and should not be sent to the secondary health check if it
      is in operation.
    
    * Correct usage of PR_O2_LB_AGENT_CHK
    
      The correct way to check for PR_O2_LB_AGENT_CHK is
      not to use x & PR_O2_LB_AGENT_CHK, but rather to use
      (x & PR_O2_CHK_ANY ) == PR_O2_LB_AGENT_CHK.
    
    v2 - v3
    * No change

Notes:
    The commit message confused me but fortunately the doc fixed me :-)
    
    I'm having an issue with using agent-port to automatically enable the
    check. We've done this mistake with the stats in the past. Specifying
    any stats setting enables the stats. The end result is that nobody
    sets the stats URL or passwords in a defaults section.
    
    Here we'll get the same issue. On large deployments, it's very likely
    that users will want to have "default-server agent-port 10000" in their
    defaults section, but then they will have no way to disable it for some
    servers.
    
    Thus I'd rather stick to the same logic we have for current health checks
    which consists in having the port being an independant and harmless setting,
    and have an "agent-check" keyword on the server lines to enable it where
    relevant.
    
    Please do not forget to mention the compatibility between agent-port and
    default-server in the doc.
    
    Also, just a stupid question, we've introduced option lb-agent-chk very
    recently and we're still in development. Don't you think it would be
    useful to remove it now before a release if agent-check can cover all
    its needs ? It would avoid having to deal with complex setups in the
    future. CCing the lb.org guys here who most likely use it already (and
    might probably plan a config migration).

commit 8abb1b04fcb258a6004d37fe2dd2436cb121007a
Author: Simon Horman <[email protected]>
Date:   Sun Mar 3 16:46:40 2013 +0900

    MEDIUM: Add helper for agent check events.
    
    Break agent check handling out of event_srv_chk_r().
    
    This is in preparation for supporting agent check results
    returned in an HTTP header.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    OK

commit 77950728605ced6093a829b08ffc5b7d5ecc0b6d
Author: Simon Horman <[email protected]>
Date:   Sun Mar 3 16:22:45 2013 +0900

    MEDIUM: Parser to allow matching of HTTP header
    
    Replace the current header parser, which simply skips the headers,
    with a version that allows matching of a key in a header.
    
    This is in preparation for supporting agent check results
    returned in an HTTP header.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    I don't understand the parser :-( Some comments on the inputs/outputs of the
    function are probably needed since they're clearly not obvious.
    
    Also I fear that it does not support partial responses anymore, but I may
    be wrong (ie: wait for more data to come when the response is not complete).
    
    Anyway, this change must be tagged as MAJOR since it will definitely cause
    regressions in obscure places : I'm seeing the parser is stricter than it
    was, and people who implement script-based health checks tend to emit awful
    HTTP-like responses which certainly don't qualify as valid but which were
    accepted. So whatever we can do to optimise compatibility with previous
    checks should be considered (especially since we're speaking about checks
    and not production traffic).

commit 0b829e506657e0d18cc8ffd64808f8cb0e379168
Author: Simon Horman <[email protected]>
Date:   Sun Mar 3 17:50:39 2013 +0900

    MEDIUM: Add http-check agent-hdr option
    
    Allow agent checks to obtain information in
    an HTTP header of the response to an http check.
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    v4
    * Do not duplicate code that is in process_result()
    
    v2 - v3
    * No change

Notes:
    OK in principle, but what does it do when agent-port is set ?
    Is there any conflict or do we face a race between the two competing
    checks ?
    
    In the doc :
    
     http-check agent-hdr
       Use the value of the given http header as the result of an agent check
       May be used in sections :   defaults | frontend | listen | backend
                                     yes   |    no    |   yes  |   yes
       Arguments :
    
         <header>  The header string to use to send the server name
    
    Please mention "<header>" on the syntax line itself :
    
     http-check agent-hdr <header>

commit 410090bf97c2553da33f799c6dc313fb91b13ead
Author: Simon Horman <[email protected]>
Date:   Wed Jun 5 18:06:07 2013 +0900

    MEDIUM: Add DRAIN state and report it on the stats page
    
    Add a DRAIN sub-state for a server which
    will be shown on the stats page instead of UP if
    an agent check is in use and the agent has most recently returned "drain".
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    In stats_dump_sv_stats() we use 'state' to display a new state, but the
    CSS is not updated, so we'll end up with any formating or colorset. I
    think it would be good to use :
      - the same color as "UP 2/3" (yellow) for "DRAIN 2/3"
      - a light brown such as #CC9900 when the DRAIN is in effect
    
    Also please update the comments at the top of the function to reflect
    the new input range.

commit 5a53bc5b4aebbfb499b96a5d3dadde514d885aaf
Author: Simon Horman <[email protected]>
Date:   Wed Jun 26 23:38:56 2013 +0900

    MEDIUM: Log agent fail, stopped or down as info
    
    In the case where an agent check returns fail, stopped or down,
    log this as info when logging the server status along with any
    trailing message returned by the agent after fail, stopped or down.
    
    Previously only the trailing message was logged as info and
    if omitted no info was logged.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    OK

commit 5ab56ee1510abfc79ec1f46657a49abb47da6824
Author: Simon Horman <[email protected]>
Date:   Wed Jun 5 18:05:05 2013 +0900

    MEDIUM: Do not mark a server as down if the agent is unavailable
    
    In the case where agent-port is used and the agent
    check is a secondary check to not mark a server as down
    if the agent becomes unavailable.
    
    In this configuration the agent should only cause a server to be marked
    as down if the agent returns "fail", "stopped" or "down".
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    v4
    * Never update health of agent on failure to connect
    
      Previously the health was recharged on such cases.
      But this causes the health to be recharged even
      if the most recent result from the agent was "fail", "stopped", or "down".
    
      The result was unnecessary logging of "fail", "stopped", or "down"
      results from the agent if they were interleaved with timeouts.
      Or in other words, if the agent responded only occasionally
      but consistently with "fail", "stopped", or "down".
    
    v3
    * No change
    
    v2
    * Only log agent server status if the change in status
      will affect the server's state

Notes:
    While I absolutely agree with the goal described in the commit message,
    I fail to connect it to what I'm seeing in the code. I don't know if it
    is a result of the rework of the patches but for me all it does it apply
    a second DRAIN state whose purpose I don't understand. I'm seeing that
    we first set WILLDRAIN, then update the status, then add DRAIN on top of
    that, that remains unclear to me. I'm not sure what situation the 4
    combinations of these flags correspond to.

commit 3980cce19837da833cf0c5ced030060257b9df0c
Author: Simon Horman <[email protected]>
Date:   Thu Jun 27 21:15:04 2013 +0900

    MEDIUM: Set rise and fall of agent checks to 1
    
    This is achieved by moving rise and fall from struct server to struct check.
    
    After this move the behaviour of the primary check, server->check is
    unchanged. However, the secondary agent check, server->agent now has
    independent rise and fall values each of which are set to 1.
    
    The result is that receiving "fail", "stopped" or "down" just once from the
    agent will mark the server as down. And receiving a weight just once will
    allow the server to be marked up if its primary check is in good health.
    
    This opens up the scope to allow the rise and fall values of the agent
    check to be configurable, however this has not been implemented at this
    stage.
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    v7
    When checking health against rise in set_server_up the health of a check
    should be checked against the rise of the same check. Otherwise the 
condition
    for the s->agent may be false when it should be true as s->agent's rise and
    thus healthy health value is 1 whereas the default rise for check is likely 
to
    be 3, as check may be s->check and the the default value for s->check.rise 
is 3.
    
    v5
    * Rebase

Notes:
    One remaining "server" below :
    
    -       int health;                             /* 0 to server->rise-1 = 
bad;
    -                                                * rise to 
server->rise+server->fall-1 = good */
    +       int health;                             /* 0 to rise-1 = bad;
    +                                                * rise to 
rise+server->fall-1 = good */
    +       int rise, fall;                         /* time in iterations */
    
    Otherwise OK.

commit fa2d91c61ce78d24143a74508d1dc8c85fca53f1
Author: Simon Horman <[email protected]>
Date:   Sat Jun 22 14:36:39 2013 +0900

    MEDIUM: Add set agent pause|unpause unix socket command
    
    The syntax of this new command is:
    
    set agent <backend>/<server> pause|unpause
    
    This command changes the behaviour of agent checks as follows:
    
    In the case where an agent check is being run as a secondary check,
    due to the agent-port parameter of a server directive, new checks
    are only initialised when the agent is in the unpaused state. Thus,
    setting the agent to pause will prevent any new agent checks from
    begin initiated until the agent is set to unpaused.
    
    When set pause is in effect the processing of an agent check run as the
    primary check, due to either option lb-agent-check or http-check
    agent-hdr being set, or processing of a secondary agent check that was
    initiated while the agent was set as unpaused is as follows: All results
    that would alter the weight, specifically "drain" or a weight returned by
    the agent, are ignored. The processing of agent check is otherwise
    unchanged.
    
    The motivation for this option is to allow weight setting is to allow the
    weight changing effects of the agent checks to be paused to allow the
    weight of a server to be configured using set weight without being
    overridden by the agent.
    
    The default state is unpaused.
    
    Signed-off-by: Simon Horman <[email protected]>

Notes:
    I think that instead of setting the agent status to pause/unpaused, we'd
    rather reuse the same principle as we currently have with servers :
    
         enable agent <backend>/<server>
         disable agent <backend>/<server>
    
    OK for the rest of course.

commit e4cbb6e2d82ad7d037316ab0dc4932e16a3b4e14
Author: Simon Horman <[email protected]>
Date:   Tue Mar 13 14:02:40 2012 +0900

    MEDIUM: Break out check establishment into establish_chk()
    
    This is in preparation for adding a new type of check that
    uses a process rather than a socket.
    
    Signed-off-by: Simon Horman <[email protected]>
    
    ---
    
    v4
    * Use check->type in establish_conn_chk()
    
      Use check->type instead of s->proxy->options2 & PR_O2_CHK_ANY
      as the former is specific to the check being processed whereas
      the latter relates only to the primary check of a server.
    
    v2 - v3
    * No change

Notes:
    I'd rather call it connect_chk() since we already use the verb
    "connect" at other places for the same purpose.

commit 23bc14cb0389d6ecb525c76766181ef01938a6cf
Author: Simon Horman <[email protected]>
Date:   Tue Mar 13 14:03:22 2012 +0900

    MEDIUM: Add external check
    
    Add an external check which makes use of an external process to
    check the status of a server.
    
    ---
    
    v6
    * Correct implementation and documentation of arguments to external-check
      command so that they are consistent with both each other and ldirectord's
      external check. The motivation being to allow the same scripts to be used
      with both haproxy and ldirectord.
    
    v5
    * Rebase
    
    v4
    * Remove stray use of s->check in process_chk()
      The check parameter should be used throughout process_chk()
    * Layer 7 timeouts of agent checks should be ignored
    * Ensure that argc is never used uninitialised in prepare_external_check()
    
    v3
    * Rebase: basically a rewrite of large sections of the code
    * Merge with the following patches
      + "external-check: Actually execute command"
      + "Allow selection of of external-check in configuration file"
    
    v2
    * If the external command exits normally (WIFEXITED()) is true)
      then set the check's code to the exit status (WEXITSTATUS())
      of the process.
    * Treat a timeout is a failure case rather than the test having passed
    * Remove duplicate getnameinfo() call in start_checks()
    * Remove duplicate assignment of sockaddr argument to getnameinfo(9
      which caused the check port and check addr configuration of
      a server to be ignored.

Notes:
    Problems :
      - don't use "option xxx <arg>" despite the old poor choice introduced
        with "option httpchk". Options are meant to be booleans.
      - we need some way to ensure that we can globally disable this feature
        or better, that it must be explicitly enabled, as it introduces a
        major security risk (APIs, ...)
      - probably that we need to have a global prefix path setting for all
        commands.
      - anonymous union does not build with gcc-2.95 (still supported)
      - please don't use stdbool (again), it is not portable and has already
        broke twice in the past.
      - getnameinfo is not portable either, better use str2sa() or equivalent
        (and remove include netdb)
      - I'm quite worried about the changes with has_conn in process_chk,
        they're touching a very sensible place, so I think it would be
        much better to have a distinct process_chk function for external
        checks (we can even rename process_chk to process_chk_conn)

--------

Re: [PATCH v6 00/22] Agent Check Enhancements and External Check

Reply via email to