Forgot to reply to all:

Willy,

This looks good to me and makes sense.
Long term it will be more flexible this way.

On 4 December 2013 18:17, Willy Tarreau <[email protected]> wrote:
> Hi Malcolm,
>
> On Wed, Dec 04, 2013 at 03:05:41PM +0000, Malcolm Turnbull wrote:
>> Hi Willy,
>>
>> Sorry for the lack of response from the Loadbalancer.org end, I must
>> confess we were getting a bit confused by the descriptions :-).
>
> I'm not surprised! I got even more confused when trying to debug some
> of the issues Igor reported and not understanding what would act on
> what, what would be propagated from tracked servers, etc... Anyway,
> writing the design limitations here and explaining them helps us
> get rid of them.
>
>> The only thing in my mind to be aware of is the design decision of the
>> agent to report DOWN or DRAIN on every agent request until the agent
>> starts responding with x% again..
>> Was because if you send an UP response from the agent how does the
>> agent know that HAProxy has read that value and acted on it? It would
>> need to know when it was safe to start responding with x% again?
>
> OK I get your point. My point was to emit two things at once.
> Eg: "UP 10%".
>
> We could have the agent specification state that the response format
> may include optional state words, optionally followed by a weight.
> That way we can have agents which return state only, weight only or
> both.
>
>> Our primary requirement at Loadbalancer.org is for the first scenario
>> i.e. dynamic weight adjustment and uses standard health checks:
>>
>>  - inform the load balancer about the server's load to adjust the
>>    weights, but not interact with the service's state which is
>>    monitored using regular checks. It basically replaces the job
>>    of the admin who would constantly re-adjust weights depending
>>    on the servers load.
>
> I agree that this should be by far the most common use especially in
> combination with the service check.
> That's the reason why I'm embarrassed
> by the fact that we put the server UP when returning a percentage because
> it means the agent returning the load has to be aware of the service state
> which is not logical.
>
>> The following usage case makes sense, but isn't really a priority for us:
>>
>>  - offer a complete health check system to services which are not
>>    easily checkable. In this case they would simply be used without
>>    a regular check. This is more a service-level approach and not
>>    a server-level one.
>
> It's not my priority either though I know some people will want it when
> they already have to use an agent and need to deploy a second script to
> check the health of a specific service : they won't find it convenient
> to run two scripts on different ports, one for the state and one for the
> load.
>
>> The third logical function for us was:
>>
>> For a Windows administrator to have a simple GUI DRAIN/HALT button in
>> the agent, to enable quick local maintenance on the Windows backend
>> server without having to log into the load balancer in order to set
>> maintenance mode.
>
> Hehe, just like the 404 feature in HTTP :-)
>
>> But again this is not really a priority with us as you say it clashes
>> with the CLI DRAIN logic....
>
> It does not exactly clash, it depends how we define it. I discovered there
> are 3 dimensions which are managed by a single agent while we initially
> thought there were only two. The agent can :
>
>  - declare a service's state (up or down)
>  - declare an administrative state (drain/ready)
>  - declare a system load (weight)
>
> But at the moment with the language we defined, each action changes two
> of them at once, which is a big problem.
>
> And depending on what system the agent will be deployed on, not all these
> features will be used together. I expect that admin state and load will be
> the more common ones for an agent. Your enumeration tends to support this.
>
> So let's try with something like this for the agent syntax :
>
>     [keywords]* [weight]
>
> Where [keywords] are optional and made of :
>
>   "up" : report that the service is UP.
>   "down", "stopped", "fail" : report the service down with these causes
>   "drain" : don't change the state, nor the weight, just set DRAIN mode.
>   "maint" : don't change the state, nor the weight, just set MAINT mode
>   "ready" : don't change the state, nor the weight, just leave MAINT and
>             DRAIN modes.
>
> And [weight] is optional and in the form "xxx%" to report the desired
> weight for this server relative to the configured one in the config.
>
> Thus the following examples might illustrate it better :
>
>   "up" : declare the server up, don't change the configured weight
>   "up 50%" : declare the server up, set weight to 50%
>   "50%" : don't touch the server state, just set the weight to 50%
>   "drain" : don't touch the state, nor weight, just switch to drain mode.
>   "maint" : force maintenance mode.
>   "drain 20%" : drain mode, adjust weight to 20% (not used in this mode but
>                 will avoid complex logics in agent scripts)
>   "ready 30%" : leave maint/drain modes, start at 30% weight.
>   "up ready 40%" : the agent does the 3 things at once and says the service
>                    is OK.
>   "stopped drain 10%" : the agent does the 3 things at once and indicates
>                         that the server is now down after drain mode.
>
> I remember we initially refrained from allowing the "maint" mode from the
> agent in its first version because it was planned as a regular check and
> we didn't want it to be stuck in this mode. But now that the agent runs
> on its own, it makes much more sense since it will continue to be checked.
>
> With this, we can also consider that if a regular check is configured on the
> server, then the state changes are ignored from the agent. This greatly
> simplifies deployments relying on a single agent for multiple services
> even if this agent was initially deployed for a specific service.
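
To make the proposed "[keywords]* [weight]" syntax concrete, here is a rough Python sketch of how a reply line might be split into the three dimensions (service state, admin state, weight). It is only an illustration, not HAProxy code: the names AgentReply and parse_agent_reply are made up, and matching case-insensitively, requiring the weight to be the last word, and rejecting unknown words are all assumptions the proposal does not spell out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Keyword groups as listed in the proposal above.
SERVICE_WORDS = {"up", "down", "stopped", "fail"}
ADMIN_WORDS = {"drain", "maint", "ready"}
WEIGHT_RE = re.compile(r"(\d+)%")


@dataclass
class AgentReply:
    """Hypothetical container: None means 'leave this dimension untouched'."""
    service: Optional[str] = None
    admin: Optional[str] = None
    weight_pct: Optional[int] = None


def parse_agent_reply(line: str) -> AgentReply:
    # Assumption: keywords are matched case-insensitively.
    tokens = line.strip().lower().split()
    reply = AgentReply()

    # The weight is optional and, per the grammar, comes last.
    if tokens:
        m = WEIGHT_RE.fullmatch(tokens[-1])
        if m:
            reply.weight_pct = int(m.group(1))
            tokens.pop()

    # Everything that remains must be a known keyword.
    for word in tokens:
        if word in SERVICE_WORDS:
            reply.service = word
        elif word in ADMIN_WORDS:
            reply.admin = word
        else:
            # Assumption: unknown words are an error rather than ignored.
            raise ValueError(f"unknown agent keyword: {word!r}")
    return reply
```

With this sketch, "50%" only sets the weight, "drain" only sets the admin state, and "stopped drain 10%" sets all three, which matches the examples in the quoted list.
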
>
> We would have to improve the CLI and the stats interface to match that. We'd
> change the "soft stop" in the stats interface to act on the DRAIN mode instead
> of the weight. It would provide the same effect as today but in a more
> consistent way.
>
> Proceeding like this, I can easily imagine that most agents will simply
> read a small file containing the admin state (maint/drain/ready) and
> that others will only report the idle CPU measure.
>
> What do you think ?
>
> Thanks,
> Willy
>

--
Regards,

Malcolm Turnbull.

Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/
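
For the agent side, the "small file containing the admin state" plus "idle CPU measure" idea quoted above could look roughly like the Python sketch below. The helper names (read_admin_state, build_reply) are made up for illustration, and using the idle CPU percentage directly as the weight, clamped to 0..100, is an assumption rather than anything agreed in this thread. How the idle figure is measured is left out on purpose.

```python
from pathlib import Path
from typing import Optional

# Admin keywords from the proposed agent syntax.
ADMIN_WORDS = {"drain", "maint", "ready"}


def read_admin_state(path: Path) -> Optional[str]:
    """Return the admin keyword stored in the file, or None.

    Assumption: a missing, unreadable or unrecognised file means
    'say nothing about the admin state'.
    """
    try:
        word = path.read_text().strip().lower()
    except OSError:
        return None
    return word if word in ADMIN_WORDS else None


def build_reply(admin: Optional[str], idle_pct: Optional[float]) -> str:
    """Assemble a reply in the '[keywords]* [weight]' form."""
    parts = []
    if admin is not None:
        parts.append(admin)
    if idle_pct is not None:
        # Assumption: idle CPU % is used as the weight %, clamped to 0..100.
        weight = max(0, min(100, round(idle_pct)))
        parts.append(f"{weight}%")
    return " ".join(parts)
```

An agent reporting only load would call build_reply(None, idle) and never touch the service or admin state; one driven by the state file alone would call build_reply(read_admin_state(path), None).
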

