On Fri, 2009-05-15 at 09:25 +1000, Simon Horman wrote: > On Thu, May 14, 2009 at 05:35:27PM -0400, Sean Millichamp wrote: > > This implements a global or per-virtual service option indicating how > > many successive check failures must occur before ldirectord will remove > > (or quiesce) the real server. A check success will reset the counter of > > failures.
> I wonder if it would be better for it to replace the existing > checkcount implementation. Which seems to be the same but only implemented > for ping checks. failurecount does seem to be a better name, perhaps > it could be an alias for checkcount or vice versa? I agree that failurecount could probably replace checkcount with one caveat: If administrators have a checkcount set globally that they expect to only impact pings and we alias it to impact everything then an update will change the behavior they had before. Now, you could argue that an administrator of a Linux load balancer running ldirectord should know better than to upgrade a key piece of software without reading the documentation and changelogs. I would agree, but surely someone will upgrade without doing that level of investigation. I feel like the right approach is to leave both included, for now, and maybe mark checkcount as deprecated with a configuration warning that it will be removed in an future release. Thoughts? > > +The number of consecutive times a failure will have to be reported by a > > +check before the realserver is removed from the kernel's LVS table. A > > +value of 1 will have the realserver removed on the first failure. A > > +successful check will reset the failure counter to 0. > > I think it would be better to say something like "before the realserver > is considred to have failed" as if quiescent=yes the server won't > actually be removed from the kernel's LVS table. Agreed. [crazy patch snipped] > This logic looks a bit hairy, though not unexpectedly so. > How well tested is it? Indeed! It took me a while to unravel exactly what was going on and how it needed to behave. We have been running this patch for about 2 weeks now without issues. Our heavily loaded servers will still fail checks periodically (as evidenced by output in the monitorfile) but they don't get removed unless the checks continue to fail. Also, prior to deploying it I tried to simulate a number of failure scenarios that we might see in our environment (quiescent=yes, fork=yes, connect and negotiate check types) and all worked as desired. I am attaching a patch on the previous patch that adds deprecation warnings and cleans up the failurecount documentation (I don't see a way in hg to export a combined patch). Sean
# HG changeset patch # User Sean E. Millichamp <[email protected]> # Date 1242400953 14400 # Node ID bb6b83aafa92492e70658f3a9dfc813a910be64a # Parent fdfc9ac249a2497eda1407c88bef5ce102086cf3 ldirectord: Add checkcount deprecation warnings and minor failurecount doc cleanup. Signed-Off-By: Sean E. Millichamp <[email protected]> diff -r fdfc9ac249a2 -r bb6b83aafa92 ldirectord/ldirectord.in --- a/ldirectord/ldirectord.in Mon May 11 14:44:12 2009 -0400 +++ b/ldirectord/ldirectord.in Fri May 15 11:22:33 2009 -0400 @@ -155,6 +155,9 @@ B<checkcount = >I<n> +This option is deprecated and slated for removal in a future version. +Please see the 'failurecount' option. + The number of times a check will be attmpted before it is considered to have failed. Only works with ping checks. Note that the checktimeout/negotiatetimeout is additive, so if a connect check is used, @@ -168,9 +171,9 @@ B<failurecount = >I<n> The number of consecutive times a failure will have to be reported by a -check before the realserver is removed from the kernel's LVS table. A -value of 1 will have the realserver removed on the first failure. A -successful check will reset the failure counter to 0. +check before the realserver is considered to have failed. A +value of 1 will have the realserver considered failed on the first failure. +A successful check will reset the failure counter to 0. If defined in a virtual server section then the global value is overriden. @@ -1293,6 +1296,7 @@ } elsif ($rcmd =~ /^checkcount\s*=\s*(.*)/){ $1 =~ /(\d+)/ && $1 or &config_error($line, "invalid check count"); $vsrv{checkcount} = $1; + &config_warn($line, "checkcount option is deprecated and slated for removal. please see 'failurecount'"); } elsif ($rcmd =~ /^failurecount\s*=\s*(.*)/){ $1 =~ /(\d+)/ && $1 or &config_error($line, "invalid failure count"); $vsrv{failurecount} = $1; @@ -1504,6 +1508,7 @@ $1 =~ /(\d+)/ && $1 or &config_error($line, "invalid check count value"); $CHECKCOUNT = $1; + &config_warn($line, "checkcount option is deprecated and slated for removal. please see 'failurecount'"); } elsif ($linedata =~ /^failurecount\s*=\s*(.*)/) { $1 =~ /(\d+)/ && $1 or &config_error($line, "invalid failure count value");
_______________________________________________ Please read the documentation before posting - it's available at: http://www.linuxvirtualserver.org/ LinuxVirtualServer.org mailing list - [email protected] Send requests to [email protected] or go to http://lists.graemef.net/mailman/listinfo/lvs-users
