Just my tuppence... Would it not be clearer to add an additional command to implement your proposal? E.g. "add-manager" and possibly "destroy/remove-manager". This could also support switches for later fine control, and would possibly be less open to misinterpretation than overloading the add-machine command.
Nick

On Wed, Nov 6, 2013 at 6:49 PM, roger peppe <rogpe...@gmail.com> wrote:
> The current plan is to have a single "juju ensure-ha-state"
> command. This would create new state server machines if there are fewer
> than the required number (currently 3).
>
> Taking that as given, I'm wondering what we should do
> in the future, when users require more than a single
> big On switch for HA.
>
> How does the user:
>
> a) know about the HA machines so the costs of HA are not hidden, and that
> the implications of particular machine failures are clear?
>
> b) fix the system when a machine dies?
>
> c) scale up the system to x thousand nodes?
>
> d) scale down the system?
>
> For a), we could tag a machine in the status as a "state server", and
> hope that the user knows what that means.
>
> For b) the suggestion is that the user notices that a state server machine
> is non-responsive (as marked in status) and runs destroy-machine on it,
> which will notice that it's a state server machine and automatically
> start another one to replace it. Destroy-machine would refuse to work
> on a state server machine that seems to be alive.
>
> For c) we could add a flag to ensure-ha-state suggesting a desired number
> of state-server nodes.
>
> I'm not sure what the suggestion is for d) given that we refuse to
> destroy live state-server machines.
>
> Although ensure-ha-state might be a fine way to turn
> on HA initially, I'm not entirely happy with expanding it to cover
> all the above cases. It seems to me like we're going
> to create a leaky abstraction that purports to be magic ("just wave the
> HA wand!") and ends up being limiting, and in some cases confusing
> ("Huh?
> I asked to destroy that machine and there's another one
> just been created")
>
> I believe that any user that's using HA will need to understand that
> some machines are running state servers, and when things fail, they
> will need to manage those machines individually (for example by calling
> destroy-machine).
>
> I also think that the solution to c) is limiting, because there is
> actually no such thing as a "state server" - we have at least three
> independently scalable juju components (the database servers (mongodb),
> the API servers and the environment managers) with different scaling
> characteristics. I believe that in any sufficiently large environment,
> the user will not want to scale all of those at the same rate. For example
> MongoDB will allow at most 12 members of a replica set, but a caching API
> server could potentially usefully scale up much higher than that. We could
> add more flags to ensure-ha-state (e.g. --state-server-count) but then
> we'd lack the capability to suggest which might be grouped with which.
>
> PROPOSAL
>
> My suggestion is that we go for a "slightly less magic" approach
> that provides the user with the tools to manage
> their own high availability set up, adding appropriate automation in time.
>
> I suggest that we let the user know that machines can run as juju server
> nodes, and provide them with the capability to *choose* which machines
> will run as server nodes and which can host units - that is, what *jobs*
> a machine will run.
>
> Here's a possible proposal:
>
> We already have an "add-machine" command. We'd add a "--jobs" flag
> to allow the user to specify the jobs that the new machine(s) will
> run. Initially we might have just two jobs, "manager" and "unit"
> - the machine can either host service units, or it can manage the
> juju environment (including running the state server database),
> or both.
> In time we could add finer levels of granularity to allow
> separate scalability of juju server components, without losing backwards
> compatibility.
>
> If the new machine is marked as a "manager", it would run a mongo
> replica set peer. This *would* mean that it would be possible to have
> an even number of mongo peers, with the potential for a split vote
> if the nodes were partitioned evenly, and resulting database stasis.
> I don't *think* that would actually be a severe problem in practice.
> We would make juju status point out the potential problem very clearly,
> just as it should point out the potential problem if one member of an
> existing odd-sized replica set dies. The potential problems are the same
> in both cases, and are straightforward for even a relatively naive user
> to avoid.
>
> Thus, juju ensure-ha-state is almost equivalent to:
>
>     juju add-machine --jobs manager -n 2
>
> In my view, this command feels less "magic" than ensure-ha-state - the
> runtime implications (e.g. cost) of what's going on are easier for the
> user to understand and it requires no new entities in a user's model of
> the system.
>
> In addition to the new add-machine flag, we'd add a single new command,
> "juju machine-jobs", which would allow the user to change the jobs
> associated with an existing machine. That could be a later addition -
> it's not necessary in the first cut.
>
> With these primitives, I *think* the responsibilities of the system and
> the model to the user become clearer. Looking back to the original
> user questions:
>
> a) The "state manager" status of certain machines in the status is no
> longer something entirely divorced from user control - it means something
> in terms of the commands the user is provided with.
>
> b) The user already knows about destroy-machine. They can manage broken
> state manager machines just as they would manage any other broken machine.
> Destroy-machine would refuse to destroy any state server machine that
> would take the currently connected set of mongo peers below a majority.
>
> c) We already have add-machine.
>
> d) We already have destroy-machine. See c) above.
>
> REQUEST FOR COMMENTS
>
> If there is broad agreement on the above, then I propose that
> we start off by implementing ensure-ha-state with as little
> internal logic as possible - we don't necessarily need to put the
> transactional logic in to make sure it works without starting extra
> machines when many people are calling ensure-ha-state concurrently,
> for example.
>
> In fact, ensure-ha-state could probably be written as a thin layer
> on top of add-machine --jobs.
>
> Thoughts?
>
> --
> Juju-dev mailing list
> Juju-dev@lists.ubuntu.com
> Modify settings or unsubscribe at:
> https://lists.ubuntu.com/mailman/listinfo/juju-dev
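
For what it's worth, the majority rule in b) and the even-split worry about an even number of mongo peers can be sketched in a few lines of Python. This is only an illustration, not juju code: the function names are made up, and it assumes the majority is counted against the replica set size before the machine is removed.

```python
def has_majority(connected: int, members: int) -> bool:
    """Return True if `connected` peers are a strict majority of `members`."""
    return connected > members // 2


def can_destroy(members: int, connected: int, target_connected: bool) -> bool:
    """Hypothetical destroy-machine check for a manager machine.

    Assumption: the majority is measured against the replica set size
    *before* removal, which is the conservative reading of the rule.
    """
    # Destroying a dead (disconnected) peer does not shrink the connected set.
    remaining = connected - 1 if target_connected else connected
    return has_majority(remaining, members)


if __name__ == "__main__":
    print(can_destroy(3, 3, True))   # healthy 3-node set, remove one
    print(can_destroy(3, 2, True))   # one already dead, remove a live one
    print(can_destroy(3, 2, False))  # one already dead, remove the dead one
    print(has_majority(2, 4))        # 4 peers partitioned 2/2
```

Under these assumptions a healthy three-manager set can give up one machine, a set that has already lost a peer can only have the dead one removed, and four peers split 2/2 leave neither half with a majority, which is the stasis case described in the proposal.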