On Mon, May 9, 2016 at 3:28 AM, Andrew Wilkins <[email protected]> wrote:
> On Sat, May 7, 2016 at 1:37 AM William Reade <[email protected]>
> wrote:
>
>> On Fri, May 6, 2016 at 5:50 PM, Eric Snow <[email protected]>
>> wrote:
>>
>>> See https://bugs.launchpad.net/juju-core/+bug/1514874.
>>
>>
> So I think this issue is fixed in 2.0, but looks like the changes never
> got backported to 1.25. From your options, we do have (the opposite of) a
> DO_NOT_UNINSTALL file (it's actually called
> "/var/lib/juju/uninstall-agent"; only if it exists do we uninstall).
>
> (And now that I think of it, we're only writing uninstall-agent for the
> manual provider's bootstrap machine, and not other manual machines, so
> we're currently leaving Juju bits behind on manual machines added to an
> environment.)
>

Except we're *also* writing it on every machine, for Very Bad Reasons,
right? So we *are* still cleaning up all machines, but there's a latent
manual provider bug that'll need addressing.

> The reason it's done at the last moment is to avoid having dangling
> database entries. If we uninstall the agent (i.e. delete /var/lib/juju,
> remove systemd/upstart), then if the agent fails before we get to
> EnsureDead, then the entity will never be removed from state.
>

The *only* thing that should happen after setting dead is the uninstall --
anything else that's required to happen before cleanup *must* happen
before setting dead, which *means* "all my responsibilities are 100%
fulfilled". The *only* justification for the post-death logic in the
manual case is because there's no responsible provisioner component to
hand over to -- and frankly I wish we'd just written that to SSH in and
clean up, instead of taking on this ongoing hassle.

> As an alternative, we could (should) only ever write the
> /var/lib/juju/uninstall-agent file from worker/machiner, first checking
> there's no assigned units, and no storage attached.
>

Why would we *ever* want to write it at runtime? We know if it's a manual
machine at provisioning time, so we can write the File Of Death OAOO. All
the other mucking about with it is the source of these (serious!) bugs.

>> Andrew, I think you had more detail last time we discussed this: is there
>> anything else in uninstall (besides loop-device stuff) that needs to run
>> *anywhere* except a manual machine? and, what will we actually need to sync
>> with in the machiner? (or, do you have alternative ideas?)
>>
>
> No, I don't think there is anything else to be done in uninstall, apart
> from loop detach and manual machine cleanup. I'm not sure about moving the
> uninstall logic to the machiner, for reasons described above. We could
> improve the current state of affairs, though, by only writing the
> uninstall-agent file from the machiner
>

Strong -1 on moving uninstall logic: if it has to happen (which it does,
in *rare* cases that are *always* detectable pre-provisioning), uninstall
is where it should happen, post-machine-death; and also strong -1 on
writing uninstall-agent in *any* circumstances except manual machine
provisioning, we have had *way* too many problems with this "clever"
feature being invoked when it shouldn't be.

> FWIW, the loop stuff can be dropped when the LXC container support is
> removed. Nobody ever added support for loop in the LXD provider, and I
> think we should implement support for it differently to how it was done for
> LXC anyway (losetup on host, expose to container; as opposed to expose all
> loop devices to all LXD containers and losetup in container).
>

+1000 to that.

So... can't we just (1) fix the manual provisioning to write the file; (2)
drop all other use of uninstall-agent; (3) drop the lxc-specific logic in
uninstall -- and then we're done?

Cheers
William
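
For illustration only, points (1) and (2) of the proposal above might look
roughly like the sketch below. It is written in Python for brevity (Juju
itself is Go), and every function and constant name in it is hypothetical;
only the marker file name comes from this thread. The assumption is that the
marker is written once, at provisioning time, for manual machines only, and
that nothing else ever writes it.

```python
import os
import tempfile

# Hypothetical sketch, not Juju's actual code. The marker name is taken from
# the thread ("/var/lib/juju/uninstall-agent"); data_dir stands in for
# /var/lib/juju so the example can run anywhere.
UNINSTALL_MARKER = "uninstall-agent"


def provision(data_dir, is_manual):
    """Write the marker once, at provisioning time, for manual machines only."""
    os.makedirs(data_dir, exist_ok=True)
    if is_manual:
        # Create an empty marker file; its existence is the whole signal.
        with open(os.path.join(data_dir, UNINSTALL_MARKER), "w"):
            pass


def should_uninstall(data_dir):
    """Checked only after the machine is Dead: uninstall iff the marker exists."""
    return os.path.exists(os.path.join(data_dir, UNINSTALL_MARKER))


def demo():
    """Provision one manual and one non-manual machine, then check both."""
    with tempfile.TemporaryDirectory() as root:
        manual_dir = os.path.join(root, "manual")
        cloud_dir = os.path.join(root, "cloud")
        provision(manual_dir, is_manual=True)
        provision(cloud_dir, is_manual=False)
        return should_uninstall(manual_dir), should_uninstall(cloud_dir)
```

With no runtime code path that writes the marker, a non-manual machine can
never be uninstalled by accident, which is the failure mode discussed in
this thread.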
--
Juju-dev mailing list
[email protected]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
