On Mon, May 9, 2016 at 9:56 AM, Andrew Wilkins <[email protected] > wrote:
> The reason it's done at the last moment is to avoid having dangling >> database entries. If we uninstall the agent (i.e. delete /var/lib/juju, >> remove systemd/upstart), then if the agent fails before we get to >> EnsureDead, then the entity will never be removed from state. >> > > The *only* thing that should happen after setting dead is the uninstall -- > anything else that's required to happen before cleanup *must* happen before > setting dead, which *means* "all my responsibilities are 100% fulfilled". > > I don't think I suggested above that we should do anything else other than uninstall? Sorry I misunderstood -- I was focused on the loop-device stuff that had grown in the uninstall logic. > The *only* justification for the post-death logic in the manual case is >> because there's no responsible provisioner component to hand over to -- and >> frankly I wish we'd just written that to SSH in and clean up, instead of >> taking on this ongoing hassle. >> > >> > As an alternative, we could (should) only ever write the >>> /var/lib/juju/uninstall-agent file from worker/machiner, first checking >>> there's no assigned units, and no storage attached. >>> >> >> Why would we *ever* want to write it at runtime? We know if it's a manual >> machine at provisioning time, so we can write the File Of Death OAOO. All >> the other mucking about with it is the source of these (serious!) bugs. >> > > The point is not to distinguish between manual vs. non-manual. Yes, we can > write something that records that fact OAOO. > > The point of "write from the machiner" was to signal that the machine is > actually dead, and removed from state, vs. "my credentials are invalid, > better shut down now". > Yeah, we definitely shouldn't return ErrTerminateAgent from apicaller on mere invalid-credentials. (No worker should return ErrTerminateAgent *at all*, really -- not their job. Having a couple of different workers that can return ErrAgentEntityDead, which can then be interpreted by the next layer? Fine :).) > So we can write a file to confine uninstall to manual machines -- that > much is easy, I don't think anyone will disagree with doing that. But we > should not ignore the bug that prompted this thread, even if it's confined > to manual machines. > Right; and, yeah, it's almost certainly better to leak manual machines than it is to uninstall them accidentally -- so long as we know that's the tradeoff we're making ;). > Andrew, I think you had more detail last time we discussed this: is there >>>> anything else in uninstall (besides loop-device stuff) that needs to run >>>> *anywhere* except a manual machine? and, what will we actually need to sync >>>> with in the machiner? (or, do you have alternative ideas?) >>>> >>> >>> No, I don't think there is anything else to be done in uninstall, apart >>> from loop detach and manual machine cleanup. I'm not sure about moving the >>> uninstall logic to the machiner, for reasons described above. We could >>> improve the current state of affairs, though, by only writing the >>> uninstall-agent file from the machiner >>> >> >> Strong -1 on moving uninstall logic: if it has to happen (which it does, >> in *rare* cases that are *always* detectable pre-provisioning), uninstall >> is where it should happen, post-machine-death; and also strong -1 on >> writing uninstall-agent in *any* circumstances except manual machine >> provisioning, we have had *way* too many problems with this "clever" >> feature being invoked when it shouldn't be. >> > > I don't want to belabour the point, but just to be clear: the > uninstall-agent file exists to record the fact that he machine is in fact > Dead, and uninstall should go ahead. That logic was put in specifically to > prevent the referenced bug. We can and should improve it to only do this > for manual machines. > Meta: a name like "uninstall-agent" is really misleading if it actually means "machine-X is definitely dead". I never had the slightest indication that was what it was meant to mean. But in that case, why *don't* we write the file on certain API connection failures? There is logic that *explicitly* checks if the machine is Dead -- surely that's a trigger too? FWIW, the loop stuff can be dropped when the LXC container support is >>> removed. Nobody ever added support for loop in the LXD provider, and I >>> think we should implement support for it differently to how it was done for >>> LXC anyway (losetup on host, expose to container; as opposed to expose all >>> loop devices to all LXD containers and losetup in container). >>> >> >> +1000 to that. So... can't we just (1) fix the manual provisioning to >> write the file; (2) drop all other use of uninstall-agent; (3) drop the >> lxc-specific logic in uninstall -- and then we're done? >> > > For first steps, I think so. But I do think we should fix > https://bugs.launchpad.net/juju-core/+bug/1514874 for manual machines as > well. So: > > (1) record (in agent config?) that a machine is manual > (2) only ever do anything uninstall-related for manual machines > (3) only ever do uninstall-related things if the machine actually is Dead > (4) drop lxc-specific logic from uninstall *when LXC support is removed* > Generally SGTM, but confused re (4) -- doesn't that have to be fixed/moved/removed first earlier, so that we can switch off the suicide behaviour in other cases without regressing? Cheers William
-- Juju-dev mailing list [email protected] Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
