See https://bugs.launchpad.net/juju-core/+bug/1514874.
When starting, the machine agent starts up several workers in a runner (more or less the same in 1.25 and 2.0) and then waits for the runner to stop. If one of the workers hits a "fatal" error, the runner stops and the machine agent handles the error. If the error is worker.ErrTerminateAgent, the machine agent uninstalls (cleans up) itself: it deletes its data dir and uninstalls its jujud and mongo init services. However, all other files are left in place* and the machine/instance is not removed from Juju (the agent stays "lost"). Notably, the unit agent handles fatal errors a bit differently, with some retry logic (e.g. for API connections), and never cleans up after itself.

The problem here is that a fatal error *may* be recoverable with manual intervention or with retries. Cleaning up like we do makes it harder to recover manually. Such errors are theoretically extremely uncommon, so we don't worry about doing anything other than stop and "uninstall". However, as seen with the above-referenced bug report, bugs can lead to fatal errors happening with enough frequency that the agent's clean-up becomes a pain point.

The history of this behavior isn't particularly clear, so I'd first like to hear what the original rationale was for cleaning up the agent when "terminating" it. Then I'd like to know whether that rationale has since changed. Finally, I'd like us to consider alternatives that allow for better recoverability.

Regarding that third part, we have a number of options. I introduced several in the above-referenced bug report and will expand on them here:

1. Do nothing.
   This would be easy :) but does not help with the pain point.

2. Be smarter about retrying (e.g. retry connecting to the API) when running into fatal errors.
   This would probably be good to do, but the effort might not pay for itself.

3. Do not clean up (the data dir, the init services, or either).
   Leaving the init services installed really isn't an option, because we don't want them to try restarting the agent over and over. Leaving the data dir in place isn't good either, because it will conflict with any new agent dir the controller tries to put on the instance in the future.

4. Add a DO_NOT_UNINSTALL or DO_NOT_CLEAN_UP flag (e.g. in the machine/model config or as a sentinel file on instances) and do not clean up if it is set to true (default?).
   This would provide a reasonable quick fix for the above bug, even if temporary and even if it defaults to false.

5. Disable (instead of uninstall) the init services and move the data dir to some obvious but out-of-the-way place (e.g. /home/ubuntu/agent-data-dir-XXMODELUUIDXX-machine-X) instead of deleting it.
   This is a reasonable longer-term solution: the concerns described for 3 are addressed and manual intervention becomes more feasible.

6. Same as 5, but also provide an "enable-juju" (or "revive-juju", "repair-juju", etc.) command *on the machine* that would re-enable the init services and restore (or rebuild) the agent's data dir.
   This would make it even easier to recover manually.

7. First try to recover automatically (from the machine or the controller).
   This would require a sizable effort for a problem that shouldn't normally happen.

8. Do what the unit agent does.
   I haven't looked closely enough to see whether this is a good fit.

I'd consider 4, with a default of false, to be an acceptable quick fix. Additionally, I'll advocate for 6 (or 5) as the most appropriate solution in support of manual recovery from "fatal" errors.

-eric

* Could this lead to collisions if the instance is re-purposed as a different machine? I suppose it could also expose sensitive data when likewise re-purposed, since the instance won't necessarily end up in the same model or controller. However, given the need for admin access, that probably isn't a likely problem.
--
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev