This write-up is related to Avocado's issue #4994, which can be found at https://github.com/avocado-framework/avocado/issues/4994 .

Intro
=====

Avocado's nrunner architecture separates the component that prepares and
starts the environment (a spawner) from the task (usually a test, but also
a test's requirements) that will be executed in it.

A problem arises when either the spawner or the task itself allocates
resources that later need to be cleaned up. Spawners may have limited
visibility into the resources created by a task (either directly or by the
underlying runner).

For instance, the ``exec-test`` runner will create a new process that will
actually run the executable test. If such a process misbehaves and hangs,
it may be left consuming CPU resources "forever". That is not ideal, and
"someone" should clean up after the misbehaving executable.

The question becomes: whose responsibility is it to keep track of and clean
up such resources? The obvious choices are:

1. The spawner that started the task

2. The runner (created by the task) that actually created the resource

The biggest problem with the first option (the spawner) is that it may have
limited (or too coarse) visibility into the resources that were actually
created by the task (or runner). The biggest problem with the second option
is that every runner will need to implement similar resource tracking and
clean up.

Using the existing spawners, and also the spawners under development, we
can give some more concrete examples:

A. The "process" spawner is able to know with good enough confidence that
   one or more processes were created, either directly (the task process)
   or by the runner within the task. It can do so because it can look for
   all child processes of the task it created.

B. The "podman" spawner is able to leverage the container technology itself
   to clean up all the resources within the container (such as the many
   processes that may have been created by the task and runner).

C. The "remote" spawner may have a much harder time identifying resources
   that were created by a task or runner it started. It may require
   multiple sessions or multiple remote command executions to query the
   child processes of the task. If the system is unresponsive because of a
   runaway task or test, that will become harder or even impossible, and
   the "remote" spawner may only be able to clean up the session it
   currently holds.

So far, the scope of this resource management discussion is limited to
processes. The situation changes completely if other resources, such as
persistent storage, are considered. The problem is not so much cleaning up
the resources themselves (such as removing a directory created by a task or
runner), but how to clean up those resources in different environments (say,
locally with the process spawner, on a different machine with the remote
spawner, etc). This is left to be discussed at a later time.

Proposal
========

For the first milestone of this work, my proposal is to:

1. Properly document the capabilities of every spawner, including their
   potential ability and strategy to destroy a task.

2. Provide the currently missing, but expected, capabilities of spawners
   with regard to resource clean up.

3. Present to the user the list of resources that the spawner could not
   guarantee were destroyed.

Documentation
-------------

Letting users know about the capabilities and caveats of each spawner is
the first step towards predictability and a more complete (future) set of
features regarding resource management.

The goal here is to let users know what they can rely on, and what they
can't. For instance, it'd be fair to document the "remote" spawner more or
less along the following lines:

"The remote spawner can properly destroy the SSH session it starts and
maintains with the remote machine, and makes a best-effort attempt to
destroy the task it started, but does **not** attempt to find and clean up
all processes that the runner started".


Missing Features
----------------

The spawner interface "destroy_task()" currently does not check whether its
attempt to clean up the task was successful. Also, child processes of a
task are not accounted for.

To improve this situation, the following can and should be done:

1. The process spawner should verify that the task process was terminated.

2. The process spawner should try harder to terminate (and then kill) the
   task's child processes.

3. The podman spawner should make sure that the container was fully
   terminated.

Accountability
--------------

By keeping track of the success (or failure) of resource clean up, it's
possible to let users know about missed resources so that they can clean
them up manually.

For instance, if the podman spawner fails to terminate a container, it
should, at the very least, log a message such as:

"PodmanSpawner failed to destroy container with ID deadbeefdeadbeef. Podman
reported error: xxxxxxxxx".

Future
======

In the future, the resource management could be expanded such that:

a. Different resources are accounted for (for example, storage)

b. "Mix-and-match" of spawners and resources is supported, such as being
   able to manage a "storage" resource on both the "process" and "remote"
   spawners.

There's no attempt at this point to determine the effort and feasibility of
any of these.
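
To make the process spawner items under "Missing Features" and the logging
idea under "Accountability" more concrete, here is a minimal sketch of a
"terminate, then kill, then verify" routine. It is an illustration only and
not Avocado's implementation: the names ``destroy_process_group`` and
``group_is_gone``, the timeouts and the log message are all hypothetical. It
also assumes a POSIX system and that the task process was started as the
leader of its own session (``start_new_session=True``), so that its child
processes can be signalled together as one process group.

```python
import logging
import os
import signal
import subprocess
import time

LOG = logging.getLogger("spawner.sketch")  # hypothetical logger name


def group_is_gone(pgid):
    """Return True if no process is left in process group `pgid`."""
    try:
        os.killpg(pgid, 0)  # signal 0 sends nothing, it only checks existence
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # something is still there, but we may not signal it
    return False


def destroy_process_group(proc, term_timeout=2.0, kill_timeout=2.0):
    """Terminate, then kill, the group led by `proc`; report the outcome.

    Assumes `proc` is a subprocess.Popen started with
    start_new_session=True, so that its PID is also its process group ID.
    """
    pgid = proc.pid
    steps = ((signal.SIGTERM, term_timeout), (signal.SIGKILL, kill_timeout))
    for sig, timeout in steps:
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            pass  # nothing left to signal; the check below confirms it
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            # poll() also reaps the task process once it has exited
            if proc.poll() is not None and group_is_gone(pgid):
                return True
            time.sleep(0.05)
    # Accountability: tell the user what could not be cleaned up
    LOG.error("Failed to destroy process group %d; manual clean up "
              "may be needed", pgid)
    return False
```

A caller would start the task with, for example,
``subprocess.Popen(cmd, start_new_session=True)`` and later call
``destroy_process_group(proc)``, treating a ``False`` result as a resource
to report to the user. Process groups are used here because they need only
the standard library; a process that moves itself to another session or
group would escape this approach, which is one of the visibility limits
described in the introduction.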
