On Thu, May 21, 2020 at 12:46:46PM -0300, Beraldo Leal wrote: > Hi Cleber, > > Thank you for opening this discussion here so we can try contributing to > this core component. > > Since that this thread is bringing up a few discussions in different > parts (although related), for this reply, I will narrow down my comments > only with suggestions on how we could organize this effort, discussions > and ideas. > > As you said, some items here could be an issue on Github (epic or not) > and some could be a blueprint for a proper review and discussion. > > So, to split this work, have better estimations and track the progress > on this effort, one suggestion is to create goals or milestones with the > issues that have a high priority and that are blocking for any other > issue. >
Hi Beraldo, As we've defined that the project goals for the next release (81.0) include both the Job API and the N(ext) Runner, it simplifies the planning because we have *one* milestone for all related tasks. > For each topic that you explained here, we could try to answer a few > questions: > > - Is this an epic issue? Or a single and quick fix issue? > - If epic, what are the sub-issues? For the N(ext) Runner, this epic issue has available: https://github.com/avocado-framework/avocado/issues/3700 > - Is this blocking any other issue? > - Is this blocked by any other issue? > - Difficult level? > - Should we open a blueprint? I think only the scheduler one is worthy of blueprint at this time. > The N(ext) Runner related issues have been labeled with "nrun2run" label: https://github.com/avocado-framework/avocado/issues?q=is%3Aissue+label%3Anrun2run > Maybe this is very clear in your mind already, but IMO would be nice to > have this also exposed to Github (not only the issues but tracking the > progress. Maybe with a 'project, milestone or kanban column). > > Following this approach, I made some comments in-line: > > On Wed, May 20, 2020 at 07:32:59PM -0400, Cleber Rosa wrote: > > Known issues and limitations of the current implementation > > ========================================================== > > > > Different Test IDs > > ------------------ > > > > When running tests with the current runner, the Test IDs are different:: > > > > $ avocado run --test-runner=runner --json=- -- /bin/true /bin/false > > /bin/uname | grep \"id\" > > "id": "1-/bin/true", > > "id": "2-/bin/false", > > "id": "3-/bin/uname", > > > > $ avocado run --test-runner=nrunner --json=- -- /bin/true /bin/false > > /bin/uname | grep \"id\" > > "id": "1-1-/bin/true", > > "id": "2-2-/bin/false", > > "id": "3-3-/bin/uname", > > > > The goal is to make the IDs the same. > > This seems to be a non-epic issue. > Right, and a non-issue anymore, given that it was resolved in version 80.0: https://github.com/avocado-framework/avocado/issues/3864 > > Inability to run Tasks other than exec, exec-test, python-unittest (and > > noop) > > ----------------------------------------------------------------------------- > > > > The current implementation of the nrunner plugin is based on the fact that > > Tasks are already present at ``test_suite`` job attribute, and that running > > Tasks can be (but shouldn't always be) a matter of iterating of the result > > of its ``run()`` method. This is part of the actual code:: > > > > for status in task.run(): > > result_dispatcher.map_method('test_progress', False) > > statuses.append(status) > > > > The problem here is that only the Python classes implemented in the core > > "avocado.core.nrunner" module, and registered at: > > > > > > https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.html#avocado.core.nrunner.RUNNERS_REGISTRY_PYTHON_CLASS > > > > The goal is to have all other Python classes that inherit from > > "avocado.core.nrunner.BaseRunner" available in such a registry. > > IIUC, we can tackle this in multiple ways, from hard coded dicts to > decorators and "register" methods, so that third-party Runners can > register itself without the need to change the "avocado core code". > Right. > At first glance, this seems to not be a "blueprint" case, so I would > like to suggest opening a GH issue and if the discussion evolves to a > complex solution, then we could think of a blueprint. Probably you > already have thought about this, so, maybe when opening the issue add > there your suggestion to tackle this. > Agreed, not blueprint worthy, and the issue is here: https://github.com/avocado-framework/avocado/issues/3865 > > Inability to run Tasks with Spawners > > ------------------------------------ > > > > While the "avocado nrun" command makes use of the Spawners, the > > current implementation of the nrunner plugin described earlier, > > calls a Task's ``run()`` method directly, and clearly doesn't > > use spawners. > > > > The goal here is to leverage spawners so that other isolation > > models (or execution environments, depending how you look at > > processes, containers, etc) are supported. > > Same as previous comment. > Agreed, issue is: https://github.com/avocado-framework/avocado/issues/3866 > > Unoptmized execution of Tasks (extra serialization/deserialization) > > ------------------------------------------------------------------- > > > > At this time, the nrunner plugin runs a Task directly through its > > ``run()`` method. Besides the earlier point of not supporting > > other isolation models/execution environments (that means not using > > spawners), there's an extra layer of work happening when running > > a task which is most often not necessary: turning a Task instance > > into a command line, and within its execution, turning it into a > > Task instance again. > > > > The goal is to support an optmized execution of the tasks, without > > having to turn them into command lines, and back into Task instances. > > The idea is already present in the spawning method definitions: > > > > > > https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.spawners.html#avocado.core.spawners.common.SpawnMethod.PYTHON_CLASS > > > > And a PoC on top of the ``nrun`` command was implemented here: > > > > > > https://github.com/avocado-framework/avocado/pull/3766/commits/ae57ee78df7f2935e40394cdfc72a34b458cdcef > > IMO, this is a core and important change, that is introducing or > changing the task 'state machine', and maybe could receive an own > blueprint just to discuss this. > I think we will not be able to put it aside, as we're implementing the scheduler. But, time will tell. Anyway, there's an issue to track it here: https://github.com/avocado-framework/avocado/issues/3867 > > Proposal > > ======== > > > > Besides the known limitations listed previously, there are others that > > will appear along the way, and certainly some new challeges as we > > solve them. > > > > The goal of this proposal is to attempt to identify those challenges, > > and lay out a plan that can be tackled by the Avocado team/community > > and not by a single person. > > > > Task execution coordination goals > > --------------------------------- > > > > As stated earlier, to run a job, tasks must be executed. Differently > > than the current runner, the N(ext) Runner architecture allows those > > to be executed in a much more decoupled way. This characteristic will > > be maintained, but it needs to be adapted into the current Job > > execution. > > > > From a high level view, the nrunner plugin needs to: > > > > 1. Break apart from the "one at a time" Task execution model that it > > currently employs; > > > > 2. Check if a Tasks can be executed, that is, if its requirements can > > be fulfilled (the most basic requirement for a task is a matching > > runner; > > > > 3. Prepare for the execution of a task, such as the fulfillment of > > extra tasks requirements. The requirements resolver is one, if not > > the only way, component that should be given a chance to act here; > > > > 4. Executes a task in prepared environment; > > > > 5. Monitor the execution of a task (from an external PoV); > > > > 6. Collect the status messages that tasks will send; > > > > a. Forward the status messages to the appropriate job components, > > such as the result plugins. > > > > b. Depending on the content of messages, such as the ones > > containing "status: started" or "status: finished", interfere in > > the Task execution status, and consequently, in the Job > > execution status. > > > > 7. Verify, warn the user, and attempt to clean up stray tasks. This > > may be for instance, necessary if a Task on a container seems to > > be stuck, and the container can not be destroyed. The same applies > > to process in some time of uninterruptile sleeps. > > Maybe this could be an epic issue? > I think so, yes. Along with also producing a blueprint. > > Suggested terminology > > --------------------- > > > > Task execution has been requested > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > A Task whose execution was requested by the user. All of the tasks on > > a Job's ``test_suite`` attribute are requested tasks. > > > > If a software component deals with this type of task, it's advisable > > that it refers to ``TASK_REQUESTED`` or ``requested_tasks`` or a > > similar name that links to this definition. > > > > Task is being triaged > > ~~~~~~~~~~~~~~~~~~~~~ > > > > The details of the task are being analyzed, including and most > > importantly the ability of the system to *attempt* to fulfill its > > requirements. A task leaves triage and it's either considered > > "discarded" or proceeds to be prepared and then executed. > > > > If a software component deals with this type of task, for instance if > > a "task scheduler" is looking for runners matching the Task's kind, it > > should keep it under a ``tasks_under_triage`` or mark the tasks as > > ``UNDER_TRIAGE`` or ``TRIAGING`` a similar name that links to this > > definition. > > > > Task is being prepared > > ~~~~~~~~~~~~~~~~~~~~~~ > > > > Task has left triage, and it has not been discarded, that is, it's > > a candidate to be setup, and if it goes well, executed. > > > > The requirements for a task are being prepared in its reespective > > isolation model/execution environment, that is, the spawner it'll > > be executed with is known, and the setup actions will be visible > > by the task. > > > > If a software component deals with this type of task, for instance the > > implementation of resolution of specific requirements, it should > > should keep it under a ``tasks_preparing`` or mark the tasks as > > ``PREPARING`` or a similar name that links to this definition. > > > > Task is ready to be started > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > Task has been prepared succesfully, and can now be executed. > > > > If a software component deals with this type of task, it should > > should keep it under a ``tasks_ready`` or mark the tasks as > > ``READY`` or a similar name that links to this definition. > > > > Task is being started > > ~~~~~~~~~~~~~~~~~~~~~ > > > > A hopefully short lived state, in which a task that is ready to be started > > (see previous point) will be given to the reespective spawner to be started. > > > > If a software component deals with this type of task, it should should > > keep it under a ``tasks_starting`` or mark the tasks as ``STARTING`` > > or a similar name that links to this definition. > > > > The spawner should know if the starting of the task succeeded or failed, > > and the task should be categorized accordingly. > > > > Task has been started > > ~~~~~~~~~~~~~~~~~~~~~ > > > > A task was successfully started by a spawner. > > > > Note that it does *not* mean that the test that the task runner (say, > > an "avocado-runner-$kind task-run" command) will run has already been > > started. This will be signalled by a "status: started" kind of > > message. > > > > If a software component deals with this type of task, it should should > > keep it under a ``tasks_started`` or mark the tasks as ``STARTED`` or > > a similar name that links to this definition. > > > > Task has failed to start > > ~~~~~~~~~~~~~~~~~~~~~~~~ > > > > Quite self explanatory. If the spawner failed to start a task, it > > should be kept under a ``tasks_failed_to_start`` structure or be > > marked as ``FAILED_TO_START`` or a similar name that links to this > > definition. > > > > Task is finished > > ~~~~~~~~~~~~~~~~ > > > > This means that the task has started, and is now finished. There's no > > associated meaning here about the pass/fail output of the test payload > > executed by the task. > > > > It should be kept under a ``tasks_finished`` structure or be marked as > > ``FINISHED`` or a similar name that links to this definition. > > > > Task has been interrupted > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > This means that the task has started, but has not finished and it's > > past due. > > > > It should be kept under a ``tasks_interrupted`` structure or be marked > > as ``INTERRUPTED`` or a similar name that links to this definition. > > > > Task workflow > > ------------- > > I might be wrong, but this section and the previous one looks very close > to a "life cycle/state machine" of a scheduler system. Something like > the HPC job scheduler life-cycle, like Slurm, Condor, and many other > systems. For sure this is a great good start. > It definitely has a lot of similarities, but it *must* be simpler than those, as we want a trivial "bootstrap" (really, meaning *no* boostrap). That is, users should continue to be able to running tests with Avocado as they've done before (avocado list ..., avocado run ...). > I do have some suggestions here, and questions about each state that you > designed, maybe I would avoid one or two states, but looks like this > should be part of a proper blueprint, maybe the same blueprint where we > can discuss the "Task Scheduler" implementation along with the "Task > Life Cycle". > Got it. If you look at my initial implementation of this idea, you'll find that it was a lot simpler than this, with less states. Then, to avoid not addressing all questions, I decided to be more verbose here. Maybe we'll meet in the middle. > > Summary > > ======= > > > > This proposal contains a number of items that can become GitHub issues > > at this stage. It also contains a general explanation of what I believe > > are the crucial missing features to make the N(ext) Runner implementation > > available to the general public. > > > > Feedback is highly appreciated, and it's expected that this document will > > evolve into a better version, and possibly become a formal Blue Print. > > +1 for the blueprint on the 'Task Scheduler/Life Cycle". ;) > > As I said before, would be nice to have this big picture that you > describe here as issues/milestones/projects/columns on GH so we can > split this effort and have a better understanding of the this progress. > Hopefully the issues and labels we have now are enough. > Again, thanks for dumping your ideas here for discussion. > > -- > Beraldo > Thanks for the feedback, and let's follow up with the blueprint discussion. - Cleber.
signature.asc
Description: PGP signature