This is a great recap/proposal of some discussions @mhrivnak and I have had before. We're pretty similar in our approach to resolving this. I also want to restate the pain point motivating it. The problem: when a worker goes offline, the tasks in its queue get cancelled, which is surprising to the user. This bug is tracked as issue #489 [0], and it's one of the biggest pain points with the tasking system. Pulp3 is also affected by this bug.

I think a mini-version of this solution would also resolve issue #489. Specifically, we could continue to use the "dedicated queue" feature of Celery, but add a new "recovery" workflow for the case where a queue with work in it is orphaned because its worker was stopped or killed, and its work has to be routed to another worker.
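A minimal sketch of what that recovery routing could look like using Celery's existing control commands (the helper functions, queue name, and broker URL below are hypothetical illustrations, not current Pulp code):

```python
# Sketch only, not Pulp code: helper names and queue/worker names are made up,
# but add_consumer/cancel_consumer are real Celery control commands.
from celery import Celery

app = Celery('pulp', broker='qpid://localhost//')  # broker URL is illustrative


def recover_orphaned_queue(orphaned_queue, healthy_worker):
    """Tell a still-running worker to also consume a dead worker's dedicated queue.

    orphaned_queue: name of the queue left behind, e.g.
        'reserved_resource_worker-1@host.dq' (hypothetical name).
    healthy_worker: node name of a worker known to be alive.
    """
    # The destination worker starts consuming from the named queue
    # without needing a restart.
    app.control.add_consumer(orphaned_queue, destination=[healthy_worker])


def release_queue(queue, worker):
    """Once the orphaned queue has been drained, stop the double-duty consumption."""
    app.control.cancel_consumer(queue, destination=[worker])
```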
Either way, though, I think we could easily agree on a plan to fix this that would work. My main question is: when?

[0]: https://pulp.plan.io/issues/489

-Brian

On Mon, Oct 30, 2017 at 6:26 PM, Michael Hrivnak <[email protected]> wrote:

> While it's on my mind, I just want to get this idea out to others for future consideration. I do not think we should necessarily make any changes to Pulp 3.0 based on this.
>
> Setup
> -------
>
> What is a Pulp worker? We tend to think of them as a process, or pair of processes in parent-child relationship, with a number from 0-7 (or a higher number if you configure Pulp as such). Each worker has a systemd unit file and a queue. We know how many should be running and monitor them. If you have multiple machines, each machine has a defined set of numbered workers.
>
> Pulp tracks each worker in the database. Why? For resource reservation. For any given resource (usually a repository), all not-complete tasks are assigned to the same worker so they go into one FIFO queue, which preserves order-of-operation. Having one worker per queue guarantees that no more than one task will run at a time for a given resource.
>
> Difficulty arises when we deal with workers going offline. What if a worker dies unexpectedly and leaves its queue behind, orphaned? How can we quiesce a worker (stop assigning it work) so it can be taken offline gracefully? In a clustered environment, such as Pulp running in Kubernetes or OpenShift, users will expect the ability to scale the number of workers up and down, and so we'll need to address these challenges. The containerized-Pulp use case helps clarify, I think, the role of workers vs. queues.
>
> Pitch
> ------
>
> Workers are stateless processes. They are a commodity that should come and go just as easily as the processes that handle http requests. The only long-term state associated with a worker is its queue, and I propose that we (eventually) stop defining a queue based on which worker created it.
>
> Today: a worker starts, creates a queue for itself, and informs Pulp it is ready to receive work in that queue.
>
> Future: a worker starts, the worker informs Pulp it is ready, and Pulp tells the worker which queues it should work from.
>
> Queues become the first-class resource in Pulp that tasks are assigned to. Pulp monitors workers to ensure that each queue is assigned to exactly one healthy worker, but it does not care as much which one.
>
> Use Cases
> --------------
>
> If a worker process dies and a new one starts up, Pulp can assign the orphaned queue to the new worker.
>
> If a worker dies (gracefully or not) and a new one does not show up, Pulp can assign the orphaned queue to another worker, which would do double-duty until one of the queues was emptied, at which point Pulp could choose to delete that queue.
>
> If a new additional worker shows up, Pulp could potentially assign it only to the general "celery" queue. Based on some policy, a new resource-reserving queue could optionally be created in the future, only if/when it was needed, and assigned to that worker.
> Pulp as a clustered app would own and manage a pool of queues. The number of queues would be influenced by user settings (maybe a min and max), how much work is being requested at any given time, and how many processes are available to do work. The cluster would manage the full lifecycle of each queue.
>
> Pulp would monitor a pool of workers who are effectively anonymous. They would have no meaningful identity from a scheduling standpoint. They come and go through outside influence, but the application would make no effort to manage their lifecycle. Pulp would only tell each worker which queues it should work from.
>
> Summary
> -----------
>
> Details aside, the important points are:
>
> - Focus on the queue as the owner of state.
> - For purposes of scheduling tasks, worker processes are anonymous.
> - Pulp manages a pool of queues, monitors a pool of workers, and assigns queues to workers as workers come and go.
>
> Thoughts? Would it help to elaborate with concrete examples? Maybe a metaphor...
>
> Black Friday
> ---------------
>
> Extending our familiar Black Friday metaphor... starting with a re-cap.
>
> Customers at a retail store are standing in one long line to check out. A traffic-cop at the head of the line tells each person which register to go to, based on some rules. (each register represents a worker's queue)
>
> This proposal is that we should think about the line at each register separately from the cashier. (the line is a queue, and the cashier is a worker process) One cashier coming on duty can take over another's register so they can go on break. If a cashier has to close their register to go on break, the cashier next-door might run back-and-forth between two registers for a while until one of the lines is empty. An entire shift of 16 fresh cashiers might show up and relieve the previous shift. (similar to migrating worker processes from one machine in a cluster to another; the queues stay the same, but they get matched with new anonymous workers)
>
> --
>
> Michael Hrivnak
>
> Principal Software Engineer, RHCE
>
> Red Hat
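To make the quoted proposal a bit more concrete, here is a rough sketch of the reconcile step Pulp might run periodically, assuming Celery's control API; the assignment table, queue names, and helper function are hypothetical rather than an existing Pulp API:

```python
# Sketch only: Pulp would persist the queue -> worker mapping in its database;
# a plain dict stands in for it here. ping() and add_consumer() are real
# Celery control commands.
from celery import Celery

app = Celery('pulp', broker='qpid://localhost//')  # broker URL is illustrative

queue_assignments = {
    'resource-queue-0': 'worker@host-a',
    'resource-queue-1': 'worker@host-b',
}


def reconcile_queue_assignments():
    """Ensure every queue in the pool is consumed by exactly one healthy worker."""
    # ping() returns a list like [{'worker@host-a': {'ok': 'pong'}}, ...]
    replies = app.control.ping(timeout=1.0) or []
    alive = {name for reply in replies for name in reply}

    for queue, worker in queue_assignments.items():
        if worker in alive or not alive:
            continue  # assigned worker is healthy, or there is nobody to fail over to
        # The assigned worker is gone: hand the queue to any healthy worker,
        # which does double duty until the queue is drained.
        new_worker = sorted(alive)[0]
        app.control.add_consumer(queue, destination=[new_worker])
        queue_assignments[queue] = new_worker
```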
_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev
