I created an RFC based on the discussion above: https://github.com/apache/couchdb-documentation/pull/617
-Nick

On Sun, Jan 17, 2021 at 12:22 AM Nick Vatamaniuc <vatam...@gmail.com> wrote:
>
> On Fri, Jan 15, 2021 at 6:41 PM Joan Touzet <woh...@apache.org> wrote:
> >
> > On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> > > You're absolutely right. There are two places where job priority
> > > comes into play: one is when jobs are started, and the other is when
> > > they are stopped. When starting jobs, both continuous and one-shot
> > > (normal) jobs have the same priority. However, when jobs are stopped,
> > > we try to stop only the continuous jobs. The only time we'd stop
> > > one-shot jobs is when the operator manually lowers the max_jobs
> > > config limit. This means there is an imbalance between the starting
> > > and stopping algorithms, and eventually a large number of one-shot
> > > replications could completely crowd out and replace all the
> > > continuous ones. Continuous ones would periodically stop, and if the
> > > new ones started are one-shot, those would slowly accumulate over
> > > time.
> >
> > I do think this is a concern, but I also think it can be addressed
> > separately from this proposal. I don't have a good answer re: fixing it.
> >
> > > The idea is to still guarantee fairness with, hopefully, an
> > > easy-to-understand algorithm. For the 10 different replicator dbs we
> > > still round-robin among them and pick the job with the lowest start
> > > time from each in turn, then do the same thing in the next cycle,
> > > until we have enough jobs picked out. For example, if we had this
> > > setup:
> > >
> > >   db1: j4
> > >   db2: j3, j5
> > >   db3: j1, j2
> > >
> > > where the index of each job is its start time (j1 was started before
> > > j2), we'd pick j1, j3, j4 during the first round-robin selection
> > > loop, and then we'd be left with:
> > >
> > >   db1:
> > >   db2: j5
> > >   db3: j2
> > >
> > > Next we'd pick j2 and j5, and so on. Overall we'd get j1, j3, j4, j2,
> > > j5, and the dbs they came from would be db3, db2, db1, db3, db2. In
> > > case db1 had 98 jobs and db2 just 2:
> > >
> > >   db1: j1, j2, j3 - j41, j43 - j99
> > >   db2: j42, j100
> > >
> > > the algorithm would pick j1, j42, j2, j100, j3, j4, j5, j6 - j99, and
> > > the dbs they'd come from would be db1, db2, db1, db2, db1, db1, db1,
> > > ... to make sure all the dbs get a fair chance to run their jobs. If
> > > there is only one db, the algorithm doesn't change from the current
> > > one. However, one thing worrying me here is that this implies either
> > > a lot of sorting or keeping a scheduler queue per db, so things could
> > > end up being quite expensive performance-wise.
> >
> > So this doesn't really let us do any sort of priority queue, which is
> > unfortunate. You can take your higher priority tasks and move them
> > outside of the db that has all of your background tasks in it, but if
> > you have > max_jobs/num_replicator_dbs in that separate db, some will
> > still get ignored. Fair share would provide a way to ensure starvation
> > doesn't occur.
>
> I think my idea would just be a simpler version of fair share where the
> shares are always equal amongst the dbs, and without the need to
> explicitly handle the concept of a "share". Jobs would just start and
> stop based on a db round-robin schedule, then based on start times. So
> the few jobs in a separate db would get a higher priority compared to
> the priority of a db with lots of jobs in it. But as you said, that
> would not be an explicit priority mechanism, so it might be awkward to
> use it to boost priorities.
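To make the selection order in the example above concrete, here is a rough
Python sketch (the function name and the job representation are made up
purely for illustration; this is not the actual replicator code):

    def select_jobs(dbs, max_jobs):
        # dbs maps a replicator db name to its pending jobs; each list is
        # already sorted by start time, and the numbers stand in for the
        # jobs' start times (j1 -> 1, j2 -> 2, ...).
        queues = {db: list(jobs) for db, jobs in dbs.items()}
        picked = []
        while len(picked) < max_jobs and any(queues.values()):
            # One round-robin pass: visit every db that still has pending
            # jobs, ordered by the start time of its oldest job, and take
            # that oldest job from each.
            pending = [db for db in queues if queues[db]]
            for db in sorted(pending, key=lambda d: queues[d][0]):
                if len(picked) < max_jobs:
                    picked.append((db, queues[db].pop(0)))
        return picked

    print(select_jobs({"db1": [4], "db2": [3, 5], "db3": [1, 2]}, max_jobs=5))
    # -> [('db3', 1), ('db2', 3), ('db1', 4), ('db3', 2), ('db2', 5)]

With a single db this degenerates to plain lowest-start-time-first, which
matches the "doesn't change from the current one" observation above.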
> > (Also consider the situation where num_replicator_dbs > max_jobs...)
>
> Was this in regards to taking the higher priority tasks and moving them
> outside of the db? The original idea should handle the case where
> num_replicator_dbs > max_jobs: during each scheduling cycle, the dbs
> with the lowest start times at the front of their queues would get
> picked first. It would mean that some of the dbs don't get scheduled
> during a given scheduling cycle and might have to wait for the next one,
> or the one after, etc., but that's unavoidable anyway. The key is that
> it would still be fair across multiple scheduling cycles.
>
> > > > Would a fair-share scheduler [1] make sense here? We can't borrow
> > > > many of the OS-level scheduling ideas because we don't have the
> > > > same insight into the runtime characteristics of our jobs, but
> > > > fair share might be a good match.
> > > >
> > > > Here's one way it could be implemented:
> > > >
> > > > * Each replicator DB could represent a logical grouping of jobs,
> > > >   e.g. background continuous backup jobs, one-shot migration jobs,
> > > >   high-priority user requests, etc.
> > > >
> > > > * An INI-level config could determine how these groups are handled:
> > > >
> > > >   * 100% fair (default) -- total available job slots are divided by
> > > >     the number of replicator DBs, then allocated round-robin within
> > > >     that DB. The difference from today's setup is that each DB gets
> > > >     its own round-robin scheduler. I *think* this might be what Nick
> > > >     was suggesting, but I can't be 100% sure.
> > >
> > > Yeah, I think that's what I had in mind too, just with an equal
> > > priority between them. I like the idea of being able to tweak
> > > priorities for replicator dbs. Thanks for the reference to the OS
> > > scheduler and the paper! I looked at the Multi-Level Feedback Queue
> > > scheduler http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf
> > > which was used in Linux previously and is still used by some
> > > operating systems for CPU scheduling. The reason I looked at it is
> > > that initially I contemplated static user-configurable priorities,
> > > but realized we'd need something like the Multi-Level Feedback Queue
> > > to dynamically tweak job priority and avoid the case where some jobs
> > > were permanently starved.
> >
> > Yeah, I did a full review of MLFQ and the Linux schedulers before
> > suggesting Fair Share... I think the design goal of "is easier to
> > intuit" is an important one here, and I found it easier to reason about
> > my variant than yours. But I'm not attached to my idea; let's do what
> > is the easiest to explain and meets the other design goals.
>
> That makes sense. Thank you for the help with the design. I like the
> fair share idea and will give that a try first.
>
> > > > [Fair share worked example]
> > >
> > > I think this might work, but it could also starve lower priority
> > > jobs. I was thinking of the case where we have a high/_replicator:99
> > > db which is fed newer jobs continuously, and a low/_replicator:1 db
> > > which never gets a chance to run because its share gets rounded down
> > > to 0. One way to avoid that would be to ensure we pick at least one
> > > job from each db and then apply the proportional shares, or just
> > > make 1 the minimum when rounding down...
> >
> > I would consider that user error in configuration, which I guess is
> > similar to the corner case in your proposal of having > max_jobs
> > replicator DBs.
> >
> > But yeah, max(1, <formula previously presented>) solves that problem
> > nicely too.
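To illustrate the share-to-slots arithmetic being discussed, here is a
rough sketch of how proportional shares plus a minimum-of-one guard could
translate into job slots (the helper name and the share values are just
examples, not the actual formula from the worked example):

    def slots_per_db(max_jobs, shares):
        # Split max_jobs across replicator dbs in proportion to their
        # configured shares, but never let a db round down to zero slots.
        total = sum(shares.values())
        return {db: max(1, (max_jobs * share) // total)
                for db, share in shares.items()}

    # Equal shares ("100% fair"): 25 job slots across 3 replicator dbs.
    print(slots_per_db(25, {"db1": 1, "db2": 1, "db3": 1}))
    # -> {'db1': 8, 'db2': 8, 'db3': 8}

    # The 99/1 case: without the max(1, ...) guard the low-priority db
    # would round down to 0 slots and be permanently starved.
    print(slots_per_db(20, {"high/_replicator": 99, "low/_replicator": 1}))
    # -> {'high/_replicator': 19, 'low/_replicator': 1}

Note that with more replicator dbs than max_jobs the minimum-of-one guard
over-commits, which is the corner case mentioned above.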
> > > The paper and your proposal present an interesting way of
> > > "materializing" the proportional shares as an explicit quantity; the
> > > current scheme and mine don't have that part (there is no
> > > "scheduler_share" variable, we just pick the jobs in order of start
> > > time). But I like the idea and it's worth giving it a try. It might
> > > be one of those things where, once we start playing with the code,
> > > some things become more obvious and easy to do and some turn out to
> > > be too complicated.
> >
> > Sure, let's try it!
> >
> > > Thank you, Joan, for the great suggestions and for taking a look!
> >
> > My pleasure!
> >
> > If you need a use case that is hard to reason about, consider a
> > resource-constrained cluster (where the # of replications vastly
> > outstrips the resources available) where you need to ensure fairness
> > across a very large number of continuous and one-shot replications.
> > Within these, the one-shot replications are considered higher
> > priority, but they should not completely drown out the continuous
> > ones.
>
> This is a tricky one, yeah. We've done this by design, but it has its
> issues. An interesting thing to think about with fair-share
> configurability: if a user assigned all the one-shot replications to a
> low/_replicator:10 db, and then, after those jobs are started, added
> lots of continuous replications to their default _replicator db, would
> we break the "snapshotting" promise and stop the one-shot replications
> to make room for the now higher-priority continuous ones, or do we keep
> them running? I think we should probably stick to keeping them running,
> and just rely on them starting at a slower rate, since their priority
> is :10 vs say :90 in the default db...
>
> > I can give more detail on the example if it sounds interesting to you
> > - this was a client system I worked on for a while (after it was too
> > late to improve the design) and we were never able to get CouchDB's
> > replication scheduler to do the right thing. Everything got replaced
> > by an external scheduler that knew what max_jobs was and did
> > everything as one-shot replications.
>
> Ah, that's sad to hear. Sounds like this scheduling improvement is long
> overdue.
>
> > -Joan
>
> -Nick
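P.S. On the one-shot vs. continuous question above, the "keep the
one-shots running" behaviour would look roughly like this when picking
jobs to stop (a sketch only; the record fields are hypothetical):

    def jobs_to_stop(running, excess):
        # Only continuous jobs are candidates for stopping; one-shot jobs
        # keep running once started and only compete for slots, at their
        # db's rate, while waiting to be started.
        continuous = [job for job in running if job["continuous"]]
        # Rotate out the continuous jobs that have been running longest.
        continuous.sort(key=lambda job: job["started_at"])
        return continuous[:excess]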