On Fri, Jan 15, 2021 at 6:41 PM Joan Touzet <woh...@apache.org> wrote:
>
> On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> > You're absolutely right. There are two places where job priority comes into play: one is when jobs are started, and the other is when they are stopped. When starting jobs, both continuous and one-shot (normal) jobs are at the same priority. However, when jobs are stopped, we try to stop only the continuous jobs. The only time we'd stop one-shot jobs is when the operator manually lowers the max_jobs config limit. This means there is an imbalance between the starting and stopping algorithms, and eventually a large number of one-shot replications could completely crowd out and replace all the continuous ones. Continuous ones would periodically stop, and if the new ones started are one-shot, those would slowly accumulate over time.
>
> I do think this is a concern, but I also think it can be addressed separately from this proposal. I don't have a good answer re: fixing it.
>
> > The idea is to still guarantee fairness with a hopefully easy-to-understand algorithm. For the 10 different replicator dbs we'd still round-robin among them, pick the job with the lowest start time from each in turn, then do the same thing in the next cycle, until we have enough jobs picked out. So for example if we had this setup:
> >
> > db1: j4
> > db2: j3, j5
> > db3: j1, j2
> >
> > where the index of the jobs is their start time (j1 was last started before j2), we'd pick j1, j3, j4 during the first round-robin selection loop, and we'd be left with:
> >
> > db1:
> > db2: j5
> > db3: j2
> >
> > Next we'd pick j2 and j5, and so on. So overall we'd get j1, j3, j4, j2, j5. Looking at the dbs they came from, that's db3, db2, db1, db3, db2. In case db1 had 98 jobs and db2 just 2:
> >
> > db1: j1, j2, j3 - j41, j43 - j99
> > db2: j42, j100
> >
> > the algorithm would pick j1, j42, j2, j100, j3, j4, j5, j6 - j99. The dbs they'd come from would be db1, db2, db1, db2, db1, db1, db1, ... to make sure all the dbs get a fair chance to run their jobs. If there is only one db, the algorithm doesn't change from the current one. However, one thing worrying me here is that this implies either a lot of sorting or keeping a scheduler queue per db, so things could end up being quite expensive performance-wise.
>
> So this doesn't really let us do any sort of priority queue, which is unfortunate. You can take your higher priority tasks and move them outside of the db that has all of your background tasks in it, but if you have more than max_jobs/num_replicator_dbs jobs in that separate db, some will still get ignored. Fair share would provide a way to ensure starvation doesn't occur.

I think my idea would just be a simpler version of fair share where the shares are always equal amongst the dbs, without the need to explicitly handle the concept of a "share". Jobs would just start and stop based on a db round-robin schedule, then based on start times. So the few jobs in a separate db would get a higher priority than those in a db with lots of jobs in it. But as you said, that would not be an explicit priority mechanism, so it might be awkward to use it to boost priorities.
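To make sure the selection I'm describing is clear, here's a rough sketch of the loop I have in mind. It's only illustrative Python-ish pseudo-code (the real scheduler is Erlang, of course), and the data shapes are made up: jobs are just (start_time, id) tuples grouped per replicator db.

    def pick_jobs(jobs_by_db, max_jobs):
        # jobs_by_db maps each replicator db to its pending jobs as
        # (start_time, id) tuples, sorted oldest-first. We round-robin
        # across the dbs, taking the oldest job from each db per pass,
        # until max_jobs jobs have been picked.
        pending = {db: list(jobs) for db, jobs in jobs_by_db.items() if jobs}
        picked = []
        while pending and len(picked) < max_jobs:
            # One pass: visit each db once, in order of its oldest pending job.
            for db in sorted(pending, key=lambda d: pending[d][0]):
                picked.append(pending[db].pop(0))
                if len(picked) == max_jobs:
                    break
            pending = {db: jobs for db, jobs in pending.items() if jobs}
        return picked

With the db1/db2/db3 example above this yields j1, j3, j4, j2, j5 (from db3, db2, db1, db3, db2), which is the order I described. The performance worry shows up right in the sketch: it's either a sort per pass or a pre-sorted queue per db.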
> (Also consider the situation where num_replicator_dbs > max_jobs...)

Was this in regard to taking the higher priority tasks and moving them outside of the db? The original idea would handle the case where num_replicator_dbs > max_jobs: during each scheduling cycle, the dbs whose front jobs have the lowest start times would get picked first. It does mean some dbs won't get scheduled during a given cycle and may have to wait for the next one or the one after, but that's unavoidable anyway. The key is that it would still be fair across multiple scheduling cycles.

> >> Would a fair-share scheduler [1] make sense here? We can't borrow many of the OS-level scheduling ideas because we don't have the same insight into the runtime characteristics of our jobs, but fair share might be a good match.
> >>
> >> Here's one way it could be implemented:
> >>
> >> * Each replicator DB could represent a logical grouping of jobs, e.g. background continuous backup jobs, one-shot migration jobs, high-priority user requests, etc.
> >>
> >> * An INI-level config could determine how these groups are handled:
> >>
> >>   * 100% fair (default) -- total available job slots are divided by the number of replicator DBs, then allocated round-robin within that DB. The difference from today's setup is that each DB gets its own round-robin scheduler. I *think* this might be what Nick was suggesting, but I can't be 100% sure.
> >
> > Yeah, I think that's what I had in mind too, but with an equal priority between them. I like the idea of being able to tweak priorities for replicator dbs. Thanks for the reference to the OS scheduler and the paper! I looked at the multi-level feedback queue (MLFQ) scheduler (http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf), which was used in Linux previously and is still used by some operating systems for CPU scheduling. The reason I looked at it is that initially I contemplated static user-configurable priorities, but realized we'd need something like MLFQ to dynamically tweak job priorities to avoid some jobs being permanently starved.
>
> Yeah, I did a full review of mlfq and the Linux schedulers before suggesting Fair Share... I think the design goal of "is easier to intuit" is an important one here, and I found it easier to reason about my variant than yours. But I'm not attached to my idea; let's do what is the easiest to explain and meets the other design goals.

That makes sense. Thank you for the help with the design. I like the fair share idea and will give that a try first.
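As a very rough first cut, and just to check that I'm reading the proposal the way you intended, the per-db slot allocation I'm picturing looks something like the sketch below. Again, this is only illustrative Python; the shares dict, function name and remainder handling are all made up:

    def allot_slots(shares_by_db, max_jobs):
        # Split the max_jobs slots across replicator dbs in proportion to
        # their configured shares; within each db, jobs would then be
        # started round-robin by start time as before.
        total = sum(shares_by_db.values())
        slots = {db: (max_jobs * share) // total
                 for db, share in shares_by_db.items()}
        # Hand slots lost to integer rounding back to the largest shares.
        leftover = max_jobs - sum(slots.values())
        for db in sorted(shares_by_db, key=shares_by_db.get, reverse=True):
            if leftover <= 0:
                break
            slots[db] += 1
            leftover -= 1
        return slots

So with max_jobs = 500 and shares of 90 and 10 for two dbs, that comes out to 450 and 50 slots, and the "100% fair" default is just every db getting an equal share.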
> >> [Fair share worked example]
> >
> > I think this might work, but it could also starve lower priority jobs. I was thinking of the case where we have a high/_replicator:99 db which is fed new jobs continuously, and a low/_replicator:1 db which never gets a chance to run because its share gets rounded down to 0. I think one way to avoid that would be to ensure we pick at least one job from each db and then apply the proportional shares, or just make it round down to 1 as a minimum...
>
> I would consider that user error in configuration, which I guess is similar to the corner case in your proposal of having more than max_jobs replicator DBs.
>
> But yeah, max(1, <formula previously presented>) solves that problem nicely too.
>
> > The paper and your proposal present an interesting way of "materializing" the proportional shares as an explicit quantity; the current scheme and mine don't have that part (there is no "scheduler_share" variable, we just pick the jobs in order of start time). But I like the idea and it's worth giving it a try. It might be one of those things where, once we start playing with the code, some parts become more obvious and easy to do and others turn out to be too complicated.
>
> Sure, let's try it!
>
> > Thank you, Joan, for the great suggestions and for taking a look!
>
> My pleasure!
>
> If you need a use case that is hard to reason about, consider a resource-constrained cluster (where the # of replications vastly outstrips the resources available) and you need to ensure fairness across a very large number of continuous and one-shot replications. Within these, the one-shot replications are considered higher priority, but they should not completely drown out the continuous ones.

This is a tricky one, yeah. We've done this by design, but it has its issues. An interesting thing to think about with fair share configurability: if a user assigned all the one-shot replications to a low/_replicator:10 db, and then, after those jobs were started, added lots of continuous replications to their default _replicator db, would we break the "snapshotting" promise and stop the one-shot replications to make room for the now higher priority continuous ones, or would we keep them running? I think we should probably stick to keeping them running and just rely on them starting at a slower rate, since their priority is :10 vs, say, :90 in the default db...

> I can give more detail on the example if it sounds interesting to you - this was a client system I worked on for a while (after it was too late to improve the design) and we never were able to get CouchDB's replication scheduler to do the right thing. Everything got replaced by an external scheduler that knew what max_jobs was, and did everything as one-shot replications.

Ah, that's sad to hear. Sounds like this scheduling improvement is long overdue.

> -Joan

-Nick
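P.S. For the rounding-down-to-zero case, I'll probably start with something as blunt as clamping each db's allotment, along the lines of the sketch below (made-up names again, same illustrative Python as above), and then decide where to shave off the excess if the clamped total goes over max_jobs:

    def allot_slots_min_one(shares_by_db, max_jobs):
        # Same proportional split as before, but every db gets at least
        # one slot, so a low/_replicator:1 db sitting next to a
        # high/_replicator:99 db is never rounded down to zero.
        total = sum(shares_by_db.values())
        return {db: max(1, (max_jobs * share) // total)
                for db, share in shares_by_db.items()}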