On Thu, Jan 14, 2021 at 4:51 PM Joan Touzet <woh...@apache.org> wrote:
>
> On 2021-01-12 12:55 p.m., Nick Vatamaniuc wrote:
> > Hello everyone
>
> Hi (Dr.) Nick! ;)
"One quick application of Spiffy and we got ourselves a better scheduler!" I had to look up the Simpson reference as I had completely forgotten about him :-) > > I wanted to see what we thought about adding a scheduling improvement > > to the replicator. Specifically adding round-robin fairness amongst > > different _replicator dbs > > We've needed some sort of priority system for a while -- critically, in > some cases, our failure to offer something has forced my hand into > recommending use of external replication scheduling techniques. I'd > never thought of using multiple databases as a way of handling this > problem, though. > > > Currently, the scheduler runs all the jobs in the system fairly. It > > does it by using the jobs' "last started" timestamp to select which > > jobs to stop and start next. So each scheduling cycle, the jobs which > > ran the longest (have the lowest start time) get stopped, then, from > > the list of pending jobs we also pick those with the lowest start > > times (since they waited the longest) to run next. > > There is also some difference between continuous and one-shot > replication jobs, isn't there? I seem to recall one-shot jobs get > priority over continuous ones. You're absolutely right. There are two places where job priority comes into play: one is when jobs are started, and then the other is when they are stopped. When starting jobs, both continuous and one-shot (normal) jobs are at the same priority. However when jobs are stopped, we try to stop only the continuous jobs. The only time we'd stop one-shot jobs is when the operator manually lowers the max_jobs config limit. This means there is an imbalance between the starting and stopping algorithms and eventually a large number of one-shot replications could completely crowd out and replace all the continuous ones. Continuous ones would periodically stop, and if the new ones started are one-shot, those would slowly accumulate over time. > > However, this algorithm can be unfair among replication jobs from > > different _replicator dbs. For example, if max_jobs=10 and one > > _replicator db has a 1000 jobs and another has only one doc then, > > that one job might have to wait for quite a while to get its turn to > > run. If we picked fairly amongst _replicator dbs, then for example, > > the one job would get at least one of the 10 max_jobs run "slots", > > and the rest of the 9 slots would go to the replicator db with 1000 > > docs. If there would be 11 _replicator dbs, then a docs from the ones > > with the lowest start times amongst them would be picked first. > > That does sound unfair. I wasn't aware of this "gotcha" with multiple > replicator DBs (in the field I rarely see >1 replicator DB) but it > intuitively makes sense - all replicator DBs are unioned in the > scheduler's queue and treated like a single, big DB. (There is basically > no advantage to multiple _replicator DBs today then, other than perhaps > security / obscurity reasons, is there?) Exactly, they are unioned without any regard to how they ended up in the scheduler. Today we could use multiple db to group replication jobs for security or for management, like you suggested. You can delete a whole _replicator db and stop a whole group of jobs at once. Or you could replicate the _replicator dbs themselves around even! 
> > This feature would also allow running some quick replication jobs when there is already a full queue of jobs from the main _replicator db, by creating a "quickjobs/_replicator" db and inserting the new jobs there. With this new scheme they would start running right away, even if the queue is full with jobs from the main _replicator db. Another, related idea I had was to add per-job user-configurable priorities: high/default/low or numeric. However, that scheme is not as good, as it could lead to permanent starvation of jobs, while the round-robin db scheduling scheme still guarantees all jobs will eventually run.
>
> I think this proposal can be improved. I like the idea of leveraging multiple replicator DBs to provide some sort of priority tweak, but I am not sure I can predict what will happen in the case of e.g. 10 different replicator DBs fighting with each other in the scheduler as you've laid it out.

The idea is to still guarantee fairness, with a hopefully easy-to-understand algorithm. For the 10 different replicator dbs we'd still round-robin among them, and pick the jobs with the lowest start time from each in turn, then do the same thing next cycle, until we have enough jobs picked out. So, for example, if we had this setup:

  db1: j4
  db2: j3, j5
  db3: j1, j2

where the index of a job is its start time (j1 was last started before j2), we'd pick j1, j3, j4 during the first round-robin selection loop, and then be left with:

  db1:
  db2: j5
  db3: j2

Next we'd pick j2 and j5, and so on. So overall we'd get j1, j3, j4, j2, j5. Looking at the dbs they came from, it would be: db3, db2, db1, db3, db2.

In case db1 had 98 jobs and db2 just 2:

  db1: j1, j2, j3 - j41, j43 - j99
  db2: j42, j100

the algorithm would pick: j1, j42, j2, j100, j3, j4, j5, j6 - j99. The dbs they'd come from would be db1, db2, db1, db2, db1, db1, db1, ... to make sure all the dbs get a fair chance to run their jobs. If there is only one db, the algorithm doesn't change from the current one. However, one thing worrying me here is that this implies either a lot of sorting or keeping a scheduler queue per db, so things could end up being quite expensive performance-wise.

> Would a fair-share scheduler [1] make sense here? We can't borrow many of the OS-level scheduling ideas because we don't have the same insight into the runtime characteristics of our jobs, but fair share might be a good match.
>
> Here's one way it could be implemented:
>
> * Each replicator DB could represent a logical grouping of jobs, e.g. background continuous backup jobs, one-shot migration jobs, high-priority user requests, etc.
>
> * An INI-level config could determine how these groups are handled:
>
>   * 100% fair (default) -- total available job slots are divided by the number of replicator DBs, then allocated round-robin within that DB. The difference from today's setup is that each DB gets its own round-robin scheduler. I *think* this might be what Nick was suggesting, but I can't be 100% sure.

Yeah, I think that's what I had in mind too, just with an equal priority between them. I like the idea of being able to tweak priorities for replicator dbs. Thanks for the reference to the OS scheduler and the paper! I looked at the Multi-Level Feedback Queue (MLFQ) scheduler http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf which was used in Linux previously and is still used by some operating systems for CPU scheduling. The reason I looked at it is that initially I contemplated static user-configurable priorities, but realized we'd need something like the MLFQ scheduler to dynamically tweak job priority to avoid the case where some jobs were permanently starved.
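Coming back to the round-robin selection described above, here is a rough sketch of what I'm imagining (illustrative Python rather than the replicator's actual Erlang; the data is the toy example from above):

```python
from collections import deque

def round_robin_pick(dbs, max_jobs):
    """Pick up to max_jobs jobs: each pass takes the pending job with the
    lowest start time from every _replicator db that still has jobs, and
    orders the picks within the pass by start time.

    `dbs` maps a db name to a list of (start_time, job_id) tuples.
    """
    queues = {db: deque(sorted(jobs)) for db, jobs in dbs.items()}
    picked = []
    while len(picked) < max_jobs and any(queues.values()):
        # One round-robin pass over all dbs that still have pending jobs.
        this_pass = sorted(q.popleft() for q in queues.values() if q)
        for _start_time, job in this_pass:
            if len(picked) < max_jobs:
                picked.append(job)
    return picked

# The toy example from above; the index in the job name is its start time.
dbs = {"db1": [(4, "j4")],
       "db2": [(3, "j3"), (5, "j5")],
       "db3": [(1, "j1"), (2, "j2")]}
print(round_robin_pick(dbs, 10))  # ['j1', 'j3', 'j4', 'j2', 'j5']
```

This is only the selection side, of course; the per-db queues here are exactly the extra sorting/bookkeeping cost I mentioned being worried about.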
>   * Specific priority -- A list of DBs and their fixed percentages. Example: _replicator:30,high/_replicator:50,low/_replicator:20
>
>     If a configured database is not present, or has no active jobs in it, its fixed proportion of cycles is "given up" to the other replicator DBs.
>
> * Jobs are then allocated based on these breakdowns, and we continue to use round-robin within each database to maintain the current approach. This means no visible change for upgrading users with a single _replicator DB.
>
> Example, with the numbers picked to make the math turn out nice:
>
> ```
> [replicator]
> db_priorities = _replicator:30,high/_replicator:50,low/_replicator:20
> max_jobs = 100
> ```
>
> The % allocation of jobs across DBs is, by default:
>
>   _replicator:      30 / (30+20+50) = 30%
>   high/_replicator: 50 / (30+20+50) = 50%
>   low/_replicator:  20 / (30+20+50) = 20%
>
> First, put 100 jobs into _replicator. Because the other two DBs have no jobs, all 100 free job slots are allocated to its jobs:
>
>   _replicator: 30 / (30) = 100% * 100 jobs = 100
>
> Now, we add another 100 jobs into the high DB. Since low is empty, we recalculate the distribution ratios:
>
>   _replicator:      30 / (30+50) = 37.5% * 100 jobs max = 37 jobs
>   high/_replicator: 50 / (30+50) = 62.5% * 100 jobs max = 62 jobs
>
> So 62 _replicator jobs are halted, and 62 high/_replicator jobs are started. (The 'lost' fractional job could be dynamically given to either DB as desired.)
>
> Finally, we add a further 100 jobs into the low DB:
>
>   _replicator:      30 / (30+20+50) = 30% * 100 jobs max = 30 jobs
>   high/_replicator: 50 / (30+20+50) = 50% * 100 jobs max = 50 jobs
>   low/_replicator:  20 / (30+20+50) = 20% * 100 jobs max = 20 jobs
>
> Now the allocation exactly matches the chosen percentages.
>
> Have I missed any starvation scenarios?

I think this might work, but it could also starve lower priority jobs. I was thinking of the case where we have a high/_replicator:99 db which is continuously fed new jobs, and a low/_replicator:1 db which never gets a chance to run if its share gets rounded down to 0. I think one way to avoid that would be to ensure we pick at least one job from each db and then apply the proportional shares, or just make 1 the minimum when rounding down (there's a rough sketch of what I mean below, after the quoted text). The paper and your proposal present an interesting way of "materializing" the proportional shares as an explicit quantity; the current scheme and mine don't have that part (there is no "scheduler_share" variable, we just pick the jobs in order of start time). But I like the idea and it's worth giving it a try. It might be one of those things where, once we start playing with the code, some parts become more obvious and easy to do and some too complicated.

> > Would this feature be useful to have? I was thinking of giving it a try on a branch. I suspect implementing this for 3.x might be a tad simpler, since scheduling is done in memory independently on each separate node, so I was thinking of starting there first. For main (future 4.x) we might require some coordination state to live in FDB and would possibly have to patch up couch_jobs to know about priorities.
>
> It does sound worth investigating, and on 3.x to start.
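Here is the sketch I mentioned of applying the proportional shares with a minimum of one slot per active db (illustrative Python; allocate_slots, its arguments and the weights dicts are made up for the example, not existing code or config handling):

```python
def allocate_slots(weights, active_dbs, max_jobs):
    """Split max_jobs slots across the replicator dbs that currently have
    pending jobs, proportionally to their configured weights, but never
    round an active db down to zero slots."""
    active = {db: weights[db] for db in active_dbs}
    total = sum(active.values())
    # At least one slot per active db, the rest proportionally by weight
    # (ignoring the corner case of more active dbs than slots).
    slots = {db: max(1, w * max_jobs // total) for db, w in active.items()}
    # Hand any leftover slots (from rounding down) to the largest weights.
    leftover = max_jobs - sum(slots.values())
    for db, _w in sorted(active.items(), key=lambda kv: -kv[1]):
        if leftover <= 0:
            break
        slots[db] += 1
        leftover -= 1
    return slots

weights = {"_replicator": 30, "high/_replicator": 50, "low/_replicator": 20}

# The middle step of your example: only _replicator and high have jobs.
print(allocate_slots(weights, ["_replicator", "high/_replicator"], 100))
# -> {'_replicator': 37, 'high/_replicator': 63}  (the "lost" job goes to high)

# The starvation case: a 99/1 split with only 10 slots still runs low's job.
weights2 = {"high/_replicator": 99, "low/_replicator": 1}
print(allocate_slots(weights2, ["high/_replicator", "low/_replicator"], 10))
# -> {'high/_replicator': 9, 'low/_replicator': 1}
```

Something along those lines would keep your percentages for the common case while still guaranteeing every db with pending jobs makes at least some progress.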
> > Cheers,
> > -Nick
>
> [1] The original 1988 paper by Judy Kay and Piers Lauder is a good, short read. I especially like the Design Goals as listed on page 52 (9).
> https://proteusmaster.urcf.drexel.edu/urcfwiki/images/KayLauderFairShare.pdf

Thank you, Joan, for the great suggestions and for taking a look!