Re: [DISCUSS] Replicator scheduler improvement
I created an RFC based on the discussion above: https://github.com/apache/couchdb-documentation/pull/617

-Nick

On Sun, Jan 17, 2021 at 12:22 AM Nick Vatamaniuc wrote:
> On Fri, Jan 15, 2021 at 6:41 PM Joan Touzet wrote:
> > On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> > > You're absolutely right. There are two places where job priority comes into play: one is when jobs are started, and the other is when they are stopped. When starting jobs, both continuous and one-shot (normal) jobs are at the same priority. However, when jobs are stopped, we try to stop only the continuous jobs. The only time we'd stop one-shot jobs is when the operator manually lowers the max_jobs config limit. This means there is an imbalance between the starting and stopping algorithms, and eventually a large number of one-shot replications could completely crowd out and replace all the continuous ones. Continuous ones would periodically stop, and if the new ones started are one-shot, those would slowly accumulate over time.
> >
> > I do think this is a concern, but I also think it can be addressed separately from this proposal. I don't have a good answer re: fixing it.
> >
> > > The idea is to still guarantee fairness with a hopefully easy-to-understand algorithm. For the 10 different replicator dbs we still round-robin among them, and pick the jobs with the lowest start time from each in turn, then do the same thing next cycle, until we have enough jobs picked out. So for example if we had this setup:
> > >
> > > db1: j4
> > > db2: j3, j5
> > > db3: j1, j2
> > >
> > > Where the index for the jobs is the start time (j1 was last started before j2), we'd pick them as j1, j3, j4 during the first round-robin selection loop, then we'd be left with
> > >
> > > db1:
> > > db2: j5
> > > db3: j2
> > >
> > > Next we'd pick j2 and j5 and so on. So overall we'd get j1, j3, j4, j2, j5. If we look at the dbs they came from it would be: db3, db2, db1, db3, db2. In case db1 had 98 jobs and db2 just 2:
> > >
> > > db1: j1, j2, j3 - j41, j43 - j99
> > > db2: j42, j100
> > >
> > > The algorithm would pick: j1, j42, j2, j100, j3, j4, j5, j6 - j99. The dbs they'd come from would be db1, db2, db1, db2, db1, db1, db1, ... to make sure all the dbs get a fair chance to run their jobs. If there is one db only, the algorithm doesn't change from the current one. However, one thing worrying me here is that this implies either a lot of sorting or keeping a scheduler queue per db, so things could end up being quite expensive performance-wise.
> >
> > So this doesn't really let us do any sort of priority queue, which is unfortunate. You can take your higher priority tasks and move them outside of the db that has all of your background tasks in it, but if you have > max_jobs/num_replicator_dbs in that separate db, some will still get ignored. Fair share would provide a way to ensure starvation doesn't occur.
>
> I think my idea would just be a simpler version of fair share where the shares are always equal amongst the dbs, and without the need to explicitly handle the concept of a "share". Jobs would just start and stop based on a db round-robin schedule, then based on start times. So the few jobs in a separate db would get a higher priority compared to the priority of a db with lots of jobs in it. But as you said, that would not be an explicit priority mechanism, so it might be awkward to use it to boost priorities.
>
> > (Also consider the situation where num_replicator_dbs > max_jobs...)
>
> Was this in regards to taking the higher priority tasks and moving them outside of the db? In the original idea we should handle the case where num_replicator_dbs > max_jobs: during each scheduling cycle, the dbs with jobs with lower start times at the front would get picked first. But it would mean that some of the dbs don't get scheduled during a scheduling cycle; they might have to wait for the next one or the one after, etc. But that's unavoidable anyway. The key is that it would still be fair across multiple scheduling cycles.
>
> > >> Would a fair-share scheduler [1] make sense here? We can't borrow many of the OS-level scheduling ideas because we don't have the same insight into the runtime characteristics of our jobs, but fair share might be a good match.
> > >>
> > >> Here's one way it could be implemented:
> > >>
> > >> * Each replicator DB could represent a logical grouping of jobs, e.g. background continuous backup jobs, one-shot migration jobs, high-priority user requests, etc.
> > >>
> > >> * An INI-level config could determine how these groups are handled:
> > >>
> > >>   * 100% fair (default) -- total available job slots are divided by the number of replicator DBs, then allocated round-robin within that DB.
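For concreteness, the per-db round-robin selection described in this message can be expressed as a toy Python sketch. This is illustrative only, not the actual couch_replicator scheduler code; the job tuples, the `select_jobs` name, and the ordering of each pass's picks by start time are assumptions inferred from the worked example in the thread:

```python
from collections import deque

def select_jobs(dbs, max_jobs):
    # One queue per replicator db; jobs sorted so the job with the
    # lowest start time (longest-waiting) sits at the front.
    queues = {name: deque(sorted(jobs)) for name, jobs in dbs.items()}
    picked = []
    while queues and len(picked) < max_jobs:
        # One round-robin pass: take the head job from every db,
        # then order this pass's picks by start time.
        batch = sorted(queues[name].popleft() for name in list(queues))
        for name in [n for n, q in queues.items() if not q]:
            del queues[name]  # drop dbs with no jobs left
        picked.extend(batch)
    return [job for _, job in picked][:max_jobs]

# The worked example from the thread; the first tuple element is the
# job's start time (j1 started before j2, and so on).
dbs = {
    "db1": [(4, "j4")],
    "db2": [(3, "j3"), (5, "j5")],
    "db3": [(1, "j1"), (2, "j2")],
}
print(select_jobs(dbs, 5))  # ['j1', 'j3', 'j4', 'j2', 'j5']
```

With a single db this degenerates to plain lowest-start-time-first, matching the observation above that the algorithm doesn't change when only one _replicator db exists; the per-db queues also make the worried-about cost visible (one sorted queue per db).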
Re: [DISCUSS] Replicator scheduler improvement
On Fri, Jan 15, 2021 at 6:41 PM Joan Touzet wrote:
> On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> > You're absolutely right. There are two places where job priority comes into play: one is when jobs are started, and the other is when they are stopped. When starting jobs, both continuous and one-shot (normal) jobs are at the same priority. However, when jobs are stopped, we try to stop only the continuous jobs. The only time we'd stop one-shot jobs is when the operator manually lowers the max_jobs config limit. This means there is an imbalance between the starting and stopping algorithms, and eventually a large number of one-shot replications could completely crowd out and replace all the continuous ones. Continuous ones would periodically stop, and if the new ones started are one-shot, those would slowly accumulate over time.
>
> I do think this is a concern, but I also think it can be addressed separately from this proposal. I don't have a good answer re: fixing it.
>
> > The idea is to still guarantee fairness with a hopefully easy-to-understand algorithm. For the 10 different replicator dbs we still round-robin among them, and pick the jobs with the lowest start time from each in turn, then do the same thing next cycle, until we have enough jobs picked out. So for example if we had this setup:
> >
> > db1: j4
> > db2: j3, j5
> > db3: j1, j2
> >
> > Where the index for the jobs is the start time (j1 was last started before j2), we'd pick them as j1, j3, j4 during the first round-robin selection loop, then we'd be left with
> >
> > db1:
> > db2: j5
> > db3: j2
> >
> > Next we'd pick j2 and j5 and so on. So overall we'd get j1, j3, j4, j2, j5. If we look at the dbs they came from it would be: db3, db2, db1, db3, db2. In case db1 had 98 jobs and db2 just 2:
> >
> > db1: j1, j2, j3 - j41, j43 - j99
> > db2: j42, j100
> >
> > The algorithm would pick: j1, j42, j2, j100, j3, j4, j5, j6 - j99. The dbs they'd come from would be db1, db2, db1, db2, db1, db1, db1, ... to make sure all the dbs get a fair chance to run their jobs. If there is one db only, the algorithm doesn't change from the current one. However, one thing worrying me here is that this implies either a lot of sorting or keeping a scheduler queue per db, so things could end up being quite expensive performance-wise.
>
> So this doesn't really let us do any sort of priority queue, which is unfortunate. You can take your higher priority tasks and move them outside of the db that has all of your background tasks in it, but if you have > max_jobs/num_replicator_dbs in that separate db, some will still get ignored. Fair share would provide a way to ensure starvation doesn't occur.

I think my idea would just be a simpler version of fair share where the shares are always equal amongst the dbs, and without the need to explicitly handle the concept of a "share". Jobs would just start and stop based on a db round-robin schedule, then based on start times. So the few jobs in a separate db would get a higher priority compared to the priority of a db with lots of jobs in it. But as you said, that would not be an explicit priority mechanism, so it might be awkward to use it to boost priorities.

> (Also consider the situation where num_replicator_dbs > max_jobs...)

Was this in regards to taking the higher priority tasks and moving them outside of the db? In the original idea we should handle the case where num_replicator_dbs > max_jobs: during each scheduling cycle, the dbs with jobs with lower start times at the front would get picked first. But it would mean that some of the dbs don't get scheduled during a scheduling cycle; they might have to wait for the next one or the one after, etc. But that's unavoidable anyway. The key is that it would still be fair across multiple scheduling cycles.

> >> Would a fair-share scheduler [1] make sense here? We can't borrow many of the OS-level scheduling ideas because we don't have the same insight into the runtime characteristics of our jobs, but fair share might be a good match.
> >>
> >> Here's one way it could be implemented:
> >>
> >> * Each replicator DB could represent a logical grouping of jobs, e.g. background continuous backup jobs, one-shot migration jobs, high-priority user requests, etc.
> >>
> >> * An INI-level config could determine how these groups are handled:
> >>
> >>   * 100% fair (default) -- total available job slots are divided by the number of replicator DBs, then allocated round-robin within that DB. The difference from today's setup is that each DB gets its own round-robin scheduler. I *think* this might be what Nick was suggesting, but I can't be 100% sure.
> >>
> > Yeah, I think that's what I had in mind too, but just having an equal priority between them. I like the idea of being able to tweak priorities for replicator dbs.
Re: [DISCUSS] Replicator scheduler improvement
On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> You're absolutely right. There are two places where job priority comes into play: one is when jobs are started, and the other is when they are stopped. When starting jobs, both continuous and one-shot (normal) jobs are at the same priority. However, when jobs are stopped, we try to stop only the continuous jobs. The only time we'd stop one-shot jobs is when the operator manually lowers the max_jobs config limit. This means there is an imbalance between the starting and stopping algorithms, and eventually a large number of one-shot replications could completely crowd out and replace all the continuous ones. Continuous ones would periodically stop, and if the new ones started are one-shot, those would slowly accumulate over time.

I do think this is a concern, but I also think it can be addressed separately from this proposal. I don't have a good answer re: fixing it.

> The idea is to still guarantee fairness with a hopefully easy-to-understand algorithm. For the 10 different replicator dbs we still round-robin among them, and pick the jobs with the lowest start time from each in turn, then do the same thing next cycle, until we have enough jobs picked out. So for example if we had this setup:
>
> db1: j4
> db2: j3, j5
> db3: j1, j2
>
> Where the index for the jobs is the start time (j1 was last started before j2), we'd pick them as j1, j3, j4 during the first round-robin selection loop, then we'd be left with
>
> db1:
> db2: j5
> db3: j2
>
> Next we'd pick j2 and j5 and so on. So overall we'd get j1, j3, j4, j2, j5. If we look at the dbs they came from it would be: db3, db2, db1, db3, db2. In case db1 had 98 jobs and db2 just 2:
>
> db1: j1, j2, j3 - j41, j43 - j99
> db2: j42, j100
>
> The algorithm would pick: j1, j42, j2, j100, j3, j4, j5, j6 - j99. The dbs they'd come from would be db1, db2, db1, db2, db1, db1, db1, ... to make sure all the dbs get a fair chance to run their jobs. If there is one db only, the algorithm doesn't change from the current one. However, one thing worrying me here is that this implies either a lot of sorting or keeping a scheduler queue per db, so things could end up being quite expensive performance-wise.

So this doesn't really let us do any sort of priority queue, which is unfortunate. You can take your higher priority tasks and move them outside of the db that has all of your background tasks in it, but if you have > max_jobs/num_replicator_dbs in that separate db, some will still get ignored. Fair share would provide a way to ensure starvation doesn't occur.

(Also consider the situation where num_replicator_dbs > max_jobs...)

>> Would a fair-share scheduler [1] make sense here? We can't borrow many of the OS-level scheduling ideas because we don't have the same insight into the runtime characteristics of our jobs, but fair share might be a good match.
>>
>> Here's one way it could be implemented:
>>
>> * Each replicator DB could represent a logical grouping of jobs, e.g. background continuous backup jobs, one-shot migration jobs, high-priority user requests, etc.
>>
>> * An INI-level config could determine how these groups are handled:
>>
>>   * 100% fair (default) -- total available job slots are divided by the number of replicator DBs, then allocated round-robin within that DB. The difference from today's setup is that each DB gets its own round-robin scheduler. I *think* this might be what Nick was suggesting, but I can't be 100% sure.
>>
> Yeah, I think that's what I had in mind too, but just having an equal priority between them. I like the idea of being able to tweak priorities for replicator dbs. Thanks for the reference to the OS scheduler and the paper! I looked at the Multi-Level Feedback Queue (MLFQ) scheduler http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf which was used in Linux previously and is still used by some operating systems for CPU scheduling. The reason I looked at it is that initially I contemplated static user-configurable priorities, but realized we'd need something like the MLFQ scheduler to dynamically tweak job priority to avoid the case where some were permanently starved.

Yeah, I did a full review of MLFQ and the Linux schedulers before suggesting Fair Share... I think the design goal of "is easier to intuit" is an important one here, and I found it easier to reason about my variant than yours. But I'm not attached to my idea; let's do what is the easiest to explain and meets the other design goals.

>> [Fair share worked example]

> I think this might work, but could also starve lower priority jobs. I was thinking of when we have a high/_replicator:99 db which is fed newer jobs continuously, and the low/_replicator:1 which never gets a chance to run if its share gets rounded down to 0. I think one way to avoid that would be to ensure we pick at least one job
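The "pick at least one job" guard mentioned at the end of that message could be sketched roughly as follows. This is a hypothetical illustration, not anything from the CouchDB codebase: the `allocate_with_floor` helper and its oversubscription trimming are assumptions; the point is only that a floor of one slot keeps a tiny share (low/_replicator:1 next to high/_replicator:99) from rounding down to zero and being starved:

```python
def allocate_with_floor(shares, max_jobs):
    # Proportional split of max_jobs by configured share, with a
    # floor of one slot per db so tiny shares are never starved.
    total = sum(shares.values())
    slots = {db: max(1, max_jobs * s // total) for db, s in shares.items()}
    # The floor can oversubscribe; take slots back from the
    # largest allocations until we fit within max_jobs again.
    while sum(slots.values()) > max_jobs:
        biggest = max(slots, key=slots.get)
        slots[biggest] -= 1
    return slots

# high/_replicator:99 vs low/_replicator:1 from the message above:
# without the floor, low's allocation rounds down to 0 slots.
print(allocate_with_floor({"high/_replicator": 99, "low/_replicator": 1}, 10))
# {'high/_replicator': 9, 'low/_replicator': 1}
```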
Re: [DISCUSS] Replicator scheduler improvement
On Thu, Jan 14, 2021 at 4:51 PM Joan Touzet wrote:
> On 2021-01-12 12:55 p.m., Nick Vatamaniuc wrote:
> > Hello everyone
>
> Hi (Dr.) Nick! ;)

"One quick application of Spiffy and we got ourselves a better scheduler!" I had to look up the Simpsons reference as I had completely forgotten about him :-)

> > I wanted to see what we thought about adding a scheduling improvement to the replicator. Specifically adding round-robin fairness amongst different _replicator dbs
>
> We've needed some sort of priority system for a while -- critically, in some cases, our failure to offer something has forced my hand into recommending use of external replication scheduling techniques. I'd never thought of using multiple databases as a way of handling this problem, though.
>
> > Currently, the scheduler runs all the jobs in the system fairly. It does it by using the jobs' "last started" timestamp to select which jobs to stop and start next. So each scheduling cycle, the jobs which ran the longest (have the lowest start time) get stopped, then, from the list of pending jobs, we also pick those with the lowest start times (since they waited the longest) to run next.
>
> There is also some difference between continuous and one-shot replication jobs, isn't there? I seem to recall one-shot jobs get priority over continuous ones.

You're absolutely right. There are two places where job priority comes into play: one is when jobs are started, and the other is when they are stopped. When starting jobs, both continuous and one-shot (normal) jobs are at the same priority. However, when jobs are stopped, we try to stop only the continuous jobs. The only time we'd stop one-shot jobs is when the operator manually lowers the max_jobs config limit. This means there is an imbalance between the starting and stopping algorithms, and eventually a large number of one-shot replications could completely crowd out and replace all the continuous ones. Continuous ones would periodically stop, and if the new ones started are one-shot, those would slowly accumulate over time.

> > However, this algorithm can be unfair among replication jobs from different _replicator dbs. For example, if max_jobs=10 and one _replicator db has 1000 jobs and another has only one doc, then that one job might have to wait for quite a while to get its turn to run. If we picked fairly amongst _replicator dbs, then, for example, the one job would get at least one of the 10 max_jobs run "slots", and the remaining 9 slots would go to the replicator db with 1000 docs. If there were 11 _replicator dbs, then docs from the ones with the lowest start times amongst them would be picked first.
>
> That does sound unfair. I wasn't aware of this "gotcha" with multiple replicator DBs (in the field I rarely see >1 replicator DB) but it intuitively makes sense - all replicator DBs are unioned in the scheduler's queue and treated like a single, big DB. (There is basically no advantage to multiple _replicator DBs today then, other than perhaps security / obscurity reasons, is there?)

Exactly, they are unioned without any regard to how they ended up in the scheduler. Today we could use multiple dbs to group replication jobs for security or for management, like you suggested. You can delete a whole _replicator db and stop a whole group of jobs at once. Or you could even replicate the _replicator dbs themselves around!

> > This feature would also allow running some quick replication jobs when there is already a full queue of main _replicator db jobs, by creating a "quickjobs/_replicator" db and inserting these new jobs there. With this new scheme they would start running right away even if the queue is full with jobs from the main _replicator db. Another, related idea I had was to add per-job user-configurable priorities: high/default/low or numeric. However, that scheme is not as good, as it could lead to permanent starvation of jobs, while the round-robin db scheduling scheme still guarantees all jobs will eventually run.
>
> I think this proposal can be improved. I like the idea of leveraging multiple replicator DBs to provide some sort of priority tweak, but I am not sure I can predict what will happen in the case of e.g. 10 different replicator DBs fighting with each other in the scheduler as you've laid it out.

The idea is to still guarantee fairness with a hopefully easy-to-understand algorithm. For the 10 different replicator dbs we still round-robin among them, and pick the jobs with the lowest start time from each in turn, then do the same thing next cycle, until we have enough jobs picked out. So for example if we had this setup:

db1: j4
db2: j3, j5
db3: j1, j2

Where the index for the jobs is the start time (j1 was last started before j2), we'd pick them as j1, j3, j4 during the first round-robin selection loop, then we'd be left with

db1:
db2: j5
db3: j2
Re: [DISCUSS] Replicator scheduler improvement
On 2021-01-12 12:55 p.m., Nick Vatamaniuc wrote:
> Hello everyone

Hi (Dr.) Nick! ;)

> I wanted to see what we thought about adding a scheduling improvement to the replicator. Specifically adding round-robin fairness amongst different _replicator dbs

We've needed some sort of priority system for a while -- critically, in some cases, our failure to offer something has forced my hand into recommending use of external replication scheduling techniques. I'd never thought of using multiple databases as a way of handling this problem, though.

> Currently, the scheduler runs all the jobs in the system fairly. It does it by using the jobs' "last started" timestamp to select which jobs to stop and start next. So each scheduling cycle, the jobs which ran the longest (have the lowest start time) get stopped, then, from the list of pending jobs, we also pick those with the lowest start times (since they waited the longest) to run next.

There is also some difference between continuous and one-shot replication jobs, isn't there? I seem to recall one-shot jobs get priority over continuous ones.

> However, this algorithm can be unfair among replication jobs from different _replicator dbs. For example, if max_jobs=10 and one _replicator db has 1000 jobs and another has only one doc, then that one job might have to wait for quite a while to get its turn to run. If we picked fairly amongst _replicator dbs, then, for example, the one job would get at least one of the 10 max_jobs run "slots", and the remaining 9 slots would go to the replicator db with 1000 docs. If there were 11 _replicator dbs, then docs from the ones with the lowest start times amongst them would be picked first.

That does sound unfair. I wasn't aware of this "gotcha" with multiple replicator DBs (in the field I rarely see >1 replicator DB) but it intuitively makes sense - all replicator DBs are unioned in the scheduler's queue and treated like a single, big DB. (There is basically no advantage to multiple _replicator DBs today then, other than perhaps security / obscurity reasons, is there?)

> This feature would also allow running some quick replication jobs when there is already a full queue of main _replicator db jobs, by creating a "quickjobs/_replicator" db and inserting these new jobs there. With this new scheme they would start running right away even if the queue is full with jobs from the main _replicator db. Another, related idea I had was to add per-job user-configurable priorities: high/default/low or numeric. However, that scheme is not as good, as it could lead to permanent starvation of jobs, while the round-robin db scheduling scheme still guarantees all jobs will eventually run.

I think this proposal can be improved. I like the idea of leveraging multiple replicator DBs to provide some sort of priority tweak, but I am not sure I can predict what will happen in the case of e.g. 10 different replicator DBs fighting with each other in the scheduler as you've laid it out.

Would a fair-share scheduler [1] make sense here? We can't borrow many of the OS-level scheduling ideas because we don't have the same insight into the runtime characteristics of our jobs, but fair share might be a good match.

Here's one way it could be implemented:

* Each replicator DB could represent a logical grouping of jobs, e.g. background continuous backup jobs, one-shot migration jobs, high-priority user requests, etc.

* An INI-level config could determine how these groups are handled:

  * 100% fair (default) -- total available job slots are divided by the number of replicator DBs, then allocated round-robin within that DB. The difference from today's setup is that each DB gets its own round-robin scheduler. I *think* this might be what Nick was suggesting, but I can't be 100% sure.

  * Specific priority -- A list of DBs and their fixed percentages. Example: _replicator:30,high/_replicator:50,low/_replicator:20. If a configured database is not present, or has no active jobs in it, its fixed proportion of cycles is "given up" to the other replicator DBs.

* Jobs are then allocated based on these breakdowns, and we continue to use round-robin within each database to maintain the current approach. This means no visible change for upgrading users with a single _replicator DB.

Example, with the numbers picked to make the math turn out nice:

```
[replicator]
db_priorities = _replicator:30,high/_replicator:50,low/_replicator:20
max_jobs = 100
```

The % allocation of jobs across DBs is, by default:

_replicator:      30 / (30+50+20) = 30%
high/_replicator: 50 / (30+50+20) = 50%
low/_replicator:  20 / (30+50+20) = 20%

First, put 100 jobs into _replicator. Because the other two DBs have no jobs, all 100 free slots are allocated to those jobs:

_replicator: 30 / (30) = 100% * 100 jobs = 100

Now, we add another 100 jobs into the high DB. Since low is empty, we recalculate the dist
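The share redistribution described in that message can be modeled with a small sketch. This is a toy under assumed names (`allocate_slots` is not a real CouchDB function); fractional slots are simply floored, so remainder handling, and the rounding-to-zero starvation case raised later in the thread, are left out:

```python
def allocate_slots(priorities, job_counts, max_jobs):
    # Dbs with no pending jobs "give up" their share: only dbs
    # that currently have jobs take part in the division.
    active = {db: s for db, s in priorities.items()
              if job_counts.get(db, 0) > 0}
    total = sum(active.values())
    # A db never gets more slots than it has jobs; fractions floor.
    return {db: min(max_jobs * s // total, job_counts[db])
            for db, s in active.items()}

priorities = {"_replicator": 30, "high/_replicator": 50,
              "low/_replicator": 20}

# 100 jobs in _replicator only: it inherits all 100 slots.
print(allocate_slots(priorities, {"_replicator": 100}, 100))
# {'_replicator': 100}

# 100 jobs each in _replicator and high; low is empty, so its 20
# shares are given up: 30/80 and 50/80 of the 100 slots.
print(allocate_slots(priorities,
                     {"_replicator": 100, "high/_replicator": 100}, 100))
# {'_replicator': 37, 'high/_replicator': 62}
```

One slot is left over in the second case because of flooring; a real design would need an explicit rule for such remainders.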
[DISCUSS] Replicator scheduler improvement
Hello everyone,

I wanted to see what we thought about adding a scheduling improvement to the replicator. Specifically, adding round-robin fairness amongst different _replicator dbs.

Currently, the scheduler runs all the jobs in the system fairly. It does it by using the jobs' "last started" timestamp to select which jobs to stop and start next. So each scheduling cycle, the jobs which ran the longest (have the lowest start time) get stopped, then, from the list of pending jobs, we also pick those with the lowest start times (since they waited the longest) to run next.

However, this algorithm can be unfair among replication jobs from different _replicator dbs. For example, if max_jobs=10 and one _replicator db has 1000 jobs and another has only one doc, then that one job might have to wait for quite a while to get its turn to run. If we picked fairly amongst _replicator dbs, then, for example, the one job would get at least one of the 10 max_jobs run "slots", and the remaining 9 slots would go to the replicator db with 1000 docs. If there were 11 _replicator dbs, then docs from the ones with the lowest start times amongst them would be picked first.

This feature would also allow running some quick replication jobs when there is already a full queue of main _replicator db jobs, by creating a "quickjobs/_replicator" db and inserting these new jobs there. With this new scheme they would start running right away even if the queue is full with jobs from the main _replicator db. Another, related idea I had was to add per-job user-configurable priorities: high/default/low or numeric. However, that scheme is not as good, as it could lead to permanent starvation of jobs, while the round-robin db scheduling scheme still guarantees all jobs will eventually run.

Would this feature be useful to have? I was thinking of giving it a try on a branch. I suspect implementing this for 3.x might be a tad simpler, since scheduling is done in memory independently on each separate node, so I was thinking of starting there first. For main (future 4.x) we might require some coordination state to live in FDB and would have to possibly patch up couch_jobs to know about priorities.

Cheers,
-Nick
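The current behavior described at the top of this message (stop the jobs that ran the longest, start the jobs that waited the longest, both selected by lowest start time) can be captured in a toy sketch. This is illustrative only; the real scheduler also refreshes the started jobs' timestamps and handles crashes and backoff, all omitted here:

```python
def scheduling_cycle(running, pending, churn):
    # Jobs are (last_started, job_id) tuples. Stop the `churn`
    # running jobs that ran the longest (lowest start time), then
    # start the `churn` pending jobs that waited the longest
    # (also lowest start time).
    running, pending = sorted(running), sorted(pending)
    stopped, kept = running[:churn], running[churn:]
    started, waiting = pending[:churn], pending[churn:]
    # Stopped jobs go back to the pending list for a later cycle.
    return kept + started, waiting + stopped

running = [(10, "a"), (30, "b"), (20, "c")]
pending = [(5, "x"), (15, "y")]
new_running, new_pending = scheduling_cycle(running, pending, churn=1)
print([j for _, j in new_running])  # ['c', 'b', 'x']
print([j for _, j in new_pending])  # ['y', 'a']
```

The unfairness discussed in this thread follows directly from this shape: pending jobs from all _replicator dbs compete in one sorted list, so a db with 1000 docs dominates a db with a single doc.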