Re: [DISCUSS] Replicator scheduler improvement

2021-01-26 Thread Nick Vatamaniuc
I created an RFC based on the discussion above:
https://github.com/apache/couchdb-documentation/pull/617

-Nick

On Sun, Jan 17, 2021 at 12:22 AM Nick Vatamaniuc  wrote:
>
> On Fri, Jan 15, 2021 at 6:41 PM Joan Touzet  wrote:
> >
> > On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> > > You're absolutely right. There are two places where job priority comes
> > > into play: one is when jobs are started, and then the other is when
> > > they are stopped. When starting jobs, both continuous and one-shot
> > > (normal) jobs are at the same priority. However when jobs are stopped,
> > > we try to stop only the continuous jobs. The only time we'd stop
> > > one-shot jobs is when the operator manually lowers the max_jobs config
> > > limit. This means there is an imbalance between the starting and
> > > stopping algorithms and eventually a large number of one-shot
> > > replications could completely crowd out and replace all the continuous
> > > ones. Continuous ones would periodically stop, and if the new ones
> > > started are one-shot, those would slowly accumulate over time.
> >
> > I do think this is a concern, but I also think it can be addressed
> > separately from this proposal. I don't have a good answer re: fixing it.
> >
> > > The idea is to still guarantee fairness with hopefully an easy to
> > > understand algorithm. For the 10 different replicator dbs we still
> > > round-robin among them, and pick the jobs with the lowest start time
> > > from each in turn, then, do the same thing next cycle, until we have
> > > enough jobs picked out. So for example if we had this setup:
> > >
> > > db1: j4
> > > db2: j3, j5
> > > db3: j1, j2
> > >
> > > Where the index for the jobs is the start time (j1 was last started
> > > before j2), we'd pick them as j1, j3, j4 during the first round-robin
> > > selection loop, then we'd be left with
> > >
> > > db1 :
> > > db2 : j5
> > > db3 : j2
> > >
> > > Next we'd pick j2 and j5 and so on. So overall we'd get
> > > j1,j3,j4,j2,j5. If we look at the dbs they came from it would be:
> > > db3,db2,db1,db3,db2. In case db1 had 98 jobs and db2 just 2:
> > >
> > > db1: j1, j2, j3 - j41, j43 - j99
> > > db2: j42, j100
> > >
> > > The algorithm would pick: j1, j42, j2, j100, j3, j4, j5, j6 - j99. The
> > > dbs they'd come from would be db1,db2,db1,db2,db1,db1,db1,... to make
> > > sure all the dbs get a fair chance to run their jobs. If there is one
> > > db only, the algorithm doesn't change from the current one. However,
> > > one thing worrying me here is this implies either a lot of sorting or
> > > keeping a scheduler queue per db, so things could end up being quite
> > > expensive performance-wise.
> >
> > So this doesn't really let us do any sort of priority queue, which is
> > unfortunate. You can take your higher priority tasks and move them
> > outside of the db that has all of your background tasks in it, but if
> > you have > max_jobs/num_replicator_dbs in that separate db, some will
> > still get ignored. Fair share would provide a way to ensure starvation
> > doesn't occur.
> >
>
> I think my idea would just be a simpler version of fair share where
> the shares are always equal amongst the dbs, and without the need to
> explicitly handle the concept of a "share". Jobs would just start and
> stop based on a db round-robin schedule, then based on start times. So
> the few jobs in a separate db would get a higher priority compared to
> the priority of a db with lots of jobs in it. But as you said, that
> would not be an explicit priority mechanism, so it might be awkward to
> use for boosting priorities.
>
> > (Also consider the situation where num_replicator_dbs > max_jobs...)
> >
>
> Was this in regard to taking the higher priority tasks and moving
> them outside of the db? In the original idea we should handle the case
> where num_replicator_dbs > max_jobs: during each scheduling cycle, the
> dbs whose front jobs have the lowest start times would get picked
> first. But it would mean that some of the dbs won't get scheduled
> during a given cycle; they might have to wait for the next one or
> the one after, etc. That's unavoidable anyway. The key is that it
> would still be fair across multiple scheduling cycles.
>
> > >> Would a fair-share scheduler [1] make sense here? We can't borrow many
> > >> of the OS-level scheduling ideas because we don't have the same insight
> > >> into the runtime characteristics of our jobs, but fair share might be a
> > >> good match.
> > >>
> > >> Here's one way it could be implemented:
> > >>
> > >> * Each replicator DB could represent a logical grouping of jobs, e.g.
> > >>background continuous backup jobs, one-shot migration jobs,
> > >>high-priority user requests, etc.
> > >>
> > >> * An INI-level config could determine how these groups are handled:
> > >>
> > >>* 100% fair (default) -- total available job slots are divided by
> > >>  the number of replicator DBs, then allocated round-robin within
> > >>  th

Re: [DISCUSS] Replicator scheduler improvement

2021-01-16 Thread Nick Vatamaniuc
On Fri, Jan 15, 2021 at 6:41 PM Joan Touzet  wrote:
>
> On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> > You're absolutely right. There are two places where job priority comes
> > into play: one is when jobs are started, and then the other is when
> > they are stopped. When starting jobs, both continuous and one-shot
> > (normal) jobs are at the same priority. However when jobs are stopped,
> > we try to stop only the continuous jobs. The only time we'd stop
> > one-shot jobs is when the operator manually lowers the max_jobs config
> > limit. This means there is an imbalance between the starting and
> > stopping algorithms and eventually a large number of one-shot
> > replications could completely crowd out and replace all the continuous
> > ones. Continuous ones would periodically stop, and if the new ones
> > started are one-shot, those would slowly accumulate over time.
>
> I do think this is a concern, but I also think it can be addressed
> separately from this proposal. I don't have a good answer re: fixing it.
>
> > The idea is to still guarantee fairness with hopefully an easy to
> > understand algorithm. For the 10 different replicator dbs we still
> > round-robin among them, and pick the jobs with the lowest start time
> > from each in turn, then, do the same thing next cycle, until we have
> > enough jobs picked out. So for example if we had this setup:
> >
> > db1: j4
> > db2: j3, j5
> > db3: j1, j2
> >
> > Where the index for the jobs is the start time (j1 was last started
> > before j2), we'd pick them as j1, j3, j4 during the first round-robin
> > selection loop, then we'd be left with
> >
> > db1 :
> > db2 : j5
> > db3 : j2
> >
> > Next we'd pick j2 and j5 and so on. So overall we'd get
> > j1,j3,j4,j2,j5. If we look at the dbs they came from it would be:
> > db3,db2,db1,db3,db2. In case db1 had 98 jobs and db2 just 2:
> >
> > db1: j1, j2, j3 - j41, j43 - j99
> > db2: j42, j100
> >
> > The algorithm would pick: j1, j42, j2, j100, j3, j4, j5, j6 - j99. The
> > dbs they'd come from would be db1,db2,db1,db2,db1,db1,db1,... to make
> > sure all the dbs get a fair chance to run their jobs. If there is one
> > db only, the algorithm doesn't change from the current one. However,
> > one thing worrying me here is this implies either a lot of sorting or
> > keeping a scheduler queue per db, so things could end up being quite
> > expensive performance-wise.
>
> So this doesn't really let us do any sort of priority queue, which is
> unfortunate. You can take your higher priority tasks and move them
> outside of the db that has all of your background tasks in it, but if
> you have > max_jobs/num_replicator_dbs in that separate db, some will
> still get ignored. Fair share would provide a way to ensure starvation
> doesn't occur.
>

I think my idea would just be a simpler version of fair share where
the shares are always equal amongst the dbs, and without the need to
explicitly handle the concept of a "share". Jobs would just start and
stop based on a db round-robin schedule, then based on start times. So
the few jobs in a separate db would get a higher priority compared to
the priority of a db with lots of jobs in it. But as you said, that
would not be an explicit priority mechanism, so it might be awkward to
use for boosting priorities.
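
To make that concrete, here is a rough sketch of the picking step in
Python (illustrative only -- the real scheduler is Erlang, and the data
layout and helper names here are made up):

```
from collections import defaultdict
import heapq

def pick_jobs(pending, max_jobs):
    """Pick up to max_jobs jobs to run, round-robin over replicator dbs.

    `pending` is a list of (db, start_time, job_id) tuples. Within each
    round-robin pass, the db whose oldest pending job has waited the
    longest (lowest start_time) is visited first.
    """
    # One min-heap of (start_time, job_id) per db.
    by_db = defaultdict(list)
    for db, start_time, job_id in pending:
        heapq.heappush(by_db[db], (start_time, job_id))

    picked = []
    while by_db and len(picked) < max_jobs:
        # One pass: take the oldest job from every db that still has
        # pending jobs, visiting dbs with older front jobs first.
        for db in sorted(by_db, key=lambda d: by_db[d][0][0]):
            if len(picked) >= max_jobs:
                break
            _, job_id = heapq.heappop(by_db[db])
            picked.append(job_id)
            if not by_db[db]:
                del by_db[db]
    return picked

# The example from earlier in the thread: db1 has j4, db2 has j3 and j5,
# db3 has j1 and j2, where the number is the start-time order.
pending = [("db1", 4, "j4"), ("db2", 3, "j3"), ("db2", 5, "j5"),
           ("db3", 1, "j1"), ("db3", 2, "j2")]
print(pick_jobs(pending, 5))  # -> ['j1', 'j3', 'j4', 'j2', 'j5']
```

With a single _replicator db this reduces to the current lowest-start-time
behavior, and a db that loses out in one cycle keeps its low start times,
so it moves to the front of a later cycle.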

> (Also consider the situation where num_replicator_dbs > max_jobs...)
>

Was this in regard to taking the higher priority tasks and moving
them outside of the db? In the original idea we should handle the case
where num_replicator_dbs > max_jobs: during each scheduling cycle, the
dbs whose front jobs have the lowest start times would get picked
first. But it would mean that some of the dbs won't get scheduled
during a given cycle; they might have to wait for the next one or
the one after, etc. That's unavoidable anyway. The key is that it
would still be fair across multiple scheduling cycles.

> >> Would a fair-share scheduler [1] make sense here? We can't borrow many
> >> of the OS-level scheduling ideas because we don't have the same insight
> >> into the runtime characteristics of our jobs, but fair share might be a
> >> good match.
> >>
> >> Here's one way it could be implemented:
> >>
> >> * Each replicator DB could represent a logical grouping of jobs, e.g.
> >>background continuous backup jobs, one-shot migration jobs,
> >>high-priority user requests, etc.
> >>
> >> * An INI-level config could determine how these groups are handled:
> >>
> >>* 100% fair (default) -- total available job slots are divided by
> >>  the number of replicator DBs, then allocated round-robin within
> >>  that DB. The difference from today's setup is that each DB gets
> >>  its own round-robin scheduler. I *think* this might be what Nick
> >>  was suggesting, but I can't be 100% sure.
> >>
> >
> > Yeah, I think that's what I had in mind too but just having an equal
> > priority between them. I like the idea of being able to tweak
> > priorities for repl

Re: [DISCUSS] Replicator scheduler improvement

2021-01-15 Thread Joan Touzet
On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> You're absolutely right. There are two places where job priority comes
> into play: one is when jobs are started, and then the other is when
> they are stopped. When starting jobs, both continuous and one-shot
> (normal) jobs are at the same priority. However when jobs are stopped,
> we try to stop only the continuous jobs. The only time we'd stop
> one-shot jobs is when the operator manually lowers the max_jobs config
> limit. This means there is an imbalance between the starting and
> stopping algorithms and eventually a large number of one-shot
> replications could completely crowd out and replace all the continuous
> ones. Continuous ones would periodically stop, and if the new ones
> started are one-shot, those would slowly accumulate over time.

I do think this is a concern, but I also think it can be addressed
separately from this proposal. I don't have a good answer re: fixing it.

> The idea is to still guarantee fairness with hopefully an easy to
> understand algorithm. For the 10 different replicator dbs we still
> round-robin among them, and pick the jobs with the lowest start time
> from each in turn, then, do the same thing next cycle, until we have
> enough jobs picked out. So for example if we had this setup:
> 
> db1: j4
> db2: j3, j5
> db3: j1, j2
> 
> Where the index for the jobs is the start time (j1 was last started
> before j2), we'd pick them as j1, j3, j4 during the first round-robin
> selection loop, then we'd be left with
> 
> db1 :
> db2 : j5
> db3 : j2
> 
> Next we'd pick j2 and j5 and so on. So overall we'd get
> j1,j3,j4,j2,j5. If we look at the dbs they came from it would be:
> db3,db2,db1,db3,db2. In case db1 had 98 jobs and db2 just 2:
> 
> db1: j1, j2, j3 - j41, j43 - j99
> db2: j42, j100
> 
> The algorithm would pick: j1, j42, j2, j100, j3, j4, j5, j6 - j99. The
> dbs they'd come from would be db1,db2,db1,db2,db1,db1,db1,... to make
> sure all the dbs get a fair chance to run their jobs. If there is one
> db only, the algorithm doesn't change from the current one. However,
> one thing worrying me here is this implies either a lot of sorting or
> keeping a scheduler queue per db, so things could end up being quite
> expensive performance-wise.

So this doesn't really let us do any sort of priority queue, which is
unfortunate. You can take your higher priority tasks and move them
outside of the db that has all of your background tasks in it, but if
you have > max_jobs/num_replicator_dbs in that separate db, some will
still get ignored. Fair share would provide a way to ensure starvation
doesn't occur.

(Also consider the situation where num_replicator_dbs > max_jobs...)

>> Would a fair-share scheduler [1] make sense here? We can't borrow many
>> of the OS-level scheduling ideas because we don't have the same insight
>> into the runtime characteristics of our jobs, but fair share might be a
>> good match.
>>
>> Here's one way it could be implemented:
>>
>> * Each replicator DB could represent a logical grouping of jobs, e.g.
>>background continuous backup jobs, one-shot migration jobs,
>>high-priority user requests, etc.
>>
>> * An INI-level config could determine how these groups are handled:
>>
>>* 100% fair (default) -- total available job slots are divided by
>>  the number of replicator DBs, then allocated round-robin within
>>  that DB. The difference from today's setup is that each DB gets
>>  its own round-robin scheduler. I *think* this might be what Nick
>>  was suggesting, but I can't be 100% sure.
>>
> 
> Yeah, I think that's what I had in mind too but just having an equal
> priority between them. I like the idea of being able to tweak
> priorities for replicator dbs. Thanks for the reference to the OS
> scheduler and the paper! I looked at the Multi-Level Feedback Queue
> (MLFQ) scheduler http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf
> which was used in Linux previously and is still used by some operating
> systems for CPU scheduling. The reason I looked at it is that initially I
> contemplated static user-configurable priorities, but realized we'd
> need something like MLFQ to dynamically tweak job priority to avoid
> the case where some jobs were permanently starved.

Yeah, I did a full review of MLFQ and the Linux schedulers before
suggesting Fair Share... I think the design goal of "is easier to
intuit" is an important one here, and I found it easier to reason about
my variant than yours. But I'm not attached to my idea; let's do whatever
is easiest to explain and meets the other design goals.

>> [Fair share worked example]
> 
> I think this might work, but could also starve lower priority jobs. I
> was thinking of the case where we have a high/_replicator:99 db which is
> fed new jobs continuously, and a low/_replicator:1 db which never gets a
> chance to run if its share gets rounded down to 0. I think one way to
> avoid that would be to ensure we pick at least one job

Re: [DISCUSS] Replicator scheduler improvement

2021-01-15 Thread Nick Vatamaniuc
On Thu, Jan 14, 2021 at 4:51 PM Joan Touzet  wrote:
>
> On 2021-01-12 12:55 p.m., Nick Vatamaniuc wrote:
> > Hello everyone
>
> Hi (Dr.) Nick! ;)

"One quick application of Spiffy and we got ourselves a better scheduler!"

I had to look up the Simpsons reference as I had completely forgotten
about him :-)

> > I wanted to see what we thought about adding a scheduling improvement
> > to the replicator. Specifically adding round-robin fairness amongst
> > different _replicator dbs
>
> We've needed some sort of priority system for a while -- critically, in
> some cases, our failure to offer something has forced my hand into
> recommending use of external replication scheduling techniques. I'd
> never thought of using multiple databases as a way of handling this
> problem, though.
>
> > Currently, the scheduler runs all the jobs in the system fairly. It
> > does it by using the jobs' "last started" timestamp to select which
> > jobs to stop and start next. So each scheduling cycle, the jobs which
> > ran the longest (have the lowest start time) get stopped, then, from
> > the list of pending jobs we also pick those with the lowest start
> > times (since they waited the longest) to run next.
>
> There is also some difference between continuous and one-shot
> replication jobs, isn't there? I seem to recall one-shot jobs get
> priority over continuous ones.

You're absolutely right. There are two places where job priority comes
into play: one is when jobs are started, and then the other is when
they are stopped. When starting jobs, both continuous and one-shot
(normal) jobs are at the same priority. However when jobs are stopped,
we try to stop only the continuous jobs. The only time we'd stop
one-shot jobs is when the operator manually lowers the max_jobs config
limit. This means there is an imbalance between the starting and
stopping algorithms and eventually a large number of one-shot
replications could completely crowd out and replace all the continuous
ones. Continuous ones would periodically stop, and if the new ones
started are one-shot, those would slowly accumulate over time.
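
Sketched in Python pseudo-code (the names are invented and this glosses
over the churn limit and error backoff, but it shows the shape of the
imbalance):

```
from collections import namedtuple

Rep = namedtuple("Rep", ["id", "continuous", "last_started"])

def rotation_candidates(running, pending, churn):
    """One rescheduling cycle: stop some running jobs to make room and
    start some pending ones. Stopping only ever considers continuous
    jobs; starting considers continuous and one-shot jobs equally."""
    stoppable = [j for j in running if j.continuous]  # one-shot jobs are never stopped
    to_stop = sorted(stoppable, key=lambda j: j.last_started)[:churn]
    to_start = sorted(pending, key=lambda j: j.last_started)[:churn]
    return to_stop, to_start

# A full scheduler where every pending job is one-shot: each cycle a
# continuous job is stopped and a one-shot job takes its slot.
running = [Rep("c%d" % i, True, i) for i in range(10)]
pending = [Rep("o%d" % i, False, 100 + i) for i in range(10)]
print(rotation_candidates(running, pending, churn=2))
```

If the pending list is dominated by one-shot jobs, every cycle swaps a
continuous job out for a one-shot one, which is the crowding-out effect
described above.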

> > However, this algorithm can be unfair among replication jobs from
> > different _replicator dbs. For example, if max_jobs=10 and one
> > _replicator db has 1000 jobs and another has only one doc, then
> > that one job might have to wait for quite a while to get its turn to
> > run. If we picked fairly amongst _replicator dbs, then for example,
> > the one job would get at least one of the 10 max_jobs run "slots",
> > and the remaining 9 slots would go to the replicator db with 1000
> > docs. If there were 11 _replicator dbs, then docs from the ones
> > with the lowest start times amongst them would be picked first.
>
> That does sound unfair. I wasn't aware of this "gotcha" with multiple
> replicator DBs (in the field I rarely see >1 replicator DB) but it
> intuitively makes sense - all replicator DBs are unioned in the
> scheduler's queue and treated like a single, big DB. (There is basically
> no advantage to multiple _replicator DBs today then, other than perhaps
> security / obscurity reasons, is there?)

Exactly, they are unioned without any regard to how they ended up in
the scheduler. Today we could use multiple dbs to group replication
jobs for security or for management, like you suggested. You can
delete a whole _replicator db and stop a whole group of jobs at once.
Or you could even replicate the _replicator dbs themselves around!

> > This feature would also allow running some quick replication jobs
> > when the queue is already full with main _replicator db jobs, by
> > creating a "quickjobs/_replicator" db and inserting these new jobs
> > there. With this new scheme they would start running right away even
> > if the queue is full with jobs from the main _replicator db. Another
> > related idea I had was to add per-job user-configurable priorities:
> > high/default/low or numeric. However, that scheme is not as good, as
> > it could lead to permanent starvation of jobs, while the round-robin
> > db scheduling scheme still guarantees all jobs will eventually run.
>
> I think this proposal can be improved. I like the idea of leveraging
> multiple replicator DBs to provide some sort of priority tweak, but I am
> not sure I can predict what will happen in the case of e.g. 10 different
> replicator DBs fighting with each other in the scheduler as you've laid
> it out.

The idea is to still guarantee fairness with hopefully an easy to
understand algorithm. For the 10 different replicator dbs we still
round-robin among them, and pick the jobs with the lowest start time
from each in turn, then, do the same thing next cycle, until we have
enough jobs picked out. So for example if we had this setup:

db1: j4
db2: j3, j5
db3: j1, j2

Where the index for the jobs is the start time (j1 was last started
before j2), we'd pick them as j1, j3, j4 during the first round-robin
selection loop, then we'd be left with

db1 :
db2 

Re: [DISCUSS] Replicator scheduler improvement

2021-01-14 Thread Joan Touzet

On 2021-01-12 12:55 p.m., Nick Vatamaniuc wrote:
> Hello everyone

Hi (Dr.) Nick! ;)

> I wanted to see what we thought about adding a scheduling improvement
> to the replicator. Specifically adding round-robin fairness amongst
> different _replicator dbs


We've needed some sort of priority system for a while -- critically, in
some cases, our failure to offer something has forced my hand into
recommending use of external replication scheduling techniques. I'd
never thought of using multiple databases as a way of handling this
problem, though.

> Currently, the scheduler runs all the jobs in the system fairly. It
> does it by using the jobs' "last started" timestamp to select which
> jobs to stop and start next. So each scheduling cycle, the jobs which
> ran the longest (have the lowest start time) get stopped, then, from
> the list of pending jobs we also pick those with the lowest start
> times (since they waited the longest) to run next.


There is also some difference between continuous and one-shot
replication jobs, isn't there? I seem to recall one-shot jobs get
priority over continuous ones.

> However, this algorithm can be unfair among replication jobs from
> different _replicator dbs. For example, if max_jobs=10 and one
> _replicator db has 1000 jobs and another has only one doc, then
> that one job might have to wait for quite a while to get its turn to
> run. If we picked fairly amongst _replicator dbs, then for example,
> the one job would get at least one of the 10 max_jobs run "slots",
> and the remaining 9 slots would go to the replicator db with 1000
> docs. If there were 11 _replicator dbs, then docs from the ones
> with the lowest start times amongst them would be picked first.


That does sound unfair. I wasn't aware of this "gotcha" with multiple
replicator DBs (in the field I rarely see >1 replicator DB) but it
intuitively makes sense - all replicator DBs are unioned in the
scheduler's queue and treated like a single, big DB. (There is basically
no advantage to multiple _replicator DBs today then, other than perhaps
security / obscurity reasons, is there?)

> This feature would also allow running some quick replication jobs
> when the queue is already full with main _replicator db jobs, by
> creating a "quickjobs/_replicator" db and inserting these new jobs
> there. With this new scheme they would start running right away even
> if the queue is full with jobs from the main _replicator db. Another
> related idea I had was to add per-job user-configurable priorities:
> high/default/low or numeric. However, that scheme is not as good, as
> it could lead to permanent starvation of jobs, while the round-robin
> db scheduling scheme still guarantees all jobs will eventually run.


I think this proposal can be improved. I like the idea of leveraging
multiple replicator DBs to provide some sort of priority tweak, but I am
not sure I can predict what will happen in the case of e.g. 10 different
replicator DBs fighting with each other in the scheduler as you've laid
it out.

Would a fair-share scheduler [1] make sense here? We can't borrow many
of the OS-level scheduling ideas because we don't have the same insight
into the runtime characteristics of our jobs, but fair share might be a
good match.

Here's one way it could be implemented:

* Each replicator DB could represent a logical grouping of jobs, e.g.
  background continuous backup jobs, one-shot migration jobs,
  high-priority user requests, etc.

* An INI-level config could determine how these groups are handled:

  * 100% fair (default) -- total available job slots are divided by
the number of replicator DBs, then allocated round-robin within
that DB. The difference from today's setup is that each DB gets
its own round-robin scheduler. I *think* this might be what Nick
was suggesting, but I can't be 100% sure.

  * Specific priority -- A list of DBs and their fixed percentages.
Example: _replicator:30,high/_replicator:50,low/_replicator:20

If a configured database is not present, or has no active jobs in
it, its fixed proportion of cycles are "given up" to the other
replicator DBs.

* Jobs are then allocated based on these breakdowns, and we continue
  to use round-robin within each database to maintain the current
  approach. This means no visible change for upgrading users with a
  single _replicator DB.


Example, with the numbers picked to make the math turn out nice:

```
[replicator]
db_priorities = _replicator:30,high/_replicator:50,low/_replicator:20
max_jobs = 100
```

The % allocation of jobs across DBs is, by default:

  _replicator:      30 / (30+50+20) = 30%
  high/_replicator: 50 / (30+50+20) = 50%
  low/_replicator:  20 / (30+50+20) = 20%

First, put 100 jobs into _replicator. Because the other two DBs have no
jobs, all 100 free slots are allocated to _replicator's jobs:

  _replicator:  30 / (30) = 100% * 100 jobs = 100

Now, we add another 100 jobs into the high DB. Since low is empty, we
recalculate the dist
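
A rough Python sketch of the allocation rule described above (the
db_priorities parsing and the names are hypothetical, and it adds an
at-least-one-slot guard so a small share cannot round down to zero,
which addresses the starvation concern raised elsewhere in the thread):

```
def allocate_slots(db_priorities, active_jobs, max_jobs):
    """Split max_jobs slots across replicator DBs in proportion to their
    configured priorities. DBs that are absent or have no runnable jobs
    give their share up to the rest; every live DB gets at least one slot."""
    live = {db: p for db, p in db_priorities.items()
            if active_jobs.get(db, 0) > 0}
    if not live:
        return {}
    total = sum(live.values())
    slots = {db: max(1, max_jobs * p // total) for db, p in live.items()}
    # Integer division can leave a few slots unassigned; hand them to the
    # highest-priority DBs. (A real implementation would also cap each DB
    # at its number of runnable jobs.)
    leftover = max_jobs - sum(slots.values())
    for db in sorted(live, key=live.get, reverse=True):
        if leftover <= 0:
            break
        slots[db] += 1
        leftover -= 1
    return slots

priorities = {"_replicator": 30, "high/_replicator": 50, "low/_replicator": 20}
# Only _replicator has jobs: it inherits the idle DBs' shares.
print(allocate_slots(priorities, {"_replicator": 100}, 100))
# -> {'_replicator': 100}
# _replicator and high/_replicator both have jobs; low's 20% is given up.
print(allocate_slots(priorities, {"_replicator": 100, "high/_replicator": 100}, 100))
# -> {'_replicator': 37, 'high/_replicator': 63}
```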

[DISCUSS] Replicator scheduler improvement

2021-01-12 Thread Nick Vatamaniuc
Hello everyone

I wanted to see what we thought about adding a scheduling improvement
to the replicator. Specifically adding round-robin fairness amongst
different _replicator dbs.

Currently, the scheduler runs all the jobs in the system fairly. It
does it by using the jobs' "last started" timestamp to select which
jobs to stop and start next. So each scheduling cycle, the jobs which
ran the longest (have the lowest start time) get stopped, then, from
the list of pending jobs we also pick those with the lowest start
times (since they waited the longest) to run next. However, this
algorithm can be unfair among replication jobs from different
_replicator dbs. For example, if max_jobs=10 and one _replicator db
has 1000 jobs and another has only one doc, then that one job might
have to wait for quite a while to get its turn to run. If we picked
fairly amongst _replicator dbs, then for example, the one job would
get at least one of the 10 max_jobs run "slots", and the remaining 9
slots would go to the replicator db with 1000 docs. If there were
11 _replicator dbs, then docs from the ones with the lowest start
times amongst them would be picked first.
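
A tiny self-contained Python illustration of that wait (the db names and
timestamps are invented; the real scheduler keys on actual "last started"
timestamps, but the ordering effect is the same):

```
from collections import namedtuple

Job = namedtuple("Job", ["db", "last_started"])

# Today all pending jobs land in one pool ordered only by how long they
# have waited, regardless of which _replicator db they came from.
pending = [Job("big/_replicator", t) for t in range(1000)]
pending.append(Job("small/_replicator", 1000))

# Each cycle starts the max_jobs (=10) oldest entries from the pool, so the
# single small/_replicator job waits behind every earlier big/_replicator job.
next_to_run = sorted(pending, key=lambda j: j.last_started)[:10]
assert all(j.db == "big/_replicator" for j in next_to_run)
```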

This feature would also allow running some quick replication jobs
when the queue is already full with main _replicator db jobs, by
creating a "quickjobs/_replicator" db and inserting these new jobs
there. With this new scheme they would start running right away even
if the queue is full with jobs from the main _replicator db. Another
related idea I had was to add per-job user-configurable priorities:
high/default/low or numeric. However, that scheme is not as good, as it
could lead to permanent starvation of jobs, while the round-robin db
scheduling scheme still guarantees all jobs will eventually run.

Would this feature be useful to have? I was thinking of giving it a
try on a branch. I suspect implementing this for 3.x might be a tad
simpler since scheduling is done in memory independently on each
separate node, so I was thinking of starting there first. For main
(future 4.x) we might require some coordination state to live in FDB
and would have to possibly patch up couch_jobs to know about
priorities.

Cheers,
-Nick