I created an RFC based on the discussion above:
https://github.com/apache/couchdb-documentation/pull/617

-Nick

On Sun, Jan 17, 2021 at 12:22 AM Nick Vatamaniuc <vatam...@gmail.com> wrote:
>
> On Fri, Jan 15, 2021 at 6:41 PM Joan Touzet <woh...@apache.org> wrote:
> >
> > On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> > > You're absolutely right. There are two places where job priority comes
> > > into play: one is when jobs are started, and then the other is when
> > > they are stopped. When starting jobs, both continuous and one-shot
> > > (normal) jobs are at the same priority. However when jobs are stopped,
> > > we try to stop only the continuous jobs. The only time we'd stop
> > > one-shot jobs is when the operator manually lowers the max_jobs config
> > > limit. This means there is an imbalance between the starting and
> > > stopping algorithms and eventually a large number of one-shot
> > > replications could completely crowd out and replace all the continuous
> > > ones. Continuous ones would periodically stop, and if the new ones
> > > started are one-shot, those would slowly accumulate over time.
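> > >
> > > To illustrate the dynamic, here is a toy Python sketch (not the actual
> > > scheduler code; the job fields and names are made up) of how the
> > > asymmetric stop rule lets one-shot jobs take over the slots:
> > >
> > > from collections import namedtuple
> > > from itertools import count
> > >
> > > Job = namedtuple("Job", ["id", "continuous"])
> > >
> > > max_jobs = 3
> > > # Start with every slot holding a continuous replication.
> > > running = [Job(f"c{i}", True) for i in range(max_jobs)]
> > > pending_ids = count()
> > >
> > > for cycle in range(5):
> > >     # Stopping pass: only continuous jobs are candidates to stop.
> > >     continuous = [j for j in running if j.continuous]
> > >     if continuous:
> > >         running.remove(continuous[0])
> > >         # Starting pass: the freed slot may go to a pending
> > >         # one-shot (normal) job instead.
> > >         running.append(Job(f"n{next(pending_ids)}", False))
> > >
> > > # After three cycles every slot holds a one-shot job.
> > > print(running)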
> >
> > I do think this is a concern, but I also think it can be addressed
> > separately from this proposal. I don't have a good answer re: fixing it.
> >
> > > The idea is to still guarantee fairness with hopefully an easy to
> > > understand algorithm. For the 10 different replicator dbs we still
> > > round-robin among them, and pick the jobs with the lowest start time
> > > from each in turn, then, do the same thing next cycle, until we have
> > > enough jobs picked out. So for example if we had this setup:
> > >
> > > db1: j4
> > > db2: j3, j5
> > > db3: j1, j2
> > >
> > > Where the index for the jobs is the start time (j1 was started
> > > before j2), we'd pick them as j1, j3, j4 during the first round-robin
> > > selection loop, then we'd be left with
> > >
> > > db1 :
> > > db2 : j5
> > > db3 : j2
> > >
> > > Next we'd pick j2 and j5 and so on. So overall we'd get
> > > j1,j3,j4,j2,j5. If we look at the dbs they came from it would be:
> > > db3,db2,db1,db3,db2. In case db1 had 98 jobs and db2 just 2:
> > >
> > > db1: j1, j2, j3 - j41, j43 - j99
> > > db2: j42, j100
> > >
> > > The algorithm would pick: j1, j42, j2, j100, j3, j4, j5, ..., j41,
> > > j43, ..., j99. The dbs they'd come from would be
> > > db1,db2,db1,db2,db1,db1,db1,... to make
> > > sure all the dbs get a fair chance to run their jobs. If there is one
> > > db only, the algorithm doesn't change from the current one. However,
> > > one thing worrying me here is this implies either a lot of sorting or
> > > keeping a scheduler queue per db, so things could end up being quite
> > > expensive performance-wise.
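> > >
> > > To make the selection pass concrete, here's a rough Python sketch of
> > > what I'm describing (all the names are made up; the real thing would
> > > live in the replication scheduler):
> > >
> > > from collections import namedtuple
> > >
> > > Job = namedtuple("Job", ["id", "start_time"])
> > >
> > > def select_jobs(jobs_by_db, max_jobs):
> > >     # Oldest-first queue of pending jobs per replicator db.
> > >     queues = {db: sorted(jobs, key=lambda j: j.start_time)
> > >               for db, jobs in jobs_by_db.items() if jobs}
> > >     selected = []
> > >     while queues and len(selected) < max_jobs:
> > >         # One round-robin pass: visit dbs in order of their oldest
> > >         # pending job and take one job from each.
> > >         for db in sorted(queues, key=lambda d: queues[d][0].start_time):
> > >             if len(selected) >= max_jobs:
> > >                 break
> > >             selected.append(queues[db].pop(0))
> > >             if not queues[db]:
> > >                 del queues[db]
> > >     return selected
> > >
> > > jobs_by_db = {
> > >     "db1": [Job("j4", 4)],
> > >     "db2": [Job("j3", 3), Job("j5", 5)],
> > >     "db3": [Job("j1", 1), Job("j2", 2)],
> > > }
> > > # Prints ['j1', 'j3', 'j4', 'j2', 'j5'], i.e. db3, db2, db1, db3, db2.
> > > print([j.id for j in select_jobs(jobs_by_db, 5)])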
> >
> > So this doesn't really let us do any sort of priority queue, which is
> > unfortunate. You can take your higher priority tasks and move them
> > outside of the db that has all of your background tasks in it, but if
> > you have > max_jobs/num_replicator_dbs jobs in that separate db, some will
> > still get ignored. Fair share would provide a way to ensure starvation
> > doesn't occur.
> >
>
> I think my idea would just be a simpler version of fair share where
> the shares are always equal amongst the dbs, and without the need to
> explicitly handle the concept of a "share". Jobs would just start and
> stop based on a db round-robin schedule, then based on start times. So
> the few jobs in a separate db would get a higher priority compared to
> the priority of a db with lots of jobs in it. But as you said, that
> would not be an explicit priority mechanism, so it might be awkward to
> use for boosting priorities.
>
> > (Also consider the situation where num_replicator_dbs > max_jobs...)
> >
>
> Was this in regards to taking the higher priority tasks and moving
> them outside of the db? In the original idea we should handle the case
> where num_replicator_dbs > max_jobs: during each scheduling cycle, the
> dbs whose oldest (front) jobs have the lowest start times would get
> picked first. It would mean that some dbs don't get scheduled during a
> given cycle; they might have to wait for the next one or
> the one after, etc. But that's unavoidable anyway. The key is that it
> would still be fair across multiple scheduling cycles.
>
> > >> Would a fair-share scheduler [1] make sense here? We can't borrow many
> > >> of the OS-level scheduling ideas because we don't have the same insight
> > >> into the runtime characteristics of our jobs, but fair share might be a
> > >> good match.
> > >>
> > >> Here's one way it could be implemented:
> > >>
> > >> * Each replicator DB could represent a logical grouping of jobs, e.g.
> > >>    background continuous backup jobs, one-shot migration jobs,
> > >>    high-priority user requests, etc.
> > >>
> > >> * An INI-level config could determine how these groups are handled:
> > >>
> > >>    * 100% fair (default) -- total available job slots are divided by
> > >>      the number of replicator DBs, then allocated round-robin within
> > >>      that DB. The difference from today's setup is that each DB gets
> > >>      its own round-robin scheduler. I *think* this might be what Nick
> > >>      was suggesting, but I can't be 100% sure.
> > >>
> > >
> > > Yeah, I think that's what I had in mind too, just with equal
> > > priority between the dbs. I like the idea of being able to tweak
> > > priorities for replicator dbs. Thanks for the reference to the OS
> > > scheduler and the paper! I looked at the Multi-Level Feedback Queue
> > > (MLFQ) scheduler http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf
> > > which was used in Linux previously and is still used by some operating
> > > systems for CPU scheduling. The reason I looked at it is that initially
> > > I contemplated static, user-configurable priorities, but realized we'd
> > > need something like MLFQ to dynamically tweak job priority to avoid the
> > > case where some jobs were permanently starved.
> >
> > Yeah, I did a full review of MLFQ and the Linux schedulers before
> > suggesting Fair Share... I think the design goal of being easy to
> > intuit is an important one here, and I found it easier to reason about
> > my variant than yours. But I'm not attached to my idea; let's do
> > whatever is easiest to explain and meets the other design goals.
>
> That makes sense. Thank you for the help with the design. I like the
> fair share idea and will give that a try first.
>
> > >> [Fair share worked example]
> > >
> > > I think this might work, but it could also starve lower priority jobs.
> > > I was thinking of the case where we have a high/_replicator:99 db which
> > > is continuously fed new jobs, and a low/_replicator:1 db which never
> > > gets a chance to run because its share gets rounded down to 0. One way
> > > to avoid that would be to ensure we pick at least one job from each db
> > > and then apply the proportional shares, or just make 1 the minimum when
> > > rounding down...
> >
> > I would consider that user error in configuration, which I guess is
> > similar to the corner case in your proposal of having >max_jobs
> > replicator DBs.
> >
> > But yeah, max(1, <formula previously presented>) solves that problem
> > nicely too.
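> >
> > Roughly like this (a toy Python sketch with a floor of one slot per db;
> > allocate_slots and the share values are just placeholders, not the exact
> > formula from my earlier example):
> >
> > def allocate_slots(shares, max_jobs):
> >     # Split the available job slots across replicator dbs in
> >     # proportion to their configured shares, but never round a
> >     # db all the way down to zero running jobs.
> >     total = sum(shares.values())
> >     return {db: max(1, max_jobs * share // total)
> >             for db, share in shares.items()}
> >
> > # With only 20 slots, low/_replicator:1 would round down to 0
> > # without the floor; with it, the db still gets one running job.
> > # Prints {'high/_replicator': 19, 'low/_replicator': 1}.
> > print(allocate_slots({"high/_replicator": 99, "low/_replicator": 1}, 20))
> >
> > The floor can oversubscribe max_jobs slightly when there are many
> > tiny-share dbs, which I'd file under the same user-error-in-configuration
> > corner case mentioned above.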
> >
> > > The paper and your proposal present an interesting way of
> > > "materializing" the proportional shares as an explicit quantity; the
> > > current scheme and mine don't have that part (there is no
> > > "scheduler_share" variable, we just pick jobs in order of start time).
> > > But I like the idea and it's worth giving it a try. It might be one of
> > > those things where, once we start playing with the code, some parts
> > > become more obvious and easy to do and others turn out to be too
> > > complicated.
> >
> > Sure, let's try it!
> >
> > > Thank you, Joan, for the great suggestions and for taking a look!
> >
> > My pleasure!
> >
> > If you need a use case that is hard to reason about, consider a
> > resource-constrained cluster (where # of replications vastly outstrips
> > resources available) and you need to ensure fairness across a very
> > large number of continuous and one-shot replications. Within these, the
> > one-shot replications are considered higher priority, but they should
> > not completely
> > drown out the continuous ones.
>
> This is a tricky one, yeah. We've done this by design, but it has its
> issues. An interesting thing to think about with fair share
> configurability is what happens if a user assigns all the one-shot
> replications to a low/_replicator:10 db and then, after those jobs are
> started, adds lots of continuous replications to their default
> _replicator db. Would we break the "snapshotting" promise and stop the
> one-shot replications to make room for the now higher priority
> continuous ones, or do we keep them running? I would think we should
> probably stick to keeping them running, and just rely on them starting
> at a slower rate perhaps, since their priority is :10 vs say :90 in
> the default db...
>
> > I can give more detail on the example if it sounds interesting to you -
> > this was a client system I worked on for a while (after it was too late
> > to improve the design) and we never were able to get CouchDB's
> > replication scheduler to do the right thing. Everything got replaced by
> > an external scheduler that knew what max_jobs was, and did everything as
> > one-shot replications.
>
> Ah that's sad to hear. Sounds like this scheduling improvement is long 
> overdue.
>
> >
> > -Joan
>
> -Nick
