[crossposted to -stable as relevant there]

Hi,

as promised, a first version of the Proportional Share scheduler
that we developed is available at

    http://info.iet.unipi.it/~luigi/ps_sched.20020719a.diff

The patch is for a recent -STABLE (I think any version from 4.4 should
work; the only three files modified are kern_synch.c, kern_switch.c and
proc.h, plus a one-line change to kern_exit.c).

I have tested it a little on a diskless system, and it seems
to survive running a full X session with the usual set of xterm,
netscape etc. while I renice processes and even switch back and
forth between schedulers. But do not trust it yet on a production
system!

The sysctl variable kern.scheduler controls the scheduler in use.

    kern.scheduler=1 (default) selects Proportional Share
    kern.scheduler=0 selects the Feedback Priority ("classic BSD")

You can switch between the two schedulers by changing the value of
the sysctl variable.

To change the "weight" associated with a process, use the "nice" and
"renice" commands. As usual, positive values (+1..+20) mean the
process gets a smaller share of the CPU, and negative values (-1..-20)
mean it gets a larger share. The actual weights (which control the
relative fraction of CPU cycles given to each process) are assigned
through a lookup table which maps the nice value into the range
  1 ... 100 ... 1000 (min, normal, max weight).

The "old" scheduler (Feedback Priority) should be as robust and
stable as always, whereas there are still a few things to clean up
in the Proportional Share scheduler, namely:

  * I have not looked at the SMP case in detail, so do not run
    it on an SMP kernel;

  * given that there are no priorities, processes woken up by
    tsleep() are not guaranteed to run before all processes
    with p_priority >= PUSER; however, they are given a shorter
    deadline, so they are likely to run first;

  * RTPRI and IDLEPRI processes do not yet get any special treatment
    (they all end up together with normal processes; this is trivial
    to fix, I just haven't had time to look at it).

Have fun, and if you use it please let me know of any bugs,
along with suggestions for adapting it to -current.

        cheers
        luigi

On Tue, Jul 16, 2002 at 11:52:16PM -0700, Luigi Rizzo wrote:
> Hi,
> we have implemented a weight-based process scheduler
> for FreeBSD-stable and are trying to port it to -current (in the
> description below, replace "process" with "thread/kse" as appropriate
> for the -current case).
> 
> In order to make this work, it is convenient to have all
> scheduler-specific functions and data structures in a single
> file (kern_switch*.c), and the generic support in another
> (kern_synch.c). I believe this was also the original BSD design
> for partitioning the code between the two files. However, in our
> current code, some scheduler-specific functions (see below) are
> misplaced.
> 
> So I would like to make the following purely cosmetic change,
> identified as step #1 below, both for consistency with what I
> believe to be the original design and to simplify further work
> in this area. It would go into both -current and -stable; I
> repeat, it is only a cosmetic change.
> 
> Comments? I have already spoken to julian, who has no objections.
> 
> For those interested, a patch for -current is attached, and the one
> for -stable is very similar (for the record, in -stable we have a
> fully functional weight-based process scheduler, where you still
> control the weights using "nice", and you can switch between the
> old and the new scheduler at any time using a sysctl variable).
> 
> --- Re. the multi-scheduler architecture ---
> 
> The general idea is to make the process/thread/kse scheduler
> a replaceable piece of the kernel, requiring no modifications
> to "struct proc", and with the ability to switch from one
> scheduler to another at runtime (both for testing purposes
> and for whatever need may arise).
> 
> 
> The way to achieve this would be the following:
> 
>  1. identify all scheduler-specific functions, put all of them
>     in one file (e.g. kern_switch.c for the default scheduler),
>     and define function pointers for all of them;
> 
>  2. use one field in "struct proc" as a link to whatever
>     scheduler-specific data structures are necessary.
>     In -stable, we have the p_pad3 field that can be used
>     for this purpose, much like the if_index field in struct ifnet.
> 
>  3. implement a function which, under control of a sysctl call,
>     activates a new scheduler (by initialising all the function
>     pointers to appropriate values) and moves all processes from
>     the old scheduler to the new one.
> 
> Step #1 is basically a cosmetic change, requiring mostly a move of
> some functions from kern_synch.c to kern_switch.c. These are
> 
>       roundrobin();
> 
>       curpriority_cmp();
> 
>       resetpriority();
> 
>       parts of schedcpu() and schedclock();
> 
> cheers
> luigi
> ----------------------------------
> 
> Index: kern_switch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_switch.c,v
> retrieving revision 1.33
> diff -u -r1.33 kern_switch.c
> --- kern_switch.c     14 Jul 2002 03:43:33 -0000      1.33
> +++ kern_switch.c     16 Jul 2002 22:15:06 -0000
> @@ -97,6 +97,7 @@
>  #include <sys/mutex.h>
>  #include <sys/proc.h>
>  #include <sys/queue.h>
> +#include <sys/resourcevar.h>
>  #include <machine/critical.h>
>  
>  CTASSERT((RQB_BPW * RQB_LEN) == RQ_NQS);
> @@ -107,6 +108,9 @@
>  static struct runq runq;
>  SYSINIT(runq, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, runq_init, &runq)
>  
> +static struct callout roundrobin_callout;
> +
> +static void  roundrobin(void *arg);
>  static void runq_readjust(struct runq *rq, struct kse *ke);
>  /************************************************************************
>   * Functions that manipulate runnability from a thread perspective.  *
> @@ -442,9 +446,13 @@
>  {
>       int i;
>  
> +     callout_init(&roundrobin_callout, 0);
> +
>       bzero(rq, sizeof *rq);
>       for (i = 0; i < RQ_NQS; i++)
>               TAILQ_INIT(&rq->rq_queues[i]);
> +
> +     roundrobin(NULL);
>  }
>  
>  /*
> @@ -719,3 +727,207 @@
>  }
>  #endif
>  
> +/*
> + * -- formerly in kern_synch.c
> + */
> +
> +int curpriority_cmp(struct thread *td);
> +
> +int
> +curpriority_cmp(struct thread *td)
> +{
> +     return (td->td_priority - curthread->td_priority);
> +}
> +
> +/*
> + * Force switch among equal priority processes every 100ms.
> + * We don't actually need to force a context switch of the current process.
> + * The act of firing the event triggers a context switch to softclock() and
> + * then switching back out again which is equivalent to a preemption, thus
> + * no further work is needed on the local CPU.
> + */
> +/* ARGSUSED */
> +static void
> +roundrobin(arg)
> +     void *arg;
> +{
> +
> +#ifdef SMP
> +     mtx_lock_spin(&sched_lock);
> +     forward_roundrobin();
> +     mtx_unlock_spin(&sched_lock);
> +#endif
> +
> +     callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> +}
> +
> +/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
> +#define loadfactor(loadav)      (2 * (loadav))
> +#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
> +extern fixpt_t ccpu;
> +#define CCPU_SHIFT      11   /* XXX dup from kern_synch.c! */
> +
> +/*
> + * Recompute process priorities, every hz ticks.
> + * MP-safe, called without the Giant mutex.
> + */
> +void schedcpu1(struct ksegrp *kg);
> +/* ARGSUSED */
> +void
> +schedcpu1(struct ksegrp *kg)
> +{
> +     register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> +     struct thread *td;
> +     struct kse *ke;
> +     int realstathz;
> +     int awake;
> +
> +        realstathz = stathz ? stathz : hz;
> +
> +     awake = 0;
> +     FOREACH_KSE_IN_GROUP(kg, ke) {
> +             /*
> +              * Increment time in/out of memory and sleep
> +              * time (if sleeping).  We ignore overflow;
> +              * with 16-bit int's (remember them?)
> +              * overflow takes 45 days.
> +              */
> +             /* XXXKSE **WRONG***/
> +             /*
> +              * the kse slptimes are not touched in wakeup
> +              * because the thread may not HAVE a KSE
> +              */
> +             if ((ke->ke_state == KES_ONRUNQ) ||
> +                 ((ke->ke_state == KES_THREAD) &&
> +                 (ke->ke_thread->td_state == TDS_RUNNING))) {
> +                     ke->ke_slptime++;
> +             } else {
> +                     ke->ke_slptime = 0;
> +                     awake = 1;
> +             }
> +
> +             /*
> +              * pctcpu is only for ps?
> +              * Do it per kse.. and add them up at the end?
> +              * XXXKSE
> +              */
> +             ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> +             /*
> +              * If the kse has been idle the entire second,
> +              * stop recalculating its priority until
> +              * it wakes up.
> +              */
> +             if (ke->ke_slptime > 1) {
> +                     continue;
> +             }
> +
> +#if  (FSHIFT >= CCPU_SHIFT)
> +             ke->ke_pctcpu += (realstathz == 100) ?
> +                 ((fixpt_t) ke->ke_cpticks) <<
> +                 (FSHIFT - CCPU_SHIFT) :
> +                 100 * (((fixpt_t) ke->ke_cpticks) <<
> +                 (FSHIFT - CCPU_SHIFT)) / realstathz;
> +#else
> +             ke->ke_pctcpu += ((FSCALE - ccpu) *
> +                 (ke->ke_cpticks * FSCALE / realstathz)) >>
> +                 FSHIFT;
> +#endif
> +             ke->ke_cpticks = 0;
> +     } /* end of kse loop */
> +     if (awake == 0) {
> +             kg->kg_slptime++;
> +     } else {
> +             kg->kg_slptime = 0;
> +     }
> +     kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> +     resetpriority(kg);
> +     FOREACH_THREAD_IN_GROUP(kg, td) {
> +             int changedqueue;
> +             if (td->td_priority >= PUSER) {
> +                     /*
> +                      * Only change the priority
> +                      * of threads that are still at their
> +                      * user priority. 
> +                      * XXXKSE This is problematic
> +                      * as we may need to re-order
> +                      * the threads on the KSEG list.
> +                      */
> +                     changedqueue =
> +                         ((td->td_priority / RQ_PPQ) !=
> +                          (kg->kg_user_pri / RQ_PPQ));
> +
> +                     td->td_priority = kg->kg_user_pri;
> +                     if (changedqueue &&
> +                         td->td_state == TDS_RUNQ) {
> +                             /* this could be optimised */
> +                             remrunqueue(td);
> +                             td->td_priority =
> +                                 kg->kg_user_pri;
> +                             setrunqueue(td);
> +                     } else {
> +                             td->td_priority = kg->kg_user_pri;
> +                     }
> +             }
> +     }
> +}
> +
> +/*
> + * Compute the priority of a process when running in user mode.
> + * Arrange to reschedule if the resulting priority is better
> + * than that of the current process.
> + */
> +void
> +resetpriority(kg)
> +     register struct ksegrp *kg;
> +{
> +     register unsigned int newpriority;
> +     struct thread *td;
> +
> +     mtx_lock_spin(&sched_lock);
> +     if (kg->kg_pri_class == PRI_TIMESHARE) {
> +             newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> +                 NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> +             newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> +                 PRI_MAX_TIMESHARE);
> +             kg->kg_user_pri = newpriority;
> +     }
> +     FOREACH_THREAD_IN_GROUP(kg, td) {
> +             maybe_resched(td);                      /* XXXKSE silly */
> +     }
> +     mtx_unlock_spin(&sched_lock);
> +}
> +
> +#if 0
> +/*
> + * We adjust the priority of the current process.  The priority of
> + * a process gets worse as it accumulates CPU time.  The cpu usage
> + * estimator (p_estcpu) is increased here.  resetpriority() will
> + * compute a different priority each time p_estcpu increases by
> + * INVERSE_ESTCPU_WEIGHT
> + * (until MAXPRI is reached).  The cpu usage estimator ramps up
> + * quite quickly when the process is running (linearly), and decays
> + * away exponentially, at a rate which is proportionally slower when
> + * the system is busy.  The basic principle is that the system will
> + * 90% forget that the process used a lot of CPU time in 5 * loadav
> + * seconds.  This causes the system to favor processes which haven't
> + * run much recently, and to round-robin among other processes.
> + */
> +void
> +schedclock(td)
> +     struct thread *td;
> +{
> +     struct kse *ke;
> +     struct ksegrp *kg;
> +
> +     KASSERT((td != NULL), ("schedlock: null thread pointer"));
> +     ke = td->td_kse;
> +     kg = td->td_ksegrp;
> +     ke->ke_cpticks++;
> +     kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
> +     if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
> +             resetpriority(kg);
> +             if (td->td_priority >= PUSER)
> +                     td->td_priority = kg->kg_user_pri;
> +     }
> +}
> +#endif
> Index: kern_synch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.185
> diff -u -r1.185 kern_synch.c
> --- kern_synch.c      14 Jul 2002 03:43:33 -0000      1.185
> +++ kern_synch.c      16 Jul 2002 22:16:46 -0000
> @@ -67,6 +67,8 @@
>  
>  #include <machine/cpu.h>
>  
> +void schedcpu1(struct ksegrp *kg); /* XXX in kern_switch.c */
> +
>  static void sched_setup(void *dummy);
>  SYSINIT(sched_setup, SI_SUB_KICK_SCHEDULER, SI_ORDER_FIRST, sched_setup, NULL)
>  
> @@ -76,7 +78,6 @@
>  
>  static struct callout loadav_callout;
>  static struct callout schedcpu_callout;
> -static struct callout roundrobin_callout;
>  
>  struct loadavg averunnable =
>       { {0, 0, 0}, FSCALE };  /* load average, of runnable procs */
> @@ -92,7 +93,6 @@
>  
>  static void  endtsleep(void *);
>  static void  loadav(void *arg);
> -static void  roundrobin(void *arg);
>  static void  schedcpu(void *arg);
>  
>  static int
> @@ -135,28 +135,6 @@
>  }
>  
>  /*
> - * Force switch among equal priority processes every 100ms.
> - * We don't actually need to force a context switch of the current process.
> - * The act of firing the event triggers a context switch to softclock() and
> - * then switching back out again which is equivalent to a preemption, thus
> - * no further work is needed on the local CPU.
> - */
> -/* ARGSUSED */
> -static void
> -roundrobin(arg)
> -     void *arg;
> -{
> -
> -#ifdef SMP
> -     mtx_lock_spin(&sched_lock);
> -     forward_roundrobin();
> -     mtx_unlock_spin(&sched_lock);
> -#endif
> -
> -     callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> -}
> -
> -/*
>   * Constants for digital decay and forget:
>   *   90% of (p_estcpu) usage in 5 * loadav time
>   *   95% of (p_pctcpu) usage in 60 seconds (load insensitive)
> @@ -225,7 +203,7 @@
>  #define      decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
>  
>  /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
> -static fixpt_t       ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
> +fixpt_t      ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
>  SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");
>  
>  /* kernel uses `FSCALE', userland (SHOULD) use kern.fscale */
> @@ -255,105 +233,15 @@
>  schedcpu(arg)
>       void *arg;
>  {
> -     register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> -     struct thread *td;
>       struct proc *p;
> -     struct kse *ke;
>       struct ksegrp *kg;
> -     int realstathz;
> -     int awake;
>  
> -     realstathz = stathz ? stathz : hz;
>       sx_slock(&allproc_lock);
>       FOREACH_PROC_IN_SYSTEM(p) {
>               mtx_lock_spin(&sched_lock);
>               p->p_swtime++;
>               FOREACH_KSEGRP_IN_PROC(p, kg) { 
> -                     awake = 0;
> -                     FOREACH_KSE_IN_GROUP(kg, ke) {
> -                             /*
> -                              * Increment time in/out of memory and sleep
> -                              * time (if sleeping).  We ignore overflow;
> -                              * with 16-bit int's (remember them?)
> -                              * overflow takes 45 days.
> -                              */
> -                             /* XXXKSE **WRONG***/
> -                             /*
> -                              * the kse slptimes are not touched in wakeup
> -                              * because the thread may not HAVE a KSE
> -                              */
> -                             if ((ke->ke_state == KES_ONRUNQ) ||
> -                                 ((ke->ke_state == KES_THREAD) &&
> -                                 (ke->ke_thread->td_state == TDS_RUNNING))) {
> -                                     ke->ke_slptime++;
> -                             } else {
> -                                     ke->ke_slptime = 0;
> -                                     awake = 1;
> -                             }
> -
> -                             /*
> -                              * pctcpu is only for ps?
> -                              * Do it per kse.. and add them up at the end?
> -                              * XXXKSE
> -                              */
> -                             ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> -                             /*
> -                              * If the kse has been idle the entire second,
> -                              * stop recalculating its priority until
> -                              * it wakes up.
> -                              */
> -                             if (ke->ke_slptime > 1) {
> -                                     continue;
> -                             }
> -
> -#if  (FSHIFT >= CCPU_SHIFT)
> -                             ke->ke_pctcpu += (realstathz == 100) ?
> -                                 ((fixpt_t) ke->ke_cpticks) <<
> -                                 (FSHIFT - CCPU_SHIFT) :
> -                                 100 * (((fixpt_t) ke->ke_cpticks) <<
> -                                 (FSHIFT - CCPU_SHIFT)) / realstathz;
> -#else
> -                             ke->ke_pctcpu += ((FSCALE - ccpu) *
> -                                 (ke->ke_cpticks * FSCALE / realstathz)) >>
> -                                 FSHIFT;
> -#endif
> -                             ke->ke_cpticks = 0;
> -                     } /* end of kse loop */
> -                     if (awake == 0) {
> -                             kg->kg_slptime++;
> -                     } else {
> -                             kg->kg_slptime = 0;
> -                     }
> -                     kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> -                     resetpriority(kg);
> -                     FOREACH_THREAD_IN_GROUP(kg, td) {
> -                             int changedqueue;
> -                             if (td->td_priority >= PUSER) {
> -                                     /*
> -                                      * Only change the priority
> -                                      * of threads that are still at their
> -                                      * user priority. 
> -                                      * XXXKSE This is problematic
> -                                      * as we may need to re-order
> -                                      * the threads on the KSEG list.
> -                                      */
> -                                     changedqueue =
> -                                         ((td->td_priority / RQ_PPQ) !=
> -                                          (kg->kg_user_pri / RQ_PPQ));
> -
> -                                     td->td_priority = kg->kg_user_pri;
> -                                     if (changedqueue &&
> -                                         td->td_state == TDS_RUNQ) {
> -                                             /* this could be optimised */
> -                                             remrunqueue(td);
> -                                             td->td_priority =
> -                                                 kg->kg_user_pri;
> -                                             setrunqueue(td);
> -                                     } else {
> -                                             td->td_priority = kg->kg_user_pri;
> -                                     }
> -                             }
> -                     }
> +                     schedcpu1(kg);
>               } /* end of ksegrp loop */
>               mtx_unlock_spin(&sched_lock);
>       } /* end of process loop */
> @@ -944,32 +832,6 @@
>  }
>  
>  /*
> - * Compute the priority of a process when running in user mode.
> - * Arrange to reschedule if the resulting priority is better
> - * than that of the current process.
> - */
> -void
> -resetpriority(kg)
> -     register struct ksegrp *kg;
> -{
> -     register unsigned int newpriority;
> -     struct thread *td;
> -
> -     mtx_lock_spin(&sched_lock);
> -     if (kg->kg_pri_class == PRI_TIMESHARE) {
> -             newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> -                 NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> -             newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> -                 PRI_MAX_TIMESHARE);
> -             kg->kg_user_pri = newpriority;
> -     }
> -     FOREACH_THREAD_IN_GROUP(kg, td) {
> -             maybe_resched(td);                      /* XXXKSE silly */
> -     }
> -     mtx_unlock_spin(&sched_lock);
> -}
> -
> -/*
>   * Compute a tenex style load average of a quantity on
>   * 1, 5 and 15 minute intervals.
>   * XXXKSE   Needs complete rewrite when correct info is available.
> @@ -1022,11 +884,9 @@
>  {
>  
>       callout_init(&schedcpu_callout, 1);
> -     callout_init(&roundrobin_callout, 0);
>       callout_init(&loadav_callout, 0);
>  
>       /* Kick off timeout driven events by calling first time. */
> -     roundrobin(NULL);
>       schedcpu(NULL);
>       loadav(NULL);
>  }
> 
> ----- End forwarded message -----
> 
> To Unsubscribe: send mail to [EMAIL PROTECTED]
> with "unsubscribe freebsd-arch" in the body of the message

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-stable" in the body of the message
