See previous comments on Subject & description for change log.

Lots of trailing white space.  We all know how popular that is...

On Wed, 11 May 2005 17:26:08 PDT, Chandra Seetharaman wrote:
> 
>  Documentation/ckrm/ckrm-io    |   98 ++++
>  drivers/block/Kconfig.iosched |    9 
>  drivers/block/Makefile        |    4 
>  drivers/block/ckrm-io.c       |  889 ++++++++++++++++++++++++++++++++++++++++++
>  drivers/block/ps-iosched.c    |  345 +++++++++++++---
>  include/linux/ckrm-io.h       |  134 ++++++
>  include/linux/proc_fs.h       |    1 
>  init/Kconfig                  |   13 
>  8 files changed, 1421 insertions(+), 72 deletions(-)
> 
> Signed-off-by:  Shailabh Nagar <[EMAIL PROTECTED]>
> Signed-off-by:  Chandra Seetharaman <[EMAIL PROTECTED]> 
> 
> Index: linux-2.6.12-rc3/Documentation/ckrm/ckrm-io
> ===================================================================
> --- /dev/null
> +++ linux-2.6.12-rc3/Documentation/ckrm/ckrm-io
> @@ -0,0 +1,98 @@
> +CKRM I/O controller
> +
> +Please send feedback to [EMAIL PROTECTED]
> +
> +
> +The I/O controller consists of 
> +- a new I/O scheduler called ps-iosched which is an incremental update 
> +to the cfq ioscheduler. It has enough differences with cfq to warrant a
> +separate I/O scheduler. 
> +- ckrm-io : the controller which interfaces ps-iosched with CKRM's core
> +
> +ckrm-io enforces shares at the granularity of an "epoch", currently defined as
> +1 second. The relative share of each class in rcfs is translated to an absolute
> +"sectorate" for each block device managed by ckrm-io. Sectorate is defined as
> +average number of sectors served per epoch for a class. This value is treated
> +as a hard limit - every time a class exceeds this average for *any* device, the
> +class' I/O gets deferred till the average drops back below the limit.

Okay - it looks like "sectorate" is used throughout - can it be changed
to "sector_rate"?  That would make it *much* more readable/understandable.
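While I'm at it - the epoch/hard-limit rule described above is simple enough
to state in plain C.  A userspace sketch (struct and function names are mine,
not the patch's):

```c
#include <stdint.h>

/* Model of the deferral rule: a class may serve sector_rate 512-byte
 * sectors per 1-second epoch on average; once the running average
 * exceeds that, further I/O from the class is deferred. */
struct class_rate {
	uint64_t sectors_served;	/* sectors served since start */
	uint64_t epochs;		/* completed epochs */
	uint64_t sector_rate;		/* limit: avg sectors per epoch */
};

/* Integer average of sectors served per epoch so far. */
static uint64_t avg_sectors(const struct class_rate *c)
{
	return c->epochs ? c->sectors_served / c->epochs : c->sectors_served;
}

/* Nonzero when the class's I/O should be deferred. */
static int must_defer(const struct class_rate *c)
{
	return avg_sectors(c) > c->sector_rate;
}
```

i.e. deferral kicks in as soon as the running per-epoch average crosses the
class's limit, and stops once the average drops back under it.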

> +
> +Compiling ckrm-io
> +-----------------
> +Currently, please compile it into the kernel using the config parameter
> +
> +        General Setup
> +              Class-based Kernel Resource Management --->
> +                     Disk I/O Resource Controller
> +                
> +A later version will fix the use of sched_clock() by ps-iosched.c that is
> +preventing it from being compiled as a module.
> +
> +
> +Using ckrm-io
> +-------------
> +
> +1. Boot into the kernel and mount rcfs
> +
> +# mount -t rcfs none /rcfs 
> +
> +2. Choose a device to bring under ckrm-io's control (it is recommended you
> +choose a disk not hosting your root filesystem until the controller gets tested
> +better). For device hdc, use something like
> +
> +# echo "ps" > /sys/block/hdc/queue/scheduler
> +# cat /sys/block/hdc/queue/scheduler
> +noop anticipatory deadline cfq [ps]
> +
> +
> +3. Verify rcfs root's sectorate
> +
> +# cat /rcfs/taskclass/stats
> +res=io, abs limit 10000
> +/block/hdc/queue skip .. timedout .. avsec .. rate .. sec0 .. sec1 ..
> +
> +"avsec" is the average number of sectors served for the class
> +"rate" is its current limit 
> +The rest of the numbers are of interest in debugging only.
> +
> +
> +4. Launch I/O workload(s) (dd has been used so far) in a separate terminal,
> +e.g. multiple instances of
> +
> +# time dd if=/dev/hdc of=/dev/null bs=4096 count=1000000 &
> +
> +5. Watch the "avsec" and "rate" parameters in /rcfs/taskclass (do this in a
> +separate terminal)
> +
> +# while : ; do cat /rcfs/taskclass/stats; sleep 1; done
> +
> +6a. Change the absolute sectorate for the root class
> +
> +# echo "res=io,rootsectorate=1000" > /rcfs/taskclass/config
> +# echo "1000" > /sys/block/hdc/queue/ioscheduler/max_sectorate
> +
> +6b. Verify that "rate" has changed to the new value in the terminal where
> +/rcfs/taskclass/stats is being monitored (step 5)
> +
> +
> +Or just run the I/O workload twice, with different values of sectorate and see
> +the difference in completion times.
> +
> +
> +
> +Current bugs/limitations
> +------------------------
> +
> +- only the root taskclass can be controlled. The shares for children created
> +  under /rcfs/taskclass do not change. 
> +
> +- Having two parameters to modify
> +  "rootsectorate", settable within /rcfs/taskclass/config  and 
> +  "max_sectorate", set as /sys/block/<device>/queue/ioscheduler/max_sectorate
> +
> +could be reduced to one (just the latter). 
> +
> +
> +
> +
> +
> +

The documentation above could be a single patch.  That would make it easier
to review and check in.

> Index: linux-2.6.12-rc3/drivers/block/Kconfig.iosched
> ===================================================================
> --- linux-2.6.12-rc3.orig/drivers/block/Kconfig.iosched
> +++ linux-2.6.12-rc3/drivers/block/Kconfig.iosched
> @@ -38,13 +38,4 @@ config IOSCHED_CFQ
>         among all processes in the system. It should provide a fair
>         working environment, suitable for desktop systems.
>  
> -config IOSCHED_PS
> -     tristate "Proportional share I/O scheduler"
> -     default y
> -     ---help---
> -       The PS I/O scheduler apportions disk I/O bandwidth amongst classes
> -       defined through CKRM (Class-based Kernel Resource Management). It
> -       is based on CFQ but differs in the interface used (CKRM) and 
> -       implementation of differentiated service. 
> -
>  endmenu
> Index: linux-2.6.12-rc3/drivers/block/Makefile
> ===================================================================
> --- linux-2.6.12-rc3.orig/drivers/block/Makefile
> +++ linux-2.6.12-rc3/drivers/block/Makefile
> @@ -13,13 +13,13 @@
>  # kblockd threads
>  #
>  
> -obj-y        := elevator.o ll_rw_blk.o ioctl.o genhd.o scsi_ioctl.o
> +obj-y        := elevator.o ll_rw_blk.o ioctl.o genhd.o scsi_ioctl.o 
>  
>  obj-$(CONFIG_IOSCHED_NOOP)   += noop-iosched.o
>  obj-$(CONFIG_IOSCHED_AS)     += as-iosched.o
>  obj-$(CONFIG_IOSCHED_DEADLINE)       += deadline-iosched.o
>  obj-$(CONFIG_IOSCHED_CFQ)    += cfq-iosched.o
> -obj-$(CONFIG_IOSCHED_PS)     += ps-iosched.o
> +obj-$(CONFIG_CKRM_RES_BLKIO)    += ckrm-io.o ps-iosched.o
>  obj-$(CONFIG_MAC_FLOPPY)     += swim3.o
>  obj-$(CONFIG_BLK_DEV_FD)     += floppy.o
>  obj-$(CONFIG_BLK_DEV_FD98)   += floppy98.o
> Index: linux-2.6.12-rc3/drivers/block/ckrm-io.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.12-rc3/drivers/block/ckrm-io.c
> @@ -0,0 +1,889 @@
> +/* linux/drivers/block/ckrm_io.c : Block I/O Resource Controller for CKRM
> + *
> + * Copyright (C) Shailabh Nagar, IBM Corp. 2004
> + * 
> + * 
> + * Provides best-effort block I/O bandwidth control for CKRM 
> + * This file provides the CKRM API. The underlying scheduler is the
> + * ps (proportional share) ioscheduler.
> + *
> + * Latest version, more details at http://ckrm.sf.net
> + * 
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + */
> +
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/fs.h>
> +#include <linux/parser.h>
> +#include <linux/kobject.h>
> +#include <asm/errno.h>
> +#include <asm/div64.h>
> +
> +#include <linux/ckrm_tc.h>
> +#include <linux/ckrm-io.h>
> +
> +#define CKI_UNUSED  1

Why?  And what does CKI stand for?

> +
> +/* sectorate == 512 byte sectors served in PS_EPOCH ns*/

What does this mean?  How about a little English description on this
one, like "The Sector Rate is ... which is defined as the number of
512 byte sectors (CKI_IOUSAGE_UNIT) transferred in PS_EPOCH nanoseconds."

> +
> +#define CKI_ROOTSECTORATE_DEF        100000
> +#define CKI_MINSECTORATE_DEF 100

Comment?  What are these two magic numbers?
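Something like the following would do - the bandwidth figures are my reading,
assuming the 1-second epoch from the documentation and the 512-byte
CKI_IOUSAGE_UNIT defined just below:

```c
/* Default sector rate for the root class: 100000 512-byte sectors per
 * 1-second epoch, i.e. roughly 50 MB/s of aggregate bandwidth. */
#define CKI_ROOTSECTORATE_DEF	100000

/* Floor used when a class's limit is CKRM_SHARE_DONTCARE:
 * 100 sectors (50 KB) per epoch. */
#define CKI_MINSECTORATE_DEF	100
```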

> +
> +#define CKI_IOUSAGE_UNIT     512
> +
> +
> +#if CKI_UNUSED
> +typedef struct ckrm_io_stats{
> +     struct timeval       epochstart ; /* all measurements relative to this 
> +                                          start time */
> +     unsigned long        blksz;  /* size of bandwidth unit */
> +     atomic_t             blkrd;  /* read units submitted to DD */
> +     atomic_t             blkwr; /* write units submitted to DD */
> +
> +} cki_stats_t;          /* per class I/O statistics */
> +#endif

Okay, the name is confusing - CKI_UNUSED is defined to 1, so the code it
guards *is* compiled in, even though the name says it is unused.  Something
is wrong here.

Oh, and the dreaded typedef.  These *have* to go.
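For the struct above, that just means dropping the typedef and spelling the
tag out, kernel style.  A sketch (atomic_t replaced with plain longs and the
timeval dropped so it stands alone in userspace):

```c
/* Per-class I/O statistics, no typedef: callers say
 * "struct ckrm_io_stats", which is greppable and unambiguous. */
struct ckrm_io_stats {
	unsigned long blksz;	/* size of one bandwidth unit, bytes */
	long blkrd;		/* read units submitted to the driver */
	long blkwr;		/* write units submitted to the driver */
};

static void reset_stats(struct ckrm_io_stats *stats)
{
	stats->blkrd = 0;
	stats->blkwr = 0;
}
```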

> +
> +typedef struct ckrm_io_class {
> +
> +     struct ckrm_core_class *core;
> +     struct ckrm_core_class *parent;
> +     
> +     
> +

Extra blank lines.

> +     struct ckrm_shares shares;
> +     struct rw_semaphore  sem; /* protect rate_list and cnt_*  */
> +     
> +     struct list_head  rate_list;
> +
> +     /* Absolute shares of this class
> +      * in local units. 
> +      */
> +     int cnt_guarantee; /* Allocation as parent */
> +     int cnt_unused;    /* Allocation to default subclass */
> +     int cnt_limit;
> +
> +#ifdef CKI_UNUSED
> +     /* Statistics, for class and default subclass */
> +     cki_stats_t stats; 
> +     cki_stats_t mystats;
> +#endif

This is confusing - either make it real without the ifdef's or
make it go away.  Why is there an ifdef option at all?

More typedef stuff that has to go.

> +} cki_icls_t;
> +
> +/* Internal functions */
> +static inline void cki_reset_stats(cki_stats_t *usg);
> +static inline void init_icls_one(cki_icls_t *icls);
> +static void cki_recalc_propagate(cki_icls_t *res, cki_icls_t *parres);
> +
> +/* Functions from ps_iosched */
> +extern int ps_drop_psq(struct ps_data *psd, unsigned long key);
> +
> +
> +/* CKRM Resource Controller API functions */
> +static void * cki_alloc(struct ckrm_core_class *this,
> +                     struct ckrm_core_class * parent);
> +static void cki_free(void *res);
> +static int cki_setshare(void *res, struct ckrm_shares * shares);
> +static int cki_getshare(void *res, struct ckrm_shares * shares);
> +static int cki_getstats(void *res, struct seq_file *);
> +static int cki_resetstats(void *res);
> +static int cki_showconfig(void *res, struct seq_file *sfile);
> +static int cki_setconfig(void *res, const char *cfgstr);
> +static void cki_chgcls(void *tsk, void *oldres, void *newres);
> +
> +/* Global data */
> +struct ckrm_res_ctlr cki_rcbs;
> +
> +struct cki_data ckid;
> +EXPORT_SYMBOL_GPL(ckid);
> +
> +struct ps_rate cki_def_psrate; 
> +EXPORT_SYMBOL_GPL(cki_def_psrate);
> +
> +struct rw_semaphore psdlistsem;
> +EXPORT_SYMBOL(psdlistsem);

Why not EXPORT_SYMBOL_GPL?

> +
> +LIST_HEAD(ps_psdlist);
> +EXPORT_SYMBOL(ps_psdlist);

Why not EXPORT_SYMBOL_GPL?

> +
> +
> +static struct psdrate *cki_find_rate(struct ckrm_io_class *icls,
> +                                  struct ps_data *psd)
> +{
> +     struct psdrate *prate;
> +     
> +     down_read(&icls->sem);
> +     list_for_each_entry(prate, &icls->rate_list, rate_list) {
> +             if (prate->psd == psd)
> +                     goto found;
> +     }
> +     prate = NULL;
> +found:
> +     up_read(&icls->sem);
> +     return prate;
> +}
> +
> +/* Exported functions */
> +
> +void cki_set_sectorate(cki_icls_t *icls, int sectorate)
> +{
> +     struct psdrate *prate;
> +     u64 temp;
> +     
> +     down_read(&icls->sem);
> +     list_for_each_entry(prate, &icls->rate_list, rate_list) {
> +             temp = (u64) sectorate * prate->psd->ps_max_sectorate;
> +             do_div(temp,ckid.rootsectorate);
> +             atomic_set(&prate->psrate.sectorate,temp);
> +     }       
> +     up_read(&icls->sem);
> +}
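For reference, the scaling in the loop above boils down to this: widen to
64 bits before the multiply so sectorate * ps_max_sectorate cannot overflow
32 bits, then divide by the root's sectorate (do_div in the kernel).
Userspace equivalent, function name mine:

```c
#include <stdint.h>

/* Scale a class's sectorate onto a device whose ceiling is
 * max_sectorate, relative to the root class's root_sectorate. */
static uint32_t scale_sectorate(uint32_t sectorate,
				uint32_t max_sectorate,
				uint32_t root_sectorate)
{
	/* 32x32 -> 64-bit multiply, then 64/32 divide */
	uint64_t temp = (uint64_t)sectorate * max_sectorate;
	return (uint32_t)(temp / root_sectorate);
}
```

The widening matters: 100000 * 100000 is 1e10, well past UINT32_MAX, so a
plain 32-bit multiply would silently wrap.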
> +
> +/* Reset psdrate entries in icls for all current psd's 
> + * Called after a class's absolute shares change 
> + */
> +void cki_reset_sectorate(cki_icls_t *icls)
> +{
> +     struct psdrate *prate;
> +     u64 temp;
> +     
> +     down_read(&icls->sem);
> +     list_for_each_entry(prate, &icls->rate_list, rate_list) {
> +
> +             if (icls->cnt_limit != CKRM_SHARE_DONTCARE) {
> +                     temp = (u64) icls->cnt_limit * prate->psd->ps_max_sectorate;
> +                     do_div(temp,ckid.rootsectorate);
> +             } else 
> +                     temp = prate->psd->ps_min_sectorate;
> +             atomic_set(&prate->psrate.sectorate,temp);
> +     }       
> +     up_read(&icls->sem);
> +
> +}
> +
> +struct psdrate *dbprate;
> +
> +int cki_psdrate_init(struct ckrm_io_class *icls, struct ps_data *psd)
> +{
> +     struct psdrate *prate;
> +     u64 temp;
> +
> +     prate = kmalloc(sizeof(struct psdrate),GFP_KERNEL);

Space after comma.

> +     if (!prate) 
> +             return -ENOMEM;
> +     
> +     INIT_LIST_HEAD(&prate->rate_list);
> +     prate->psd = psd;
> +     memset(&prate->psrate,0,sizeof(prate->psrate));
> +     
> +     dbprate = prate;
> +     if (icls->cnt_limit != CKRM_SHARE_DONTCARE) {
> +             temp = (u64) icls->cnt_limit * psd->ps_max_sectorate;
> +             do_div(temp,ckid.rootsectorate);
> +     } else { 
> +             temp = psd->ps_min_sectorate;
> +     }
> +     atomic_set(&prate->psrate.sectorate,temp);
> +     
> +     down_write(&icls->sem);
> +     list_add(&prate->rate_list,&icls->rate_list);

Space after comma.

> +     up_write(&icls->sem);
> +
> +     return 0;
> +}
> +
> +int cki_psdrate_del(struct ckrm_io_class *icls, struct ps_data *psd)
> +{
> +     struct psdrate *prate;
> +
> +     prate = cki_find_rate(icls, psd);
> +     if (!prate) 
> +             return 0;
> +
> +     down_write(&icls->sem);
> +     list_del(&prate->rate_list);
> +     up_write(&icls->sem);
> +
> +     kfree(prate);
> +     return 0;
> +}

All return 0 - why not make them void?

> +
> +
> +/* Create psdrate entries in icls for all current psd's */
> +void cki_rates_init(cki_icls_t *icls)
> +{
> +     struct psd_list_entry *psdl;
> +     
> +     down_read(&psdlistsem);
> +     list_for_each_entry(psdl,&ps_psdlist,psd_list) { 
> +             if (cki_psdrate_init(icls, psdl->psd)) {
> +                     printk(KERN_WARNING "%s: psdrate addition failed\n",
> +                            __FUNCTION__);
> +                     continue;
> +             }
> +     }
> +     up_read(&psdlistsem);
> +}
> +
> +/* Free all psdrate entries in icls */
> +void cki_rates_del(cki_icls_t *icls)
> +{
> +     struct psdrate *prate, *tmp;
> +     
> +     down_write(&icls->sem);
> +     list_for_each_entry_safe(prate, tmp, &icls->rate_list, rate_list) {
> +         list_del(&prate->rate_list);
> +         kfree(prate);
> +     }
> +     up_write(&icls->sem);
> +/*   
> +     down_read(&psdlistsem);
> +     list_for_each_entry(psdl,&ps_psdlist,psd_list) { 
> +             cki_psdrate_del(icls,psdl->psd);
> +     }
> +     up_read(&psdlistsem);
> +*/

Why is this commented out?  Remove it - easier to read.

> +}
> +
> +/* Called from ps-iosched.c when it initializes a new ps_data
> + *  as part of starting to manage a new device request queue 
> + */
> +
> +int cki_psd_init(struct ps_data *psd)
> +{
> +     struct ckrm_classtype *ctype = ckrm_classtypes[CKRM_CLASSTYPE_TASK_CLASS];
> +     struct ckrm_core_class *core;
> +     struct ckrm_io_class *icls;
> +     struct psdrate *prate;
> +     int ret=-ENOMEM;
> +
> +     /* Set psd's min and max sectorate from default values */
> +     psd->ps_max_sectorate = ckid.rootsectorate;
> +     psd->ps_min_sectorate = ckid.minsectorate;
> +
> +     down_read(&ckrm_class_sem);
> +     list_for_each_entry(core, &ctype->classes, clslist) {
> +             icls = ckrm_get_res_class(core, cki_rcbs.resid, cki_icls_t);
> +             if (!icls)
> +                     continue;
> +
> +             prate = cki_find_rate(icls, psd);
> +             if (prate) 
> +                     continue;
> +
> +             if (cki_psdrate_init(icls, psd)) {
> +                     printk(KERN_WARNING "%s: psdrate addition failed\n",
> +                            __FUNCTION__);
> +                     continue;
> +             }
> +     }
> +     ret = 0;
> +
> +     up_read(&ckrm_class_sem);
> +     return ret;
> +}
> +EXPORT_SYMBOL_GPL(cki_psd_init);
> +
> +/* Called whenever ps-iosched frees a ps_data 
> + *  as part of ending management of a device request queue 
> + */
> +
> +int cki_psd_del(struct ps_data *psd)
> +{
> +     struct ckrm_classtype *ctype = ckrm_classtypes[CKRM_CLASSTYPE_TASK_CLASS];
> +     struct ckrm_core_class *core;
> +     struct ckrm_io_class *icls;
> +     int ret = 0;
> +
> +     down_read(&ckrm_class_sem);
> +     list_for_each_entry(core, &ctype->classes, clslist) {
> +             icls = ckrm_get_res_class(core, cki_rcbs.resid, cki_icls_t);
> +             if (!icls)
> +                     continue;
> +
> +             if (cki_psdrate_del(icls,psd)) {
> +                     printk(KERN_WARNING "%s: psdrate deletion failed\n",
> +                            __FUNCTION__);
> +                     continue;
> +             }
> +     }
> +     up_read(&ckrm_class_sem);
> +     return ret;
> +}
> +EXPORT_SYMBOL_GPL(cki_psd_del);
> +
> +struct ps_rate *cki_tsk_psrate(struct ps_data *psd, struct task_struct *tsk)
> +{
> +     cki_icls_t *icls;
> +     struct psdrate *prate;
> +
> +     icls = ckrm_get_res_class(class_core(tsk->taskclass),
> +                               cki_rcbs.resid, cki_icls_t);
> +     if (!icls)
> +             return NULL;
> +     
> +     
> +     prate = cki_find_rate(icls,psd);
> +     if (prate)
> +         return &(prate->psrate);
> +     else
> +         return NULL;
> +}
> +EXPORT_SYMBOL_GPL(cki_tsk_psrate);                   
> +
> +/* Exported functions end */
> +
> +
> +#ifdef CKI_UNUSED
> +static inline void cki_reset_stats(cki_stats_t *stats)
> +{
> +     if (stats) {
> +             atomic_set(&stats->blkrd,0);
> +             atomic_set(&stats->blkwr,0);
> +     }
> +}
> +
> +static inline void init_icls_stats(cki_icls_t *icls)
> +{
> +     struct timeval tv;
> +
> +     do_gettimeofday(&tv);
> +     icls->stats.epochstart = icls->mystats.epochstart = tv;
> +     icls->stats.blksz = icls->mystats.blksz = CKI_IOUSAGE_UNIT;
> +     cki_reset_stats(&icls->stats);
> +     cki_reset_stats(&icls->mystats);
> +}    
> +#endif

Again, this bizarre macro.

> +
> +/* Initialize icls to default values 
> + * No other classes touched, locks not reinitialized.
> + */
> +
> +static inline void init_icls_one(cki_icls_t *icls)
> +{
> +     /* Zero initial guarantee for scalable creation of
> +        multiple classes */
> +
> +     /* Try out a new set */
> +     
> +     icls->shares.my_guarantee = CKRM_SHARE_DONTCARE;
> +     icls->shares.my_limit = CKRM_SHARE_DONTCARE;
> +     icls->shares.total_guarantee = CKRM_SHARE_DFLT_TOTAL_GUARANTEE;
> +     icls->shares.max_limit = CKRM_SHARE_DFLT_MAX_LIMIT;
> +     icls->shares.unused_guarantee = icls->shares.total_guarantee;
> +     icls->shares.cur_max_limit = 0;
> +
> +     icls->cnt_guarantee = CKRM_SHARE_DONTCARE;
> +     icls->cnt_unused = CKRM_SHARE_DONTCARE;
> +     icls->cnt_limit = CKRM_SHARE_DONTCARE;
> +
> +     INIT_LIST_HEAD(&icls->rate_list);
> +#ifdef CKI_UNUSED    
> +     init_icls_stats(icls);
> +#endif

And again?

> +}
> +
> +/* Initialize root's psd entries */
> +static void cki_createrootrate(cki_icls_t *root, int sectorate)
> +{
> +     down_write(&root->sem);
> +     root->cnt_guarantee = sectorate;
> +     root->cnt_unused = sectorate;
> +     root->cnt_limit = sectorate;
> +     up_write(&root->sem);
> +
> +     cki_rates_init(root);
> +}
> +
> +/* Called with root->share_lock held  */
> +static void cki_setrootrate(cki_icls_t *root, int sectorate)
> +{
> +     down_write(&root->sem);
> +     root->cnt_guarantee = sectorate;
> +     root->cnt_unused = sectorate;
> +     root->cnt_limit = sectorate;
> +     up_write(&root->sem);
> +
> +     cki_reset_sectorate(root);
> +}
> +
> +static void cki_put_psq(cki_icls_t *icls)
> +{
> +     struct psdrate *prate;
> +     struct ckrm_task_class *tskcls;
> +     
> +     down_read(&icls->sem);
> +     list_for_each_entry(prate, &icls->rate_list, rate_list) {
> +             tskcls = container_of(icls->core,struct ckrm_task_class, core);
> +             if (ps_drop_psq(prate->psd,(unsigned long)tskcls)) {
> +                     printk(KERN_WARNING "%s: ps_icls_free failed\n",
> +                            __FUNCTION__);
> +                     continue;
> +             }
> +     }
> +     up_read(&icls->sem);
> +}
> +
> +static void *cki_alloc(struct ckrm_core_class *core,
> +                      struct ckrm_core_class *parent)
> +{
> +     cki_icls_t *icls;
> +     
> +     icls = kmalloc(sizeof(cki_icls_t), GFP_ATOMIC);
> +     if (!icls) {
> +             printk(KERN_ERR "cki_res_alloc failed GFP_ATOMIC\n");
> +             return NULL;
> +     }
> +
> +     memset(icls, 0, sizeof(cki_icls_t));
> +     icls->core = core;
> +     icls->parent = parent;
> +     init_rwsem(&icls->sem);
> +
> +     init_icls_one(icls);
> +
> +     if (parent == NULL) 
> +             /* No need to acquire root->share_lock */
> +             cki_createrootrate(icls, ckid.rootsectorate);
> +     
> +     
> +     try_module_get(THIS_MODULE);
> +     return icls;
> +}
> +
> +static void cki_free(void *res)
> +{
> +     cki_icls_t *icls = res, *parres, *childres;
> +     struct ckrm_core_class *child = NULL;
> +     int maxlimit, resid = cki_rcbs.resid;
> +
> +     
> +     if (!res)
> +             return;
> +
> +     /* Deallocate CFQ queues */
> +
> +     /* Currently CFQ queues are deallocated when empty. Since no task 
> +      * should belong to this icls, no new requests will get added to the
> +      * CFQ queue. 
> +      * 
> +      * When CFQ switches to persistent queues, call its "put" function
> +      * so it gets deallocated after the last pending request is serviced.
> +      *
> +      */

Can delete the blank comment line

> +
> +     parres = ckrm_get_res_class(icls->parent, resid, cki_icls_t);
> +     if (!parres) {
> +             printk(KERN_ERR "cki_free: error getting "
> +                    "resclass from core \n");
> +             return;
> +     }
> +
> +     /* Update parent's shares */
> +     down_write(&parres->sem);
> +
> +     child_guarantee_changed(&parres->shares, icls->shares.my_guarantee, 0);
> +     parres->cnt_unused += icls->cnt_guarantee;
> +
> +     // run thru parent's children and get the new max_limit of the parent
> +     ckrm_lock_hier(parres->core);
> +     maxlimit = 0;
> +     while ((child = ckrm_get_next_child(parres->core, child)) != NULL) {
> +             childres = ckrm_get_res_class(child, resid, cki_icls_t);
> +             if (maxlimit < childres->shares.my_limit) {
> +                     maxlimit = childres->shares.my_limit;
> +             }
> +     }
> +     ckrm_unlock_hier(parres->core);
> +     if (parres->shares.cur_max_limit < maxlimit) {
> +             parres->shares.cur_max_limit = maxlimit;
> +     }
> +     up_write(&parres->sem);
> +
> +     /* Drop refcounts on all psq's corresponding to this class */
> +     cki_put_psq(icls);
> +     
> +     cki_rates_del(icls);
> +
> +     kfree(res);
> +     module_put(THIS_MODULE);
> +     return;
> +}
> +
> +
> +/* Recalculate absolute shares from relative
> + * Caller should hold a lock on icls
> + */
> +
> +static void cki_recalc_propagate(cki_icls_t *res, cki_icls_t *parres)
> +{
> +
> +     struct ckrm_core_class *child = NULL;
> +     cki_icls_t *childres;
> +     int resid = cki_rcbs.resid;
> +     u64 temp;
> +
> +     if (parres) {
> +             struct ckrm_shares *par = &parres->shares;
> +             struct ckrm_shares *self = &res->shares;
> +
> +
> +             if (parres->cnt_guarantee == CKRM_SHARE_DONTCARE) {
> +                     res->cnt_guarantee = CKRM_SHARE_DONTCARE;
> +             } else if (par->total_guarantee) {
> +                     temp = (u64) self->my_guarantee * 
> +                             parres->cnt_guarantee;
> +                     do_div(temp, par->total_guarantee);
> +                     res->cnt_guarantee = (int) temp;
> +             } else {
> +                     res->cnt_guarantee = 0;
> +             }
> +
> +
> +             if (parres->cnt_limit == CKRM_SHARE_DONTCARE) {
> +                     res->cnt_limit = CKRM_SHARE_DONTCARE;
> +                     cki_set_sectorate(res,ckid.minsectorate);
> +             } else {
> +                     if (par->max_limit) {
> +                             temp = (u64) self->my_limit * 
> +                                     parres->cnt_limit;
> +                             do_div(temp, par->max_limit);
> +                             res->cnt_limit = (int) temp;
> +                     } else {
> +                             res->cnt_limit = 0;
> +                     }
> +                     cki_set_sectorate(res,res->cnt_limit);
> +             }
> +             
> +             if (res->cnt_guarantee == CKRM_SHARE_DONTCARE) {
> +                     res->cnt_unused = CKRM_SHARE_DONTCARE;
> +             } else {
> +                     if (self->total_guarantee) {
> +                             temp = (u64) self->unused_guarantee * 
> +                                     res->cnt_guarantee;
> +                             do_div(temp, self->total_guarantee);
> +                             res->cnt_unused = (int) temp;
> +                     } else {
> +                             res->cnt_unused = 0;
> +                     }
> +
> +             }
> +             
> +     }
> +     // propagate to children
> +     ckrm_lock_hier(res->core);
> +     while ((child = ckrm_get_next_child(res->core,child)) != NULL){
> +             childres = ckrm_get_res_class(child, resid, 
> +                                           cki_icls_t);
> +             
> +             down_write(&childres->sem);
> +             cki_recalc_propagate(childres, res);
> +             up_write(&childres->sem);
> +     }
> +     ckrm_unlock_hier(res->core);
> +}
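The guarantee arithmetic above, pulled out on its own: a child's absolute
share is its relative my_guarantee scaled by the parent's absolute value
over the parent's total_guarantee.  A userspace sketch, with -1 standing in
for CKRM_SHARE_DONTCARE (function name mine):

```c
#include <stdint.h>

/* Translate a relative guarantee into an absolute one.
 * -1 models CKRM_SHARE_DONTCARE and propagates downward. */
static int abs_guarantee(int my_guarantee, int parent_abs, int parent_total)
{
	if (parent_abs == -1)		/* parent doesn't care */
		return -1;
	if (parent_total == 0)		/* nothing to apportion */
		return 0;
	/* widen before multiplying, as the kernel code does with do_div */
	return (int)(((int64_t)my_guarantee * parent_abs) / parent_total);
}
```

So a child holding 25 of the parent's 100 total shares, under a parent with
an absolute sectorate of 100000, ends up with 25000.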
> +
> +
> +static int cki_setshare(void *res, struct ckrm_shares *new)
> +{
> +     cki_icls_t *icls = res, *parres;
> +     struct ckrm_shares *cur, *par;
> +     int rc = -EINVAL, resid = cki_rcbs.resid;
> +
> +     if (!icls) 
> +             return rc;
> +
> +     cur = &icls->shares; 
> +     if (icls->parent) {
> +             parres =
> +                 ckrm_get_res_class(icls->parent, resid, cki_icls_t);
> +             if (!parres) {
> +                     pr_debug("cki_setshare: invalid resclass\n");
> +                     return -EINVAL;
> +             }
> +             down_write(&parres->sem);
> +             down_write(&icls->sem);
> +             par = &parres->shares;
> +     } else {
> +             down_write(&icls->sem);
> +             parres = NULL;
> +             par = NULL;
> +     }
> +
> +     rc = set_shares(new, cur, par);
> +
> +     if ((!rc) && parres) {
> +             if (parres->cnt_guarantee == CKRM_SHARE_DONTCARE) {
> +                     parres->cnt_unused = CKRM_SHARE_DONTCARE;
> +             } else if (par->total_guarantee) {
> +                     u64 temp = (u64) par->unused_guarantee * 
> +                             parres->cnt_guarantee;
> +                     do_div(temp, par->total_guarantee);
> +                     parres->cnt_unused = (int) temp;
> +             } else {
> +                     parres->cnt_unused = 0;
> +             }
> +             cki_recalc_propagate(res, parres);
> +     }
> +     up_write(&icls->sem);
> +     if (icls->parent) {
> +             up_write(&parres->sem);
> +     }
> +     return rc;
> +}
> +
> +static int cki_getshare(void *res, struct ckrm_shares * shares)
> +{
> +     cki_icls_t *icls = res;
> +
> +     if (!icls)
> +             return -EINVAL;
> +     *shares = icls->shares;
> +     return 0;
> +}
> +
> +static int cki_getstats(void *res, struct seq_file *sfile)
> +{
> +     cki_icls_t *icls = res;
> +     struct psdrate *prate;
> +     char *path;
> +             
> +
> +     if (!icls)
> +             return -EINVAL;
> +
> +     seq_printf(sfile, "res=%s, abs limit %d\n",cki_rcbs.res_name,
> +                icls->cnt_limit);
> +
> +     down_read(&icls->sem);
> +     list_for_each_entry(prate, &icls->rate_list, rate_list) {
> +             path = kobject_get_path(&prate->psd->queue->kobj, GFP_KERNEL);
> +             seq_printf(sfile,"%s skip %d timedout %d avsec %lu rate %d"
> +                        " sec0 %lu sec1 %lu\n",
> +                        path,
> +                        prate->psrate.nskip,
> +                        prate->psrate.timedout,
> +                        prate->psrate.navsec,
> +                        atomic_read(&(prate->psrate.sectorate)),
> +                        (unsigned long)prate->psrate.sec[0],
> +                        (unsigned long)prate->psrate.sec[1]);
> +             kfree(path);
> +     }
> +     up_read(&icls->sem);
> +     return 0;
> +}
> +
> +static int cki_resetstats(void *res)
> +{
> +     cki_icls_t *icls = res;
> +
> +     if (!res)
> +             return -EINVAL;
> +     
> +     init_icls_stats(icls);
> +     return 0;
> +}
> +
> +static void cki_chgcls(void *tsk, void *oldres, void *newres)
> +{
> +     /* cki_icls_t *oldicls = oldres, *newicls = newres; */
> +     
> +     /* Nothing needs to be done 
> +      * Future requests from task will go to the new class's psq
> +      * Old ones will continue to get satisfied from the original psq
> +      * 
> +      */
> +     return;
> +}
> +
> +enum iocfg_token_t {
> +     ROOTRATE, MINRATE, IOCFGERR
> +};
> +
> +/* Token matching for parsing input to this magic file */
> +static match_table_t iocfg_tokens = {
> +     {ROOTRATE, "rootsectorate=%d"},
> +     {MINRATE,"minsectorate=%d"},
> +     {IOCFGERR, NULL}
> +};
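For anyone unfamiliar with match_token(): the table above just maps
"rootsectorate=<n>" / "minsectorate=<n>" option strings to a token plus an
integer argument.  A userspace stand-in using sscanf (parse_iocfg is my
name for it):

```c
#include <stdio.h>

enum iocfg_token { ROOTRATE, MINRATE, IOCFGERR };

/* Return which option string matched, storing its integer in *val. */
static enum iocfg_token parse_iocfg(const char *opt, int *val)
{
	if (sscanf(opt, "rootsectorate=%d", val) == 1)
		return ROOTRATE;
	if (sscanf(opt, "minsectorate=%d", val) == 1)
		return MINRATE;
	return IOCFGERR;	/* no table entry matched */
}
```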
> +
> +static int cki_recalc_abs(void)
> +{
> +     struct ckrm_core_class *root;
> +     cki_icls_t *icls;
> +
> +     root = (cki_rcbs.classtype)->default_class;
> +     icls = ckrm_get_res_class(root, cki_rcbs.resid, cki_icls_t);
> +     if (!icls)
> +             return -EINVAL;
> +
> +     down_write(&icls->sem);
> +     cki_recalc_propagate(icls, NULL);
> +     up_write(&icls->sem);
> +     return 0;
> +}
> +
> +     
> +
> +
> +static int cki_showconfig(void *res, struct seq_file *sfile)
> +{
> +     cki_icls_t *icls = res;
> +     struct cki_data tmp;
> +
> +     if (!icls)
> +             return -EINVAL;
> +
> +     spin_lock(&ckid.cfglock);
> +     tmp = ckid;
> +     spin_unlock(&ckid.cfglock);
> +
> +     seq_printf(sfile, "rootsectorate = %d, minsectorate = %d\n",
> +                tmp.rootsectorate,
> +                tmp.minsectorate);
> +     return 0;
> +}
> +     
> +static int cki_setconfig(void *res, const char *cfgstr)
> +{
> +     char *p, *inpstr = cfgstr;
> +     int tmp,rc = -EINVAL;
> +     cki_icls_t *rooticls;
> +
> +
> +     if (!cfgstr)
> +             return -EINVAL;
> +     
> +     while ((p = strsep(&inpstr, ",")) != NULL) {
> +
> +             substring_t args[MAX_OPT_ARGS];
> +             int token;
> +
> +             
> +             if (!*p)
> +                     continue;
> +             
> +             token = match_token(p, iocfg_tokens, args);
> +             switch (token) {
> +
> +             case ROOTRATE: 
> +                     if (match_int(args, &tmp))
> +                             return -EINVAL;
> +
> +                     if (tmp < 0)
> +                             return -EINVAL;
> +
> +                     spin_lock(&(ckid.cfglock));
> +                     ckid.rootsectorate = tmp;
> +                     spin_unlock(&(ckid.cfglock));
> +                     
> +                     rooticls = ckrm_get_res_class(
> +                             (cki_rcbs.classtype)->default_class, 
> +                             cki_rcbs.resid, cki_icls_t);
> +
> +                     cki_setrootrate(rooticls,tmp);
> +                     /* update absolute shares treewide */
> +                     rc = cki_recalc_abs();
> +                     if (rc)
> +                             return rc;
> +                     break;
> +
> +             case MINRATE:
> +                     if (match_int(args, &tmp))
> +                             return -EINVAL;
> +
> +                     spin_lock(&(ckid.cfglock));
> +                     if (tmp <= 0 || tmp > ckid.rootsectorate) {
> +                             spin_unlock(&(ckid.cfglock));
> +                             return -EINVAL;
> +                     }
> +                     ckid.minsectorate = tmp;
> +                     spin_unlock(&(ckid.cfglock));
> +                     
> +                     /* update absolute shares treewide */
> +                     rc = cki_recalc_abs();
> +                     if (rc)
> +                             return rc;
> +                     break;
> +
> +             default:
> +                     return -EINVAL;
> +
> +             }
> +     }
> +
> +     return rc;
> +}
> +
> +
> +
> +
> +
> +struct ckrm_res_ctlr cki_rcbs = {
> +     .res_name = "io",
> +     .res_hdepth = 1,
> +     .resid = -1,
> +     .res_alloc = cki_alloc,
> +     .res_free = cki_free,
> +     .set_share_values = cki_setshare,
> +     .get_share_values = cki_getshare,
> +     .get_stats = cki_getstats,
> +     .reset_stats = cki_resetstats,
> +     .show_config = cki_showconfig,
> +     .set_config = cki_setconfig,
> +     .change_resclass = cki_chgcls,
> +};
> +
> +
> +void __exit cki_exit(void)
> +{
> +     ckrm_unregister_res_ctlr(&cki_rcbs);
> +     cki_rcbs.resid = -1;
> +     cki_rcbs.classtype = NULL; 
> +}
> +
> +int __init cki_init(void)
> +{
> +     struct ckrm_classtype *clstype;
> +     int resid = cki_rcbs.resid;
> +
> +     if (resid != -1) 
> +             return 0;
> +
> +     clstype = ckrm_find_classtype_by_name("taskclass");
> +     if (clstype == NULL) {
> +             printk(KERN_WARNING "%s: classtype<taskclass> not found\n",
> +                    __FUNCTION__);
> +             return -ENOENT;
> +     }
> +
> +     ckid.cfglock = SPIN_LOCK_UNLOCKED;
> +     ckid.rootsectorate = CKI_ROOTSECTORATE_DEF;
> +     ckid.minsectorate = CKI_MINSECTORATE_DEF;
> +
> +     atomic_set(&cki_def_psrate.sectorate,0);
> +     init_rwsem(&psdlistsem);
> +     
> +     resid = ckrm_register_res_ctlr(clstype, &cki_rcbs);
> +     if (resid == -1) 
> +             return -ENOENT;
> +
> +     cki_rcbs.classtype = clstype;
> +     return 0;
> +}
> +     
> +
> +module_init(cki_init)
> +module_exit(cki_exit)
> +
> +MODULE_AUTHOR("Shailabh Nagar <[EMAIL PROTECTED]>");
> +MODULE_DESCRIPTION("CKRM Disk I/O Resource Controller");
> +MODULE_LICENSE("GPL");
> +
> Index: linux-2.6.12-rc3/drivers/block/ps-iosched.c
> ===================================================================
> --- linux-2.6.12-rc3.orig/drivers/block/ps-iosched.c
> +++ linux-2.6.12-rc3/drivers/block/ps-iosched.c
> @@ -22,7 +22,8 @@
>  #include <linux/compiler.h>
>  #include <linux/hash.h>
>  #include <linux/rbtree.h>
> -#include <linux/mempool.h>
> +#include <linux/ckrm-io.h>
> +#include <asm/div64.h>
>  
>  static unsigned long max_elapsed_prq;
>  static unsigned long max_elapsed_dispatch;
> @@ -39,6 +40,10 @@ static int ps_fifo_rate = HZ / 8;  /* fif
>  static int ps_back_max = 16 * 1024;  /* maximum backwards seek, in KiB */
>  static int ps_back_penalty = 2;      /* penalty of a backwards seek */
>  
> +#define PS_EPOCH             1000000000
> +#define PS_HMAX_PCT          80
> +
> +
>  /*
>   * for the hash of psq inside the psd
>   */
> @@ -90,53 +95,20 @@ enum {
>       PS_KEY_TGID,
>       PS_KEY_UID,
>       PS_KEY_GID,
> +     PS_KEY_TASKCLASS,
>       PS_KEY_LAST,
>  };
>  
> -static char *ps_key_types[] = { "pgid", "tgid", "uid", "gid", NULL };
> +
> +
> +static char *ps_key_types[] = { "pgid", "tgid", "uid", "gid", "taskclass", NULL };
>  
>  static kmem_cache_t *prq_pool;
>  static kmem_cache_t *ps_pool;
>  static kmem_cache_t *ps_ioc_pool;
>  
> -struct ps_data {
> -     struct list_head rr_list;
> -     struct list_head empty_list;
> -
> -     struct hlist_head *ps_hash;
> -     struct hlist_head *prq_hash;
> -
> -     /* queues on rr_list (ie they have pending requests */
> -     unsigned int busy_queues;
> -
> -     unsigned int max_queued;
> -
> -     atomic_t ref;
> -
> -     int key_type;
> -
> -     mempool_t *prq_pool;
> -
> -     request_queue_t *queue;
> -
> -     sector_t last_sector;
> -
> -     int rq_in_driver;
> -
> -     /*
> -      * tunables, see top of file
> -      */
> -     unsigned int ps_quantum;
> -     unsigned int ps_queued;
> -     unsigned int ps_fifo_expire_r;
> -     unsigned int ps_fifo_expire_w;
> -     unsigned int ps_fifo_batch_expire;
> -     unsigned int ps_back_penalty;
> -     unsigned int ps_back_max;
> -     unsigned int find_best_prq;
> -
> -     unsigned int ps_tagged;
> -};
> +extern struct rw_semaphore psdlistsem;
> +extern struct list_head ps_psdlist;
>  
>  struct ps_queue {
>       /* reference count */
> @@ -175,6 +147,22 @@ struct ps_queue {
>       int in_flight;
>       /* number of currently allocated requests */
>       int alloc_limit[2];
> +
> +     /* limit related settings/stats */
> +     struct ps_rate *psrate; 
> +
> +     u64 epstart;            /* current epoch's starting timestamp (ns) */
> +     u64 epsector[2];        /* Total sectors dispatched in [0] previous
> +                              * and [1] current epoch
> +                              */
> +     unsigned long avsec;    /* avg sectors dispatched/epoch */
> +     int skipped;            /* queue skipped at last dispatch ? */
> +
> +     /* Per queue timer to suspend/resume queue from processing */
> +     struct timer_list timer;
> +     unsigned long wait_end;
> +     unsigned long flags;
> +     struct work_struct work;
>  };
>  
>  struct ps_rq {
> @@ -200,6 +188,7 @@ static void ps_dispatch_sort(request_que
>  static void ps_update_next_prq(struct ps_rq *);
>  static void ps_put_psd(struct ps_data *psd);
>  
> +
>  /*
>   * what the fairness is based on (ie how processes are grouped and
>   * differentiated)
> @@ -220,6 +209,8 @@ ps_hash_key(struct ps_data *psd, struct 
>                       return tsk->uid;
>               case PS_KEY_GID:
>                       return tsk->gid;
> +             case PS_KEY_TASKCLASS:
> +                     return (unsigned long) class_core(tsk->taskclass);
>       }
>  }
>  
> @@ -722,6 +713,81 @@ ps_merged_requests(request_queue_t *q, s
>       ps_remove_request(q, next);
>  }
>  
> +
> +/* Over how many ns is sectorate defined */
> +#define NS4SCALE  (100000000)
> +
> +struct ps_rq *dbprq;
> +struct ps_queue *dbpsq;
> +unsigned long dbsectorate;
> +
> +static void __ps_check_limit(struct ps_data *psd,struct ps_queue *psq, int dontskip)
> +{
> +     struct ps_rq *prq;
> +     unsigned long long ts, gap, epoch, tmp;
> +     unsigned long newavsec, sectorate;
> +
> +     prq = rb_entry_prq(rb_first(&psq->sort_list));
> +
> +     dbprq = prq;
> +     dbpsq = psq;
> +
> +     ts = sched_clock();
> +     gap = ts - psq->epstart;
> +     epoch = psd->ps_epoch;
> +
> +     sectorate = atomic_read(&psq->psrate->sectorate);
> +     dbsectorate = sectorate;
> +
> +     if ((gap >= epoch) || (gap < 0)) {
> +
> +             if (gap >= (epoch << 1)) {
> +                     psq->epsector[0] = 0;
> +                     psq->epstart = ts ; 
> +             } else {
> +                     psq->epsector[0] = psq->epsector[1];
> +                     psq->epstart += epoch;
> +             } 
> +             psq->epsector[1] = 0;
> +             gap = ts - psq->epstart;
> +
> +             tmp  = (psq->epsector[0] + prq->request->nr_sectors) * NS4SCALE;
> +             do_div(tmp,epoch+gap);
> +
> +             psq->avsec = (unsigned long)tmp;
> +             psq->skipped = 0;
> +             psq->epsector[1] += prq->request->nr_sectors;
> +             
> +             psq->psrate->navsec = psq->avsec;
> +             psq->psrate->sec[0] = psq->epsector[0];
> +             psq->psrate->sec[1] = psq->epsector[1];
> +             psq->psrate->timedout++;
> +             return;
> +     } else {
> +             
> +             tmp = (psq->epsector[0] + psq->epsector[1] + 
> +                    prq->request->nr_sectors) * NS4SCALE;
> +             do_div(tmp,epoch+gap);
> +
> +             newavsec = (unsigned long)tmp;
> +             if ((newavsec < sectorate) || dontskip) {
> +                     psq->avsec = newavsec ;
> +                     psq->skipped = 0;
> +                     psq->epsector[1] += prq->request->nr_sectors;
> +                     psq->psrate->navsec = psq->avsec;
> +                     psq->psrate->sec[1] = psq->epsector[1];
> +             } else {
> +                     psq->skipped = 1;
> +                     /* pause q's processing till avsec drops to 
> +                        ps_hmax_pct % of its value */
> +                     tmp = (epoch+gap) * (100-psd->ps_hmax_pct);
> +                     do_div(tmp,1000000*psd->ps_hmax_pct);
> +                     psq->wait_end = jiffies+msecs_to_jiffies(tmp);
> +             }
> +     }       
> +}
> +
> +
>  /*
>   * we dispatch psd->ps_quantum requests in total from the rr_list queues,
>   * this function sector sorts the selected request to minimize seeks. we start
> @@ -823,7 +889,7 @@ static int ps_dispatch_requests(request_
>       struct ps_data *psd = q->elevator->elevator_data;
>       struct ps_queue *psq;
>       struct list_head *entry, *tmp;
> -     int queued, busy_queues, first_round;
> +     int queued, busy_queues, first_round, busy_unlimited;
>  
>       if (list_empty(&psd->rr_list))
>               return 0;
> @@ -831,24 +897,36 @@ static int ps_dispatch_requests(request_
>       queued = 0;
>       first_round = 1;
>  restart:
> +     busy_unlimited = 0;
>       busy_queues = 0;
>       list_for_each_safe(entry, tmp, &psd->rr_list) {
>               psq = list_entry_psq(entry);
>  
>               BUG_ON(RB_EMPTY(&psq->sort_list));
> +             busy_queues++;
> +             
> +             if (first_round || busy_unlimited)
> +                     __ps_check_limit(psd,psq,0);
> +             else
> +                     __ps_check_limit(psd,psq,1);
>  
> -             /*
> -              * first round of queueing, only select from queues that
> -              * don't already have io in-flight
> -              */
> -             if (first_round && psq->in_flight)
> +             if (psq->skipped) {
> +                     psq->psrate->nskip++;
> +                     busy_queues--;
> +                     if (time_before(jiffies, psq->wait_end)) {
> +                             list_del(&psq->ps_list);
> +                             mod_timer(&psq->timer,psq->wait_end);
> +                     }
>                       continue;
> +             }
> +             busy_unlimited++;
>  
>               ps_dispatch_request(q, psd, psq);
>  
> -             if (!RB_EMPTY(&psq->sort_list))
> -                     busy_queues++;
> -
> +             if (RB_EMPTY(&psq->sort_list)) {
> +                     busy_unlimited--;
> +                     busy_queues--;
> +             }
>               queued++;
>       }
>  
> @@ -856,6 +934,19 @@ restart:
>               first_round = 0;
>               goto restart;
>       }
> +#if 0
> +     } else {
> +             /*
> +              * if we hit the queue limit, put the string of serviced
> +              * queues at the back of the pending list
> +              */
> +             struct list_head *prv = nxt->prev;
> +             if (prv != plist) {
> +                     list_del(plist);
> +                     list_add(plist, prv);
> +             }
> +     }
> +#endif
>  
>       return queued;
>  }
> @@ -961,6 +1052,25 @@ dispatch:
>       return NULL;
>  }
>  
> +void ps_set_sectorate(struct ckrm_core_class *core, int sectorate)
> +{
> +     struct ps_data *psd;
> +     struct ps_queue *psq;
> +     u64 temp;
> +
> +     down_read(&psdlistsem);
> +     list_for_each_entry(psd, &ps_psdlist, psdlist) {
> +             psq = ps_find_ps_hash(psd,(unsigned int)core);
> +             
> +             temp = (u64) sectorate * psd->ps_max_sectorate;
> +             do_div(temp,ckid.rootsectorate);
> +
> +             atomic_set(&psq->psrate->sectorate, temp);
> +     }
> +     up_read(&psdlistsem);
> +}
> +
> +
>  /*
>   * task holds one reference to the queue, dropped when task exits. each prq
>   * in-flight on this queue also holds a reference, dropped when prq is freed.
> @@ -1186,6 +1296,29 @@ err:
>       return NULL;
>  }
>  
> +
> +static void ps_pauseq_timer(unsigned long data)
> +{
> +     struct ps_queue *psq = (struct ps_queue *) data;
> +     kblockd_schedule_work(&psq->work);
> +}
> +
> +static void ps_pauseq_work(void *data)
> +{
> +     struct ps_queue *psq = (struct ps_queue *) data;
> +     struct ps_data *psd = psq->psd;
> +     request_queue_t *q = psd->queue;
> +     unsigned long flags;
> +     
> +     spin_lock_irqsave(q->queue_lock, flags);
> +     list_add_tail(&psq->ps_list,&psd->rr_list);
> +     psq->skipped = 0;
> +     if (ps_next_request(q))
> +             q->request_fn(q);
> +     spin_unlock_irqrestore(q->queue_lock, flags);
> +}    
> +
> +
>  static struct ps_queue *
>  __ps_get_queue(struct ps_data *psd, unsigned long key, int gfp_mask)
>  {
> @@ -1215,9 +1348,25 @@ retry:
>               INIT_LIST_HEAD(&psq->fifo[0]);
>               INIT_LIST_HEAD(&psq->fifo[1]);
>  
> +             psq->psrate = cki_tsk_psrate(psd,current);
> +             if (!psq->psrate) {
> +                 printk(KERN_WARNING "%s: psrate not found\n",__FUNCTION__);
> +                 psq->psrate = &cki_def_psrate;
> +             }
> +
> +             psq->epstart = sched_clock();
> +             init_timer(&psq->timer);
> +             psq->timer.function = ps_pauseq_timer;
> +             psq->timer.data = (unsigned long) psq;
> +             INIT_WORK(&psq->work, ps_pauseq_work, psq); 
> +
> +
>               psq->key = key;
>               hlist_add_head(&psq->ps_hash, &psd->ps_hash[hashval]);
> -             atomic_set(&psq->ref, 0);
> +             /* Refcount set to one to account for the CKRM class 
> +              *  corresponding to this queue. 
> +              */
> +             atomic_set(&psq->ref, 1);
>               psq->psd = psd;
>               atomic_inc(&psd->ref);
>               psq->key_type = psd->key_type;
> @@ -1227,6 +1376,7 @@ retry:
>       if (new_psq)
>               kmem_cache_free(ps_pool, new_psq);
>  
> +     /* incr ref count for each request using the psq */
>       atomic_inc(&psq->ref);
>  out:
>       WARN_ON((gfp_mask & __GFP_WAIT) && !psq);
> @@ -1472,6 +1622,7 @@ out_lock:
>       return 1;
>  }
>  
> +
>  static void ps_put_psd(struct ps_data *psd)
>  {
>       request_queue_t *q = psd->queue;
> @@ -1479,6 +1630,7 @@ static void ps_put_psd(struct ps_data *p
>       if (!atomic_dec_and_test(&psd->ref))
>               return;
>  
> +     cki_psd_del(psd);
>       blk_put_queue(q);
>  
>       mempool_destroy(psd->prq_pool);
> @@ -1495,27 +1647,42 @@ static void ps_exit_queue(elevator_t *e)
>  static int ps_init_queue(request_queue_t *q, elevator_t *e)
>  {
>       struct ps_data *psd;
> -     int i;
> +     struct psd_list_entry *psdl;
> +     int i,rc;
>  
>       psd = kmalloc(sizeof(*psd), GFP_KERNEL);
>       if (!psd)
>               return -ENOMEM;
>  
> +     psdl = kmalloc(sizeof(*psdl), GFP_KERNEL);
> +     if (!psdl)
> +             goto out_psd;
> +     INIT_LIST_HEAD(&psdl->psd_list);
> +     psdl->psd = psd;
> +
>       memset(psd, 0, sizeof(*psd));
>       INIT_LIST_HEAD(&psd->rr_list);
>       INIT_LIST_HEAD(&psd->empty_list);
>  
> -     psd->prq_hash = kmalloc(sizeof(struct hlist_head) * PS_MHASH_ENTRIES, GFP_KERNEL);
> +     rc = cki_psd_init(psd);
> +     if (rc)
> +             goto out_psdl;
> +
> +
> +     psd->prq_hash = kmalloc(sizeof(struct hlist_head) * PS_MHASH_ENTRIES, 
> +                             GFP_KERNEL);
>       if (!psd->prq_hash)
> -             goto out_prqhash;
> +             goto out_psdl;
>  
> -     psd->ps_hash = kmalloc(sizeof(struct hlist_head) * PS_QHASH_ENTRIES, GFP_KERNEL);
> +     psd->ps_hash = kmalloc(sizeof(struct hlist_head) * PS_QHASH_ENTRIES, 
> +                            GFP_KERNEL);
>       if (!psd->ps_hash)
> -             goto out_pshash;
> +             goto out_prqhash;
>  
> -     psd->prq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, prq_pool);
> +     psd->prq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, 
> +                                    mempool_free_slab, prq_pool);
>       if (!psd->prq_pool)
> -             goto out_prqpool;
> +             goto out_pshash;
>  
>       for (i = 0; i < PS_MHASH_ENTRIES; i++)
>               INIT_HLIST_HEAD(&psd->prq_hash[i]);
> @@ -1527,6 +1694,10 @@ static int ps_init_queue(request_queue_t
>       psd->queue = q;
>       atomic_inc(&q->refcnt);
>  
> +     down_write(&psdlistsem);
> +     list_add(&psdl->psd_list,&ps_psdlist);
> +     up_write(&psdlistsem);
> +
>       /*
>        * just set it to some high value, we want anyone to be able to queue
>        * some requests. fairness is handled differently
> @@ -1546,12 +1717,18 @@ static int ps_init_queue(request_queue_t
>       psd->ps_back_max = ps_back_max;
>       psd->ps_back_penalty = ps_back_penalty;
>  
> +     psd->ps_epoch = PS_EPOCH;
> +     psd->ps_hmax_pct = PS_HMAX_PCT;
> +
> +
>       return 0;
> -out_prqpool:
> -     kfree(psd->ps_hash);
>  out_pshash:
> -     kfree(psd->prq_hash);
> +     kfree(psd->ps_hash);
>  out_prqhash:
> +     kfree(psd->prq_hash);
> +out_psdl:
> +     kfree(psdl);
> +out_psd:
>       kfree(psd);
>       return -ENOMEM;
>  }
> @@ -1589,6 +1766,17 @@ fail:
>       return -ENOMEM;
>  }
>  
> +/* Exported functions */
> +int ps_drop_psq(struct ps_data *psd, unsigned long key)
> +{
> +     struct ps_queue *psq = ps_find_ps_hash(psd, key);
> +     if (!psq)
> +             return -1;
> +
> +     ps_put_queue(psq);
> +     return 0;
> +}
> +EXPORT_SYMBOL(ps_drop_psq);
>  
>  /*
>   * sysfs parts below -->
> @@ -1633,6 +1821,8 @@ ps_set_key_type(struct ps_data *psd, con
>               psd->key_type = PS_KEY_UID;
>       else if (!strncmp(page, "gid", 3))
>               psd->key_type = PS_KEY_GID;
> +     else if (!strncmp(page, "taskclass", 3))
> +             psd->key_type = PS_KEY_TASKCLASS;
>       spin_unlock_irq(psd->queue->queue_lock);
>       return count;
>  }
> @@ -1654,7 +1844,7 @@ ps_read_key_type(struct ps_data *psd, ch
>  }
>  
>  #define SHOW_FUNCTION(__FUNC, __VAR, __CONV)                         \
> -static ssize_t __FUNC(struct ps_data *psd, char *page)               \
> +static ssize_t __FUNC(struct ps_data *psd, char *page)                       \
>  {                                                                    \
>       unsigned int __data = __VAR;                                    \
>       if (__CONV)                                                     \
> @@ -1669,6 +1859,10 @@ SHOW_FUNCTION(ps_fifo_batch_expire_show,
>  SHOW_FUNCTION(ps_find_best_show, psd->find_best_prq, 0);
>  SHOW_FUNCTION(ps_back_max_show, psd->ps_back_max, 0);
>  SHOW_FUNCTION(ps_back_penalty_show, psd->ps_back_penalty, 0);
> +SHOW_FUNCTION(ps_epoch_show, psd->ps_epoch,0);
> +SHOW_FUNCTION(ps_hmax_pct_show, psd->ps_hmax_pct,0);
> +SHOW_FUNCTION(ps_max_sectorate_show, psd->ps_max_sectorate,0);
> +SHOW_FUNCTION(ps_min_sectorate_show, psd->ps_min_sectorate,0);
>  #undef SHOW_FUNCTION
>  
>  #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)                      \
> @@ -1694,6 +1888,10 @@ STORE_FUNCTION(ps_fifo_batch_expire_stor
>  STORE_FUNCTION(ps_find_best_store, &psd->find_best_prq, 0, 1, 0);
>  STORE_FUNCTION(ps_back_max_store, &psd->ps_back_max, 0, UINT_MAX, 0);
>  STORE_FUNCTION(ps_back_penalty_store, &psd->ps_back_penalty, 1, UINT_MAX, 0);
> +STORE_FUNCTION(ps_epoch_store, &psd->ps_epoch, 0, INT_MAX,0);
> +STORE_FUNCTION(ps_hmax_pct_store, &psd->ps_hmax_pct, 1, 100,0);
> +STORE_FUNCTION(ps_max_sectorate_store, &psd->ps_max_sectorate, 0, INT_MAX,0);
> +STORE_FUNCTION(ps_min_sectorate_store, &psd->ps_min_sectorate, 0, INT_MAX,0);
>  #undef STORE_FUNCTION
>  
>  static struct ps_fs_entry ps_quantum_entry = {
> @@ -1745,6 +1943,27 @@ static struct ps_fs_entry ps_key_type_en
>       .show = ps_read_key_type,
>       .store = ps_set_key_type,
>  };
> +static struct ps_fs_entry ps_epoch_entry = {
> +     .attr = {.name = "epoch", .mode = S_IRUGO | S_IWUSR },
> +     .show = ps_epoch_show,
> +     .store = ps_epoch_store,
> +};
> +static struct ps_fs_entry ps_hmax_pct_entry = {
> +     .attr = {.name = "hmaxpct", .mode = S_IRUGO | S_IWUSR },
> +     .show = ps_hmax_pct_show,
> +     .store = ps_hmax_pct_store,
> +};
> +static struct ps_fs_entry ps_max_sectorate_entry = {
> +     .attr = {.name = "max_sectorate", .mode = S_IRUGO | S_IWUSR },
> +     .show = ps_max_sectorate_show,
> +     .store = ps_max_sectorate_store,
> +};
> +static struct ps_fs_entry ps_min_sectorate_entry = {
> +     .attr = {.name = "min_sectorate", .mode = S_IRUGO | S_IWUSR },
> +     .show = ps_min_sectorate_show,
> +     .store = ps_min_sectorate_store,
> +};
> +
>  
>  static struct attribute *default_attrs[] = {
>       &ps_quantum_entry.attr,
> @@ -1757,6 +1976,10 @@ static struct attribute *default_attrs[]
>       &ps_back_max_entry.attr,
>       &ps_back_penalty_entry.attr,
>       &ps_clear_elapsed_entry.attr,
> +     &ps_epoch_entry.attr,
> +     &ps_hmax_pct_entry.attr,
> +     &ps_max_sectorate_entry.attr,
> +     &ps_min_sectorate_entry.attr,
>       NULL,
>  };
>  
> Index: linux-2.6.12-rc3/include/linux/ckrm-io.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.12-rc3/include/linux/ckrm-io.h
> @@ -0,0 +1,134 @@
> +#ifndef _LINUX_CKRM_IO_H
> +#define _LINUX_CKRM_IO_H
> +
> +
> +#include <linux/fs.h>
> +#include <linux/blkdev.h>
> +#include <linux/mempool.h>
> +#include <linux/ckrm_rc.h>
> +#include <linux/ckrm_tc.h>
> +
> +
> +/* root's default sectorate value which
> + * also serves as base for absolute shares.
> + * Configurable through taskclass' config file. 
> + */
> +struct cki_data {
> +     /* Protects both */
> +     spinlock_t cfglock; 
> +     /* root's absolute shares serve as base for other classes */
> +     int rootsectorate;
> +     /* absolute share assigned when relative share is "don't care" */ 
> +     int minsectorate;
> +};
> +
> +
> +struct ps_data {
> +     struct list_head rr_list;
> +     struct list_head empty_list;
> +
> +     struct hlist_head *ps_hash;
> +     struct hlist_head *prq_hash;
> +
> +     struct list_head psdlist;
> +
> +
> +
> +     /* queues on rr_list (ie they have pending requests */
> +     unsigned int busy_queues;
> +
> +     unsigned int max_queued;
> +
> +     atomic_t ref;
> +
> +     int key_type;
> +
> +     mempool_t *prq_pool;
> +
> +     request_queue_t *queue;
> +
> +     sector_t last_sector;
> +
> +     int rq_in_driver;
> +
> +     /*
> +      * tunables, see top of file
> +      */
> +     unsigned int ps_quantum;
> +     unsigned int ps_queued;
> +     unsigned int ps_fifo_expire_r;
> +     unsigned int ps_fifo_expire_w;
> +     unsigned int ps_fifo_batch_expire;
> +     unsigned int ps_back_penalty;
> +     unsigned int ps_back_max;
> +     unsigned int find_best_prq;
> +
> +     unsigned int ps_tagged;
> +
> +     /* duration over which sectorates enforced */
> +     unsigned int ps_epoch;
> +     /* low-water mark (%) for resuming service of overshare ps_queues */
> +     unsigned int ps_hmax_pct;
> +     /* total sectors that queue can sustain */
> +     unsigned int ps_max_sectorate; 
> +     /* absolute sectorate when share is a "dontcare" */
> +     unsigned int ps_min_sectorate;
> +
> +};
> +
> +/* For linking all psd's of ps-iosched */
> +struct psd_list_entry {
> +     struct list_head psd_list;
> +     struct ps_data *psd;
> +};
> +
> +/* Data for regulating sectors served */
> +struct ps_rate {
> +     int nskip;
> +     unsigned long navsec;
> +     int timedout;
> +     atomic_t sectorate;
> +     u64 sec[2];
> +};
> +
> +/* To maintain psrate data structs for each
> +   request queue managed by ps-iosched */
> +
> +struct psdrate {
> +    struct list_head rate_list;
> +    struct ps_data *psd;
> +    struct ps_rate psrate;
> +};
> +
> +extern struct ckrm_res_ctlr cki_rcbs;
> +extern struct cki_data ckid;
> +extern struct ps_rate cki_def_psrate; 
> +
> +extern struct rw_semaphore psdlistsem;
> +extern struct list_head ps_psdlist;
> +
> +
> +
> +int cki_psd_init(struct ps_data *);
> +int cki_psd_del(struct ps_data *);
> +struct ps_rate *cki_tsk_psrate(struct ps_data *, struct task_struct *); 
> +
> +
> +
> +#if 0
> +typedef void *(*icls_tsk_t) (struct task_struct *tsk);
> +typedef int (*icls_ioprio_t) (struct task_struct *tsk);
> +
> +
> +#ifdef CONFIG_CKRM_RES_BLKIO
> +
> +extern void *cki_tsk_icls (struct task_struct *tsk);
> +extern int cki_tsk_ioprio (struct task_struct *tsk);
> +extern void *cki_tsk_cfqpriv (struct task_struct *tsk);
> +
> +#endif /* CONFIG_CKRM_RES_BLKIO */
> +
> +#endif

Why an #if 0 block again?

> +
> +
> +#endif 
> Index: linux-2.6.12-rc3/include/linux/proc_fs.h
> ===================================================================
> --- linux-2.6.12-rc3.orig/include/linux/proc_fs.h
> +++ linux-2.6.12-rc3/include/linux/proc_fs.h
> @@ -93,6 +93,7 @@ struct dentry *proc_pid_lookup(struct in
>  struct dentry *proc_pid_unhash(struct task_struct *p);
>  void proc_pid_flush(struct dentry *proc_dentry);
>  int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir);
> +int proc_pid_delay(struct task_struct *task, char * buffer);
>  unsigned long task_vsize(struct mm_struct *);
>  int task_statm(struct mm_struct *, int *, int *, int *, int *);
>  char *task_mem(struct mm_struct *, char *);
> Index: linux-2.6.12-rc3/init/Kconfig
> ===================================================================
> --- linux-2.6.12-rc3.orig/init/Kconfig
> +++ linux-2.6.12-rc3/init/Kconfig
> @@ -182,6 +182,19 @@ config CKRM_TYPE_TASKCLASS
>       
>         Say Y if unsure
>  
> +config CKRM_RES_BLKIO
> +     tristate " Disk I/O Resource Controller"
> +     depends on CKRM_TYPE_TASKCLASS && IOSCHED_CFQ
> +     default m
> +     help
> +       Provides a resource controller for best-effort block I/O 
> +       bandwidth control. The controller attempts this by proportional 
> +       servicing of requests in the I/O scheduler. However, seek
> +       optimizations and reordering by device drivers/disk controllers may
> +       alter the actual bandwidth delivered to a class.
> +     
> +       Say N if unsure, Y to use the feature.
> +
>  config CKRM_TYPE_SOCKETCLASS
>       bool "Class Manager for socket groups"
>       depends on CKRM && RCFS_FS

Again, it would be nice if there were some way to break this up into
a couple of patches - the context and flow would be easier to review.

gerrit

