Hi,

We used to (long ago, 2.2.x), whenever we got a write request for some
buffer,
search the buffer cache to see if additional buffers which belong to that
particular stripe are dirty, and then schedule them for writing as well, in
an
attempt to write full stripes. That resulted in a huge sequential write
performance
improvement.

If such an approach is still possible today, it is preferable to delaying
the writes
for the partial buffer while hoping that the rest of the buffers in the
stripe would
come as well, since it both eliminates the additional delay, and doesn't
depend on the order in which the buffers are flushed from the much bigger
memory buffers to the smaller stripe cache.

Cheers,

Gadi

----- Original Message -----
From: "Neil Brown" <[EMAIL PROTECTED]>
To: "Linus Torvalds" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, June 20, 2001 6:51 PM
Subject: PATCH - raid5 performance improvement - 3 of 3


>
>
> Linus, and fellow RAIDers,
>
>  This is the third in my three patch series for improving RAID5
>  throughput.
>  This one substantially lifts write throughput by leveraging the
>  opportunities for write gathering provided by the first patch.
>
>  With RAID5, it is much more efficient to write a whole stripe full of
>  data at a time as this avoids the need to pre-read any old data or
>  parity from the discs.
>
>  Without this patch, when a write request arrives, raid5 will
>  immediately start a couple of pre-reads so that it will be able to
>  write that block and update the parity.
>  By the time that the old data and parity arrive it is quite possible
>  that write requests for all the other blocks in the stripe will have
>  been submitted, and the old data and parity will not be needed.
>
>  This patch uses concepts similar to queue plugging to delay write
>  requests slightly to improve the chance that many or even all of the
>  data blocks in a stripe will have outstanding write requests before
>  processing is started.
>
>  To do this it maintains a queue of stripes that seem to require
>  pre-reading.
>  Stripes are only released  from this queue when there are no other
>  pre-read requests active, and then only if the raid5 device is not
>  currently "plugged".
>
>  As I mentioned earlier, my testing shows substantial improvements
>  from these three patches for both sequential (bonnie) and random
>  (dbench) access patterns.  I would be particularly interested if
>  anyone else does any different testing, preferably comparing
>  2.2.19+patches with 2.4.5 and then with 2.4.5-plus these patches.
>
>  I know of one problem area being sequential writes to a 3 disc
>  array.  If anyone can find any other access patterns that still
>  perform below 2.2.19 levels, I would really like to know about them.
>
> NeilBrown
>
>
>
> --- ./include/linux/raid/raid5.h 2001/06/21 01:01:46 1.3
> +++ ./include/linux/raid/raid5.h 2001/06/21 01:04:05 1.4
> @@ -158,6 +158,32 @@
>  #define STRIPE_HANDLE 2
>  #define STRIPE_SYNCING 3
>  #define STRIPE_INSYNC 4
> +#define STRIPE_PREREAD_ACTIVE 5
> +#define STRIPE_DELAYED 6
> +
> +/*
> + * Plugging:
> + *
> + * To improve write throughput, we need to delay the handling of some
> + * stripes until there has been a chance that several write requests
> + * for the one stripe have all been collected.
> + * In particular, any write request that would require pre-reading
> + * is put on a "delayed" queue until there are no stripes currently
> + * in a pre-read phase.  Further, if the "delayed" queue is empty when
> + * a stripe is put on it then we "plug" the queue and do not process it
> + * until an unplug call is made. (the tq_disk list is run).
> + *
> + * When preread is initiated on a stripe, we set PREREAD_ACTIVE and add
> + * it to the count of prereading stripes.
> + * When write is initiated, or the stripe refcnt == 0 (just in case) we
> + * clear the PREREAD_ACTIVE flag and decrement the count
> + * Whenever the delayed queue is empty and the device is not plugged, we
> + * move any stripes from delayed to handle and clear the DELAYED flag and
set PREREAD_ACTIVE.
> + * In stripe_handle, if we find pre-reading is necessary, we do it if
> + * PREREAD_ACTIVE is set, else we set DELAYED which will send it to the
delayed queue.
> + * HANDLE gets cleared if stripe_handle leaves nothing locked.
> + */
> +
>
>  struct disk_info {
>   kdev_t dev;
> @@ -182,6 +208,8 @@
>   int max_nr_stripes;
>
>   struct list_head handle_list; /* stripes needing handling */
> + struct list_head delayed_list; /* stripes that have plugged requests */
> + atomic_t preread_active_stripes; /* stripes with scheduled io */
>   /*
>   * Free stripes pool
>   */
> @@ -192,6 +220,9 @@
>   * waiting for 25% to be free
>   */
>   md_spinlock_t device_lock;
> +
> + int plugged;
> + struct tq_struct plug_tq;
>  };
>
>  typedef struct raid5_private_data raid5_conf_t;
> --- ./drivers/md/raid5.c 2001/06/21 01:01:46 1.3
> +++ ./drivers/md/raid5.c 2001/06/21 01:04:05 1.4
> @@ -31,6 +31,7 @@
>   */
>
>  #define NR_STRIPES 256
> +#define IO_THRESHOLD 1
>  #define HASH_PAGES 1
>  #define HASH_PAGES_ORDER 0
>  #define NR_HASH (HASH_PAGES * PAGE_SIZE / sizeof(struct stripe_head *))
> @@ -65,11 +66,17 @@
>   BUG();
>   if (atomic_read(&conf->active_stripes)==0)
>   BUG();
> - if (test_bit(STRIPE_HANDLE, &sh->state)) {
> + if (test_bit(STRIPE_DELAYED, &sh->state))
> + list_add_tail(&sh->lru, &conf->delayed_list);
> + else if (test_bit(STRIPE_HANDLE, &sh->state)) {
>   list_add_tail(&sh->lru, &conf->handle_list);
>   md_wakeup_thread(conf->thread);
> - }
> - else {
> + } else {
> + if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> + atomic_dec(&conf->preread_active_stripes);
> + if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
> + md_wakeup_thread(conf->thread);
> + }
>   list_add_tail(&sh->lru, &conf->inactive_list);
>   atomic_dec(&conf->active_stripes);
>   if (!conf->inactive_blocked ||
> @@ -823,6 +830,7 @@
>
>   spin_lock(&sh->lock);
>   clear_bit(STRIPE_HANDLE, &sh->state);
> + clear_bit(STRIPE_DELAYED, &sh->state);
>
>   syncing = test_bit(STRIPE_SYNCING, &sh->state);
>   /* Now to look around and see what can be done */
> @@ -994,10 +1002,16 @@
>   if ((sh->bh_write[i] || i == sh->pd_idx) &&
>       !buffer_locked(bh) && !buffer_uptodate(bh) &&
>       conf->disks[i].operational) {
> - PRINTK("Read_old block %d for r-m-w\n", i);
> - set_bit(BH_Lock, &bh->b_state);
> - action[i] = READ+1;
> - locked++;
> + if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> + {
> + PRINTK("Read_old block %d for r-m-w\n", i);
> + set_bit(BH_Lock, &bh->b_state);
> + action[i] = READ+1;
> + locked++;
> + } else {
> + set_bit(STRIPE_DELAYED, &sh->state);
> + set_bit(STRIPE_HANDLE, &sh->state);
> + }
>   }
>   }
>   if (rcw <= rmw && rcw > 0)
> @@ -1007,10 +1021,16 @@
>   if (!sh->bh_write[i]  && i != sh->pd_idx &&
>       !buffer_locked(bh) && !buffer_uptodate(bh) &&
>       conf->disks[i].operational) {
> - PRINTK("Read_old block %d for Reconstruct\n", i);
> - set_bit(BH_Lock, &bh->b_state);
> - action[i] = READ+1;
> - locked++;
> + if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> + {
> + PRINTK("Read_old block %d for Reconstruct\n", i);
> + set_bit(BH_Lock, &bh->b_state);
> + action[i] = READ+1;
> + locked++;
> + } else {
> + set_bit(STRIPE_DELAYED, &sh->state);
> + set_bit(STRIPE_HANDLE, &sh->state);
> + }
>   }
>   }
>   /* now if nothing is locked, and if we have enough data, we can start a
write request */
> @@ -1027,6 +1047,11 @@
>       || (i==sh->pd_idx && failed == 0))
>   set_bit(STRIPE_INSYNC, &sh->state);
>   }
> + if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
> + atomic_dec(&conf->preread_active_stripes);
> + if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
> + md_wakeup_thread(conf->thread);
> + }
>   }
>   }
>
> @@ -1118,6 +1143,47 @@
>   }
>  }
>
> +static inline void raid5_activate_delayed(raid5_conf_t *conf)
> +{
> + if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
> + while (!list_empty(&conf->delayed_list)) {
> + struct list_head *l = conf->delayed_list.next;
> + struct stripe_head *sh;
> + sh = list_entry(l, struct stripe_head, lru);
> + list_del_init(l);
> + clear_bit(STRIPE_DELAYED, &sh->state);
> + if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> + atomic_inc(&conf->preread_active_stripes);
> + list_add_tail(&sh->lru, &conf->handle_list);
> + }
> + }
> +}
> +static void raid5_unplug_device(void *data)
> +{
> + raid5_conf_t *conf = (raid5_conf_t *)data;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&conf->device_lock, flags);
> +
> + raid5_activate_delayed(conf);
> +
> + if (conf->plugged) {
> + conf->plugged = 0;
> + md_wakeup_thread(conf->thread);
> + }
> + spin_unlock_irqrestore(&conf->device_lock, flags);
> +}
> +
> +static inline void raid5_plug_device(raid5_conf_t *conf)
> +{
> + spin_lock_irq(&conf->device_lock);
> + if (list_empty(&conf->delayed_list))
> + if (!conf->plugged) {
> + conf->plugged = 1;
> + queue_task(&conf->plug_tq, &tq_disk);
> + }
> + spin_unlock_irq(&conf->device_lock);
> +}
>
>  static int raid5_make_request (mddev_t *mddev, int rw, struct buffer_head
* bh)
>  {
> @@ -1144,6 +1210,8 @@
>   sh->pd_idx = pd_idx;
>
>   add_stripe_bh(sh, bh, dd_idx, rw);
> +
> + raid5_plug_device(conf);
>   handle_stripe(sh);
>   release_stripe(sh);
>   } else
> @@ -1223,8 +1291,19 @@
>   md_update_sb(mddev);
>   }
>   md_spin_lock_irq(&conf->device_lock);
> - while (!list_empty(&conf->handle_list)) {
> - struct list_head *first = conf->handle_list.next;
> + while (1) {
> + struct list_head *first;
> +
> + if (list_empty(&conf->handle_list) &&
> +     atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
> +     !conf->plugged &&
> +     !list_empty(&conf->delayed_list))
> + raid5_activate_delayed(conf);
> +
> + if (list_empty(&conf->handle_list))
> + break;
> +
> + first = conf->handle_list.next;
>   sh = list_entry(first, struct stripe_head, lru);
>
>   list_del_init(first);
> @@ -1303,9 +1382,16 @@
>   conf->device_lock = MD_SPIN_LOCK_UNLOCKED;
>   md_init_waitqueue_head(&conf->wait_for_stripe);
>   INIT_LIST_HEAD(&conf->handle_list);
> + INIT_LIST_HEAD(&conf->delayed_list);
>   INIT_LIST_HEAD(&conf->inactive_list);
>   atomic_set(&conf->active_stripes, 0);
> + atomic_set(&conf->preread_active_stripes, 0);
>   conf->buffer_size = PAGE_SIZE; /* good default for rebuild */
> +
> + conf->plugged = 0;
> + conf->plug_tq.sync = 0;
> + conf->plug_tq.routine = &raid5_unplug_device;
> + conf->plug_tq.data = conf;
>
>   PRINTK("raid5_run(md%d) called.\n", mdidx(mddev));
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
>

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]

Reply via email to