On 4/2/07, Dan Williams <[EMAIL PROTECTED]> wrote:
On 3/30/07, Raz Ben-Jehuda(caro) <[EMAIL PROTECTED]> wrote:
> Please see below.
>
> On 8/28/06, Neil Brown <[EMAIL PROTECTED]> wrote:
> > On Sunday August 13, [EMAIL PROTECTED] wrote:
> > > well ... me again
> > >
> > > Following your advice....
> > >
> > > I added a deadline for every WRITE stripe head when it is created.
> > > In raid5_activate_delayed I check whether the deadline has expired,
> > > and if not, I set the sh to preread-active mode.
> > >
> > > This small fix (and in a few other places in the code) reduced the
> > > amount of reads to zero with dd, but with no improvement in
> > > throughput. But with random access to the raid (buffers aligned to
> > > the stripe width, and with the size of the stripe width) there is
> > > an improvement of at least 20%.
> > >
> > > Problem is that a user must know what he is doing, else there would
> > > be a reduction in performance if the deadline is too long (say 100 ms).
> >
> > So if I understand you correctly, you are delaying write requests to
> > partial stripes slightly (your 'deadline') and this is sometimes
> > giving you a 20% improvement ?
> >
> > I'm not surprised that you could get some improvement. 20% is quite
> > surprising. It would be worth following through with this to make
> > that improvement generally available.
> >
> > As you say, picking a time in milliseconds is very error prone. We
> > really need to come up with something more natural.
> > I had hoped that the 'unplug' infrastructure would provide the right
> > thing, but apparently not. Maybe unplug is just being called too
> > often.
> >
> > I'll see if I can duplicate this myself and find out what is really
> > going on.
> >
> > Thanks for the report.
> >
> > NeilBrown
> >
>
> Hello Neil. I am sorry for the long interval; I was abruptly assigned
> to a different project.
>
> 1.
> I took another look at the raid5 delay patch I wrote a while ago. I
> ported it to 2.6.17 and tested it. It appears to work, and when used
> correctly it eliminates the read penalty.
>
> 2. Benchmarks.
> Configuration:
> I am testing a raid5 of 3 disks with a 1MB chunk size. IOs are
> synchronous and non-buffered (O_DIRECT), 2 MB in size, and always
> aligned to the beginning of a stripe. The kernel is 2.6.17. The
> stripe_deadline was set to 10ms.
>
> Attached is the simple_write code.
>
> Command:
> simple_write /dev/md1 2048 0 1000
> simple_write writes raw (O_DIRECT), sequentially starting from offset
> zero, 2048 kilobytes at a time, 1000 times.
>
> Benchmark before the patch:
>
> Device:       tps    Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda 1848.00 8384.00 50992.00 8384 50992
> sdb 1995.00 12424.00 51008.00 12424 51008
> sdc 1698.00 8160.00 51000.00 8160 51000
> sdd 0.00 0.00 0.00 0 0
> md0 0.00 0.00 0.00 0 0
> md1 450.00 0.00 102400.00 0 102400
>
>
> Benchmark after the patch:
>
> Device:       tps    Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda 389.11 0.00 128530.69 0 129816
> sdb 381.19 0.00 129354.46 0 130648
> sdc 383.17 0.00 128530.69 0 129816
> sdd 0.00 0.00 0.00 0 0
> md0 0.00 0.00 0.00 0 0
> md1 1140.59 0.00 259548.51 0 262144
>
> As one can see, no additional reads were done. One can actually
> calculate the raid's utilization: (n-1)/n * (single-disk throughput
> with 1M writes).
>
>
> 3. The patch code.
> The kernel tested above was 2.6.17; the patch is against 2.6.20.2,
> because I noticed big code differences between .17 and .20.x. This
> patch was not tested on 2.6.20.2, but it is essentially the same. I
> have not tested (yet) degraded mode or any other less-common paths.
>
This is along the same lines as what I am working on (new cache
policies for raid5/6), so I want to give it a try as well.
Unfortunately gmail has mangled your patch. Can you resend it as an
attachment?
patch: **** malformed patch at line 10:
(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
Thanks,
Dan
Hello Dan.
Attached are the patches. I have also added another test unit: random_writev.
It is not much code, but it does the job. It tests writing a vector,
and shows the same results as writing with a single buffer.
What are the new cache policies?
Please note!
I have not indented the patch nor written the instructions according to
the SubmittingPatches document. If Neil approves this patch or parts of
it, I will do so.
# Benchmark 3: Testing an 8-disk raid5.
A Tyan NUMA dual-CPU (AMD) machine with 8 Maxtor SATA disks; the
controller is a Promise in JBOD mode.
raid conf:
md1 : active raid5 sda2[0] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3]
sdc1[2] sdb2[1]
3404964864 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
In order to achieve zero reads I had to tune the deadline to 20ms (so
long?). stripe_cache_size is 256, which is exactly what is needed to
perform a full-stripe hit with this configuration.
Command: random_writev /dev/md1 7168 0 3000 10000
iostat snapshot:
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 21.00 29.00 50.00
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
hda 0.00 0.00 0.00 0 0
md0 0.00 0.00 0.00 0 0
sda 234.34 0.00 50400.00 0 49896
sdb 235.35 0.00 50658.59 0 50152
sdc 242.42 0.00 51014.14 0 50504
sdd 246.46 0.00 50755.56 0 50248
sde 248.48 0.00 51272.73 0 50760
sdf 245.45 0.00 50755.56 0 50248
sdg 244.44 0.00 50755.56 0 50248
sdh 245.45 0.00 50755.56 0 50248
md1 1407.07 0.00 347741.41 0 344264
Try setting stripe_cache_size to 255 and you will notice the delay.
Try lowering stripe_deadline and you will notice how the amount of
reads grows.
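For reference, both knobs are md sysfs attributes on the array; with the patch applied, stripe_deadline should appear next to stripe_cache_size (a sketch; the exact path assumes the usual md sysfs layout):

```shell
# deadline in milliseconds for delayed write stripes (added by the patch)
echo 20 > /sys/block/md1/md/stripe_deadline
cat /sys/block/md1/md/stripe_deadline

# stripe cache size, already present in mainline
echo 256 > /sys/block/md1/md/stripe_cache_size
```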
Cheers
--
Raz
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/drivers/md/raid5.c linux-2.6.20.2-raid/drivers/md/raid5.c
--- linux-2.6.20.2/drivers/md/raid5.c 2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/drivers/md/raid5.c 2007-03-30 12:37:55.000000000 +0300
@@ -65,6 +65,7 @@
#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
#define HASH_MASK (NR_HASH - 1)
+
#define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
/* bio's attached to a stripe+device for I/O are linked together in bi_sector
@@ -234,6 +235,8 @@
sh->sector = sector;
sh->pd_idx = pd_idx;
sh->state = 0;
+ sh->active_preread_jiffies =
+ msecs_to_jiffies( atomic_read(&conf->deadline_ms) )+ jiffies;
sh->disks = disks;
@@ -628,6 +631,7 @@
clear_bit(R5_LOCKED, &sh->dev[i].flags);
set_bit(STRIPE_HANDLE, &sh->state);
+ sh->active_preread_jiffies = jiffies;
release_stripe(sh);
return 0;
}
@@ -1255,8 +1259,11 @@
bip = &sh->dev[dd_idx].towrite;
if (*bip == NULL && sh->dev[dd_idx].written == NULL)
firstwrite = 1;
- } else
+ } else{
bip = &sh->dev[dd_idx].toread;
+ sh->active_preread_jiffies = jiffies;
+ }
+
while (*bip && (*bip)->bi_sector < bi->bi_sector) {
if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
goto overlap;
@@ -2437,13 +2444,27 @@
-static void raid5_activate_delayed(raid5_conf_t *conf)
+static struct stripe_head* raid5_activate_delayed(raid5_conf_t *conf)
{
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
while (!list_empty(&conf->delayed_list)) {
struct list_head *l = conf->delayed_list.next;
struct stripe_head *sh;
sh = list_entry(l, struct stripe_head, lru);
+
+ if( time_before(jiffies,sh->active_preread_jiffies) ){
+ PRINTK("deadline : no expire sec=%lld %8u %8u\n",
+ (unsigned long long) sh->sector,
+ jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ return sh;
+ }
+ else{
+ PRINTK("deadline: expire:sec=%lld %8u %8u\n",
+ (unsigned long long)sh->sector,
+ jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ }
list_del_init(l);
clear_bit(STRIPE_DELAYED, &sh->state);
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -2451,6 +2472,7 @@
list_add_tail(&sh->lru, &conf->handle_list);
}
}
+ return NULL;
}
static void activate_bit_delay(raid5_conf_t *conf)
@@ -3191,7 +3213,7 @@
*/
static void raid5d (mddev_t *mddev)
{
- struct stripe_head *sh;
+ struct stripe_head *sh,*delayed_sh=NULL;
raid5_conf_t *conf = mddev_to_conf(mddev);
int handled;
@@ -3218,8 +3240,10 @@
atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
!blk_queue_plugged(mddev->queue) &&
!list_empty(&conf->delayed_list))
- raid5_activate_delayed(conf);
-
+ delayed_sh=raid5_activate_delayed(conf);
+
+ if(delayed_sh) break;
+
while ((bio = remove_bio_from_retry(conf))) {
int ok;
spin_unlock_irq(&conf->device_lock);
@@ -3254,9 +3278,51 @@
unplug_slaves(mddev);
PRINTK("--- raid5d inactive\n");
+ if (delayed_sh){
+ long wakeup=delayed_sh->active_preread_jiffies-jiffies;
+ PRINTK("--- raid5d inactive sleep for %d\n",
+ jiffies_to_msecs(wakeup) );
+ if (wakeup>0)
+ mddev->thread->timeout = wakeup;
+ }
+}
+
+static ssize_t
+raid5_show_stripe_deadline(mddev_t *mddev, char *page)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ if (conf)
+ return sprintf(page, "%d\n", atomic_read(&conf->deadline_ms));
+ else
+ return 0;
}
static ssize_t
+raid5_store_stripe_deadline(mddev_t *mddev, const char *page, size_t len)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ char *end;
+ int new;
+ if (len >= PAGE_SIZE)
+ return -EINVAL;
+ if (!conf)
+ return -ENODEV;
+ new = simple_strtoul(page, &end, 10);
+ if (!*page || (*end && *end != '\n') )
+ return -EINVAL;
+ if (new < 0 || new > 10000)
+ return -EINVAL;
+ atomic_set(&conf->deadline_ms,new);
+ return len;
+}
+
+static struct md_sysfs_entry
+raid5_stripe_deadline = __ATTR(stripe_deadline, S_IRUGO | S_IWUSR,
+ raid5_show_stripe_deadline,
+ raid5_store_stripe_deadline);
+
+
+static ssize_t
raid5_show_stripe_cache_size(mddev_t *mddev, char *page)
{
raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -3297,6 +3363,9 @@
return len;
}
+
+
+
static struct md_sysfs_entry
raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR,
raid5_show_stripe_cache_size,
@@ -3318,8 +3387,10 @@
static struct attribute *raid5_attrs[] = {
&raid5_stripecache_size.attr,
&raid5_stripecache_active.attr,
+ &raid5_stripe_deadline.attr,
NULL,
};
+
static struct attribute_group raid5_attrs_group = {
.name = NULL,
.attrs = raid5_attrs,
@@ -3567,6 +3638,8 @@
blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
+ atomic_set(&conf->deadline_ms,0);
+
return 0;
abort:
if (conf) {
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/include/linux/raid/raid5.h linux-2.6.20.2-raid/include/linux/raid/raid5.h
--- linux-2.6.20.2/include/linux/raid/raid5.h 2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/include/linux/raid/raid5.h 2007-03-30 00:25:38.000000000 +0200
@@ -136,6 +136,7 @@
spinlock_t lock;
int bm_seq; /* sequence number for bitmap flushes */
int disks; /* disks in stripe */
+ unsigned long active_preread_jiffies;
struct r5dev {
struct bio req;
struct bio_vec vec;
@@ -254,6 +255,7 @@
* Free stripes pool
*/
atomic_t active_stripes;
+ atomic_t deadline_ms;
struct list_head inactive_list;
wait_queue_head_t wait_for_stripe;
wait_queue_head_t wait_for_overlap;
/* random_writev: repeatedly write a vector of 1 MiB buffers to random,
 * write-size-aligned offsets on a device, using O_DIRECT writev(). */
#define _LARGEFILE64_SOURCE

#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cerrno>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <unistd.h>

using namespace std;

#define MAX_VECS 10

int main(int argc, char *argv[])
{
	if (argc < 6) {
		cout << "usage: <device name> <size to write in kb> <offset in kb> <diskSizeGB> <loops>" << endl;
		return 0;
	}
	char *dev_name = argv[1];
	int fd = open(dev_name, O_LARGEFILE | O_DIRECT | O_WRONLY);
	if (fd < 0) {
		perror("open");
		return -1;
	}
	long long write_sz_bytes  = ((long long)atoi(argv[2])) << 10;
	long long offset_sz_bytes = ((long long)atoi(argv[3])) << 10;
	long long diskSizeBytes   = ((long long)atoi(argv[4])) << 30;
	int loops = atoi(argv[5]);

	/* build the vector of 1 MiB page-aligned buffers
	 * (O_DIRECT requires aligned buffers, hence valloc) */
	struct iovec vec[MAX_VECS];
	int blocks = write_sz_bytes >> 20;
	if (blocks > MAX_VECS) {
		fprintf(stderr, "write size too large: max %d MB\n", MAX_VECS);
		return -1;
	}
	for (int i = 0; i < blocks; i++) {
		char *buffer = (char *)valloc(1 << 20);
		if (!buffer) {
			perror("valloc");
			return -1;
		}
		memset(buffer, 0x00, 1 << 20);
		vec[i].iov_base = buffer;
		vec[i].iov_len = 1 << 20;
	}

	while (loops-- > 0) {
		if (lseek64(fd, offset_sz_bytes, SEEK_SET) < 0) {
			printf("failed on lseek, offset=%lld\n", offset_sz_bytes);
			return 0;
		}
		int ret = writev(fd, vec, blocks);
		if (ret != write_sz_bytes) {
			perror("failed to write");
			printf("write size=%lld offset=%lld\n",
			       write_sz_bytes, offset_sz_bytes);
			return -1;
		}
		printf("wrote %d bytes at offset %lld\n", ret, offset_sz_bytes);

		/* pick the next offset: random, aligned to the write size,
		 * and inside the disk */
		long long slots = diskSizeBytes / write_sz_bytes;
		offset_sz_bytes = ((long long)random() % slots) * write_sz_bytes;
	}
	return 0;
}