Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread dean gaudet
On Mon, 14 Jan 2008, NeilBrown wrote:

 
 raid5's 'make_request' function calls generic_make_request on
 underlying devices and if we run out of stripe heads, it could end up
 waiting for one of those requests to complete.
 This is bad as recursive calls to generic_make_request go on a queue
 and are not even attempted until make_request completes.
 
 So: don't make any generic_make_request calls in raid5 make_request
 until all waiting has been done.  We do this by simply setting
 STRIPE_HANDLE instead of calling handle_stripe().
 
 If we need more stripe_heads, raid5d will get called to process the
 pending stripe_heads which will call generic_make_request from a
 different thread where no deadlock will happen.
 
 
 This change by itself causes a performance hit.  So add a change so
 that raid5_activate_delayed is only called at unplug time, never in
 raid5.  This seems to bring back the performance numbers.  Calling it
 in raid5d was sometimes too soon...
 
 Cc: Dan Williams [EMAIL PROTECTED]
 Signed-off-by: Neil Brown [EMAIL PROTECTED]

probably doesn't matter, but for the record:

Tested-by: dean gaudet [EMAIL PROTECTED]

this time i tested with internal and external bitmaps and it survived 8h 
and 14h resp. under the parallel tar workload i used to reproduce the 
hang.

btw this should probably be a candidate for 2.6.22 and .23 stable.

thanks
-dean


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread dean gaudet
On Tue, 15 Jan 2008, Andrew Morton wrote:

 On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet [EMAIL PROTECTED] 
 wrote:
 
  On Mon, 14 Jan 2008, NeilBrown wrote:
  
   
   raid5's 'make_request' function calls generic_make_request on
   underlying devices and if we run out of stripe heads, it could end up
   waiting for one of those requests to complete.
   This is bad as recursive calls to generic_make_request go on a queue
   and are not even attempted until make_request completes.
   
   So: don't make any generic_make_request calls in raid5 make_request
   until all waiting has been done.  We do this by simply setting
   STRIPE_HANDLE instead of calling handle_stripe().
   
   If we need more stripe_heads, raid5d will get called to process the
   pending stripe_heads which will call generic_make_request from a
   different thread where no deadlock will happen.
   
   
   This change by itself causes a performance hit.  So add a change so
   that raid5_activate_delayed is only called at unplug time, never in
   raid5.  This seems to bring back the performance numbers.  Calling it
   in raid5d was sometimes too soon...
   
   Cc: Dan Williams [EMAIL PROTECTED]
   Signed-off-by: Neil Brown [EMAIL PROTECTED]
  
  probably doesn't matter, but for the record:
  
  Tested-by: dean gaudet [EMAIL PROTECTED]
  
  this time i tested with internal and external bitmaps and it survived 8h 
  and 14h resp. under the parallel tar workload i used to reproduce the 
  hang.
  
  btw this should probably be a candidate for 2.6.22 and .23 stable.
  
 
 hm, Neil said
 
   The first fixes a bug which could make it a candidate for 24-final. 
   However it is a deadlock that seems to occur very rarely, and has been in
   mainline since 2.6.22.  So letting it into one more release shouldn't be
   a big problem.  While the fix is fairly simple, it could have some
   unexpected consequences, so I'd rather go for the next cycle.
 
 food fight!
 

heheh.

it's really easy to reproduce the hang without the patch -- i could
hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
i'll try with ext3... Dan's experiences suggest it won't happen with ext3
(or is even more rare), which would explain why this is overall a
rare problem.

but it doesn't result in data loss or permanent system hangups as long
as you can become root and raise the size of the stripe cache...

so OK i agree with Neil, let's test more... food fight over! :)

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Thu, 10 Jan 2008, Neil Brown wrote:

 On Wednesday January 9, [EMAIL PROTECTED] wrote:
  On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
   i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
   
   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
   
   which was Neil's change in 2.6.22 for deferring generic_make_request 
   until there's enough stack space for it.
   
  
  Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
  by preventing recursive calls to generic_make_request.  However the
  following conditions can cause raid5 to hang until 'stripe_cache_size' is
  increased:
  
 
 Thanks for pursuing this guys.  That explanation certainly sounds very
 credible.
 
 The generic_make_request_immed is a good way to confirm that we have
 found the bug,  but I don't like it as a long term solution, as it
 just reintroduced the problem that we were trying to solve with the
 problematic commit.
 
 As you say, we could arrange that all request submission happens in
 raid5d and I think this is the right way to proceed.  However we can
 still take some of the work into the thread that is submitting the
 IO by calling raid5d() at the end of make_request, like this.
 
 Can you test it please?  Does it seem reasonable?
 
 Thanks,
 NeilBrown
 
 
 Signed-off-by: Neil Brown [EMAIL PROTECTED]

it has passed 11h of the untar/diff/rm linux.tar.gz workload... that's 
pretty good evidence it works for me.  thanks!

Tested-by: dean gaudet [EMAIL PROTECTED]

 
 ### Diffstat output
  ./drivers/md/md.c    |    2 +-
  ./drivers/md/raid5.c |    4 +++-
  2 files changed, 4 insertions(+), 2 deletions(-)
 
 diff .prev/drivers/md/md.c ./drivers/md/md.c
 --- .prev/drivers/md/md.c 2008-01-07 13:32:10.0 +1100
 +++ ./drivers/md/md.c 2008-01-10 11:08:02.0 +1100
 @@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
   if (mddev->ro)
   return;
  
  - if (signal_pending(current)) {
  + if (current == mddev->thread->tsk && signal_pending(current)) {
   if (mddev->pers->sync_request) {
   printk(KERN_INFO "md: %s in immediate safe mode\n",
  mdname(mddev));
 
 diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
 --- .prev/drivers/md/raid5.c  2008-01-07 13:32:10.0 +1100
 +++ ./drivers/md/raid5.c  2008-01-10 11:06:54.0 +1100
 @@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
   }
  }
  
 +static void raid5d (mddev_t *mddev);
  
  static int make_request(struct request_queue *q, struct bio * bi)
  {
 @@ -3547,7 +3548,7 @@ static int make_request(struct request_q
   goto retry;
   }
   finish_wait(&conf->wait_for_overlap, &w);
  - handle_stripe(sh, NULL);
  + set_bit(STRIPE_HANDLE, &sh->state);
   release_stripe(sh);
   } else {
   /* cannot get stripe for read-ahead, just give-up */
 @@ -3569,6 +3570,7 @@ static int make_request(struct request_q
  test_bit(BIO_UPTODATE, &bi->bi_flags)
   ? 0 : -EIO);
   }
 + raid5d(mddev);
   return 0;
  }
  


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Fri, 11 Jan 2008, Neil Brown wrote:

 Thanks.
 But I suspect you didn't test it with a bitmap :-)
 I ran the mdadm test suite and it hit a problem - easy enough to fix.

damn -- i lost my bitmap 'cause it was external and i didn't have things 
set up properly to pick it up after a reboot :)

if you send an updated patch i'll give it another spin...

-dean


Re: Raid 1, can't get the second disk added back in.

2008-01-09 Thread dean gaudet
On Tue, 8 Jan 2008, Bill Davidsen wrote:

 Neil Brown wrote:
  On Monday January 7, [EMAIL PROTECTED] wrote:

   Problem is not raid, or at least not obviously raid related.  The problem
   is that the whole disk, /dev/hdb is unavailable. 
  
  Maybe check /sys/block/hdb/holders ?  lsof /dev/hdb ?
  
  good luck :-)
  

 losetup -a may help, lsof doesn't seem to show files used in loop mounts. Yes,
 long shot...

and don't forget dmsetup ls... (followed immediately by apt-get remove 
evms if you're on an unfortunate version of ubuntu which helpfully 
installed that partition-stealing service for you.)

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-30 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote:

 On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote:
  On Sat, 29 Dec 2007, Dan Williams wrote:
 
   On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
the same 64k chunk array and had raised the stripe_cache_size to 1024...
and got a hang.  this time i grabbed stripe_cache_active before bumping
the size again -- it was only 905 active.  as i recall the bug we were
debugging a year+ ago the active was at the size when it would hang.  so
this is probably something new.
  
   I believe I am seeing the same issue and am trying to track down
   whether XFS is doing something unexpected, i.e. I have not been able
   to reproduce the problem with EXT3.  MD tries to increase throughput
   by letting some stripe work build up in batches.  It looks like every
   time your system has hung it has been in the 'inactive_blocked' state
   i.e.  3/4 of stripes active.  This state should automatically
   clear...
 
  cool, glad you can reproduce it :)
 
  i have a bit more data... i'm seeing the same problem on debian's
  2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
 
 
 This is just brainstorming at this point, but it looks like xfs can
 submit more requests in the bi_end_io path such that it can lock
 itself out of the RAID array.  The sequence that concerns me is:
 
 return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang
 
 I need verify whether this path is actually triggering, but if we are
 in an inactive_blocked condition this new request will be put on a
 wait queue and we'll never get to the release_stripe() call after
 return_io().  It would be interesting to see if this is new XFS
 behavior in recent kernels.


i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

which was Neil's change in 2.6.22 for deferring generic_make_request
until there's enough stack space for it.

with my git tree sync'd to that commit my test cases fail in under 20
minutes uptime (i rebooted and tested 3x).  sync'd to the commit previous
to it i've got 8h of run-time now without the problem.

this isn't definitive of course since it does seem to be timing
dependent, but since all failures have occured much earlier than that
for me so far i think this indicates this change is either the cause of
the problem or exacerbates an existing raid5 problem.

given that this problem looks like a very rare problem i saw with 2.6.18
(raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an
existing problem... not that i have evidence either way.

i've attached a new kernel log with a hang at d89d87965d... and the
reduced config file i was using for the bisect.  hopefully the hang
looks the same as what we were seeing at 2.6.24-rc6.  let me know.

-dean

kern.log.d89d87965d.bz2
Description: Binary data


config-2.6.21-b1.bz2
Description: Binary data


Re: [patch] improve stripe_cache_size documentation

2007-12-30 Thread dean gaudet
On Sun, 30 Dec 2007, Thiemo Nagel wrote:

 stripe_cache_size  (currently raid5 only)
 
 As far as I have understood, it applies to raid6, too.

good point... and raid4.

here's an updated patch.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-30 10:16:58.0 -0800
@@ -435,8 +435,14 @@
 
 These currently include
 
-  stripe_cache_size  (currently raid5 only)
+  stripe_cache_size  (raid4, raid5 and raid6)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
-  strip_cache_active (currently raid5 only)
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
+  strip_cache_active (raid4, raid5 and raid6)
   number of active entries in the stripe cache


Re: [patch] improve stripe_cache_size documentation

2007-12-30 Thread dean gaudet
On Sun, 30 Dec 2007, dean gaudet wrote:

 On Sun, 30 Dec 2007, Thiemo Nagel wrote:
 
  stripe_cache_size  (currently raid5 only)
  
  As far as I have understood, it applies to raid6, too.
 
 good point... and raid4.
 
 here's an updated patch.

and once again with a typo fix.  oops.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-30 14:30:40.0 -0800
@@ -435,8 +435,14 @@
 
 These currently include
 
-  stripe_cache_size  (currently raid5 only)
+  stripe_cache_size  (raid4, raid5 and raid6)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
-  strip_cache_active (currently raid5 only)
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
+  stripe_cache_active (raid4, raid5 and raid6)
   number of active entries in the stripe cache
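
as a quick sanity check of the formula above (numbers purely illustrative):
an 8-device array with 4KiB pages and the default stripe_cache_size of 256
pins about

    echo $(( 4096 * 8 * 256 ))    # 8388608 bytes, i.e. 8 MiB

and bumping stripe_cache_size to 1024 would pin 32 MiB.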



Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on 
the same 64k chunk array and had raised the stripe_cache_size to 1024... 
and got a hang.  this time i grabbed stripe_cache_active before bumping 
the size again -- it was only 905 active.  as i recall the bug we were 
debugging a year+ ago the active was at the size when it would hang.  so 
this is probably something new.

anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to 
hit that limit too if i try harder :)

btw what units are stripe_cache_size/active in?  is the memory consumed 
equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size * 
raid_disks * stripe_cache_active)?

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

 hmm this seems more serious... i just ran into it with chunksize 64KiB and 
 while just untarring a bunch of linux kernels in parallel... increasing 
 stripe_cache_size did the trick again.
 
 -dean
 
 On Thu, 27 Dec 2007, dean gaudet wrote:
 
  hey neil -- remember that raid5 hang which me and only one or two others 
  ever experienced and which was hard to reproduce?  we were debugging it 
  well over a year ago (that box has 400+ day uptime now so at least that 
  long ago :)  the workaround was to increase stripe_cache_size... i seem to 
  have a way to reproduce something which looks much the same.
  
  setup:
  
  - 2.6.24-rc6
  - system has 8GiB RAM but no swap
  - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
  - mkfs.xfs default options
  - mount -o noatime
  - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
  
  that sequence hangs for me within 10 seconds... and i can unhang / rehang 
  it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
  by watching iostat -kx /dev/sd? 5.
  
  i've attached the kernel log where i dumped task and timer state while it 
  was hung... note that you'll see at some point i did an xfs mount with 
  external journal but it happens with internal journal as well.
  
  looks like it's using the raid456 module and async api.
  
  anyhow let me know if you need more info / have any suggestions.
  
  -dean


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread dean gaudet
On Tue, 25 Dec 2007, Bill Davidsen wrote:

 The issue I'm thinking about is hardware sector size, which on modern drives
 may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle
 when writing a 512b block.

i'm not sure any shipping SATA disks have larger than 512B sectors yet... 
do you know of any?  (or is this thread about SCSI which i don't pay 
attention to...)

on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:

255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

so sda1 starts at a non-multiple of 4096 into the disk.

i ran some random seek+write experiments using
http://arctic.org/~dean/randomio/, here are the results using 512 byte
and 4096 byte writes (fsync after each write), 8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+-----------------------------------
  148.5 |    0.0   inf    nan    0.0    nan |  148.5   0.2   53.7   89.3   19.5
  129.2 |    0.0   inf    nan    0.0    nan |  129.2  37.2   61.9   96.7    9.3
  131.2 |    0.0   inf    nan    0.0    nan |  131.2  40.3   61.0   90.4    9.3
  132.0 |    0.0   inf    nan    0.0    nan |  132.0  39.6   60.6   89.3    9.1
  130.7 |    0.0   inf    nan    0.0    nan |  130.7  39.8   61.3   98.1    8.9
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  40.0   60.8  101.0    9.6
# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+-----------------------------------
  141.7 |    0.0   inf    nan    0.0    nan |  141.7   0.3   56.3   99.3   21.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  43.3   60.4   91.8    8.5
  131.6 |    0.0   inf    nan    0.0    nan |  131.6  41.4   60.9  111.0    9.6
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  41.4   60.7   85.3    8.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  41.7   61.3   95.0    9.4
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  42.2   60.8   90.5    8.4


i think the anomalous results in the first 10s samples are perhaps the drive
coming out of a standby state.

and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+-----------------------------------
  147.3 |    0.0   inf    nan    0.0    nan |  147.3   0.3   54.1   93.7   20.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  37.4   60.6   91.8    9.2
  132.5 |    0.0   inf    nan    0.0    nan |  132.5  37.7   60.3   93.7    9.3
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  39.4   60.7   92.7    9.0
  133.9 |    0.0   inf    nan    0.0    nan |  133.9  41.7   59.8   90.7    8.5
  130.2 |    0.0   inf    nan    0.0    nan |  130.2  40.8   61.5   88.6    8.9
# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+-----------------------------------
  145.4 |    0.0   inf    nan    0.0    nan |  145.4   0.3   54.9   94.0   20.1
  130.3 |    0.0   inf    nan    0.0    nan |  130.3  36.0   61.4   92.7    9.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  38.2   61.2   96.7    9.2
  132.1 |    0.0   inf    nan    0.0    nan |  132.1  39.0   60.5   93.5    9.2
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  43.1   60.8   93.8    9.1
  129.0 |    0.0   inf    nan    0.0    nan |  129.0  40.2   62.0   96.4    8.8

it looks pretty much the same to me...

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote:

 On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
  hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
  the same 64k chunk array and had raised the stripe_cache_size to 1024...
  and got a hang.  this time i grabbed stripe_cache_active before bumping
  the size again -- it was only 905 active.  as i recall the bug we were
  debugging a year+ ago the active was at the size when it would hang.  so
  this is probably something new.
 
 I believe I am seeing the same issue and am trying to track down
 whether XFS is doing something unexpected, i.e. I have not been able
 to reproduce the problem with EXT3.  MD tries to increase throughput
 by letting some stripe work build up in batches.  It looks like every
 time your system has hung it has been in the 'inactive_blocked' state
 i.e.  3/4 of stripes active.  This state should automatically
 clear...

cool, glad you can reproduce it :)

i have a bit more data... i'm seeing the same problem on debian's 
2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

i'm doing some more isolation but just grabbing kernels i have precompiled 
so far -- a 2.6.19.7 kernel doesn't show the problem, and early 
indications are a 2.6.21.7 kernel also doesn't have the problem but i'm 
giving it longer to show its head.

i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just 
so we get the debian patches out of the way.

i was tempted to blame async api because it's newish :)  but according to 
the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async 
API, and it still hung, so async is probably not to blame.

anyhow the test case i'm using is the dma_thrasher script i attached... it 
takes about an hour to give me confidence there's no problems so this will 
take a while.

-dean


[patch] improve stripe_cache_size documentation

2007-12-29 Thread dean gaudet
Document the amount of memory used by the stripe cache and the fact that 
it's tied down and unavailable for other purposes (right?).  thanks to Dan 
Williams for the formula.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-29 13:04:17.0 -0800
@@ -438,5 +438,11 @@
   stripe_cache_size  (currently raid5 only)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
   strip_cache_active (currently raid5 only)
   number of active entries in the stripe cache


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Justin Piszcz wrote:

 Curious btw what kind of filesystem size/raid type (5, but defaults I assume,
 nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache
 size/chunk size(s) are you using/testing with?

mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
mkfs.xfs -f /dev/md2

otherwise defaults

 The script you sent out earlier, you are able to reproduce it easily with 31
 or so kernel tar decompressions?

not sure, the point of the script is to untar more than there is RAM.  it 
happened with a single rsync running though -- 3.5M inodes from a remote 
box.  it also happens with the single 10GB dd write... although i've been 
using the tar method for testing different kernel revs.

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet

On Sat, 29 Dec 2007, dean gaudet wrote:

 On Sat, 29 Dec 2007, Justin Piszcz wrote:
 
  Curious btw what kind of filesystem size/raid type (5, but defaults I 
  assume,
  nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache
  size/chunk size(s) are you using/testing with?
 
 mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
 mkfs.xfs -f /dev/md2
 
 otherwise defaults

hmm i missed a few things, here's exactly how i created the array:

mdadm --create --level=5 --chunk=64 -n7 -x1 --assume-clean /dev/md2 
/dev/sd[a-h]1

it's reassembled automagically each reboot, but i do this each reboot:

mkfs.xfs -f /dev/md2
mount -o noatime /dev/md2 /mnt/new
./dma_thrasher linux.tar.gz /mnt/new

the --assume-clean and noatime probably make no difference though...

on the bisection front it looks like it's new behaviour between 2.6.21.7 
and 2.6.22.15 (stock kernels now, not debian).

i've got to step out for a while, but i'll go at it again later, probably 
with git bisect unless someone has some cherry picked changes to suggest.
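
(for reference, the usual incantation is something like

    git bisect start
    git bisect bad v2.6.22
    git bisect good v2.6.21
    # build + boot each suggested commit, run the workload, then
    git bisect good    # or: git bisect bad

repeated until it narrows things down to a single commit.)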

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
hmm this seems more serious... i just ran into it with chunksize 64KiB and 
while just untarring a bunch of linux kernels in parallel... increasing 
stripe_cache_size did the trick again.

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

 hey neil -- remember that raid5 hang which me and only one or two others 
 ever experienced and which was hard to reproduce?  we were debugging it 
 well over a year ago (that box has 400+ day uptime now so at least that 
 long ago :)  the workaround was to increase stripe_cache_size... i seem to 
 have a way to reproduce something which looks much the same.
 
 setup:
 
 - 2.6.24-rc6
 - system has 8GiB RAM but no swap
 - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
 - mkfs.xfs default options
 - mount -o noatime
 - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
 
 that sequence hangs for me within 10 seconds... and i can unhang / rehang 
 it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
 by watching iostat -kx /dev/sd? 5.
 
 i've attached the kernel log where i dumped task and timer state while it 
 was hung... note that you'll see at some point i did an xfs mount with 
 external journal but it happens with internal journal as well.
 
 looks like it's using the raid456 module and async api.
 
 anyhow let me know if you need more info / have any suggestions.
 
 -dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
On Thu, 27 Dec 2007, Justin Piszcz wrote:

 With that high of a stripe size the stripe_cache_size needs to be greater than
 the default to handle it.

i'd argue that any deadlock is a bug...

regardless i'm still seeing deadlocks with the default chunk_size of 64k 
and stripe_cache_size of 256... in this case it's with a workload which is 
untarring 34 copies of the linux kernel at the same time.  it's a variant 
of doug ledford's memtest, and i've attached it.

-dean

#!/usr/bin/perl

# Copyright (c) 2007 dean gaudet [EMAIL PROTECTED]
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.

# this idea shamelessly stolen from doug ledford

use warnings;
use strict;

# ensure stdout is not buffered
select(STDOUT); $| = 1;

my $usage = "usage: $0 linux.tar.gz /path1 [/path2 ...]\n";
defined(my $tarball = shift) or die $usage;
-f $tarball or die "$tarball does not exist or is not a file\n";

my @paths = @ARGV;
$#paths >= 0 or die $usage;

# determine size of uncompressed tarball
open(GZIP, "-|") || exec "gzip", "--quiet", "--list", $tarball;
my $line = <GZIP>;
my ($tarball_size) = $line =~ m#^\s*\d+\s*(\d+)#;
defined($tarball_size) or die "unexpected result from gzip --quiet --list $tarball\n";
close(GZIP);

# determine amount of memory
open(MEMINFO, "/proc/meminfo")
  or die "unable to open /proc/meminfo for read: $!\n";
my $total_mem;
while (<MEMINFO>) {
  if (/^MemTotal:\s*(\d+)\s*kB/) {
    $total_mem = $1;
    last;
  }
}
defined($total_mem) or die "did not find MemTotal line in /proc/meminfo\n";
close(MEMINFO);
$total_mem *= 1024;

print "total memory: $total_mem\n";
print "uncompressed tarball: $tarball_size\n";
my $nr_simultaneous = int(1.2 * $total_mem / $tarball_size);
print "nr simultaneous processes: $nr_simultaneous\n";

sub system_or_die {
  my @args = @_;
  system(@args);
  if ($? == -1) {
    my $msg = sprintf("%s failed to exec %s: $!\n",
      scalar(localtime), $args[0]);
    die $msg;
  }
  elsif ($? & 127) {
    my $msg = sprintf("%s %s died with signal %d, %s coredump\n",
      scalar(localtime), $args[0], ($? & 127), ($? & 128) ? "with" : "without");
    die $msg;
  }
  elsif (($? >> 8) != 0) {
    my $msg = sprintf("%s %s exited with non-zero exit code %d\n",
      scalar(localtime), $args[0], $? >> 8);
    die $msg;
  }
}

sub untar($) {
  mkdir($_[0]) or die localtime()." unable to mkdir($_[0]): $!\n";
  system_or_die("tar", "-xzf", $tarball, "-C", $_[0]);
}

print localtime()." untarring golden copy\n";
my $golden = $paths[0]."/dma_tmp.$$.gold";
untar($golden);

my $pass_no = 0;
while (1) {
  print localtime()." pass $pass_no: extracting\n";
  my @outputs;
  foreach my $n (1..$nr_simultaneous) {
    # treat paths in a round-robin manner
    my $dir = shift(@paths);
    push(@paths, $dir);

    $dir .= "/dma_tmp.$$.$n";
    push(@outputs, $dir);

    my $pid = fork;
    defined($pid) or die localtime()." unable to fork: $!\n";
    if ($pid == 0) {
      untar($dir);
      exit(0);
    }
  }

  # wait for the children
  while (wait != -1) {}

  print localtime()." pass $pass_no: diffing\n";
  foreach my $dir (@outputs) {
    my $pid = fork;
    defined($pid) or die localtime()." unable to fork: $!\n";
    if ($pid == 0) {
      system_or_die("diff", "-U", "3", "-rN", $golden, $dir);
      system_or_die("rm", "-fr", $dir);
      exit(0);
    }
  }

  # wait for the children
  while (wait != -1) {}

  ++$pass_no;
}


Re: external bitmaps.. and more

2007-12-11 Thread dean gaudet
On Thu, 6 Dec 2007, Michael Tokarev wrote:

 I come across a situation where external MD bitmaps
 aren't usable on any standard linux distribution
 unless special (non-trivial) actions are taken.
 
 First is a small buglet in mdadm, or two.
 
 It's not possible to specify --bitmap= in assemble
 command line - the option seems to be ignored.  But
 it's honored when specified in config file.

i think neil fixed this at some point -- i ran into it / reported 
essentially the same problems here a while ago.


 The thing is that when a external bitmap is being used
 for an array, and that bitmap resides on another filesystem,
 all common distributions fails to start/mount and to
 shutdown/umount arrays/filesystems properly, because
 all starts/stops is done in one script, and all mounts/umounts
 in another, but for bitmaps to work the two should be intermixed
 with each other.

so i've got a debian unstable box which has uptime 402 days (to give you 
an idea how long ago i last tested the reboot sequence).  it has raid1 
root and raid5 /home.  /home has an external bitmap on the root partition.

i have /etc/default/mdadm set with INITRDSTART to start only the root 
raid1 during initrd... this manages to work out later when the external 
bitmap is required.

but it is fragile... and i think it's only possible to get things to work 
with an initrd and the external bitmap on the root fs or by having custom 
initrd and/or rc.d scripts.
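
roughly what that setup looks like, with paths and UUID made up for
illustration:

    # /etc/default/mdadm -- only assemble the root raid1 from the initrd
    INITRDSTART='/dev/md0'

    # /etc/mdadm/mdadm.conf -- the /home array, bitmap file on the root fs
    ARRAY /dev/md1 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx bitmap=/var/md1-bitmap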

-dean


Re: Raid array is not automatically detected.

2007-07-17 Thread dean gaudet


On Mon, 16 Jul 2007, David Greaves wrote:

 Bryan Christ wrote:
  I do have the type set to 0xfd.  Others have said that auto-assemble only
  works on RAID 0 and 1, but just as Justin mentioned, I too have another box
  with RAID5 that gets auto assembled by the kernel (also no initrd).  I
  expected the same behavior when I built this array--again using mdadm
  instead of raidtools.
 
 Any md arrays with partition type 0xfd using a 0.9 superblock should be
 auto-assembled by a standard kernel.

no... debian (and probably ubuntu) do not build md into the kernel, they 
build it as a module, and the module does not auto-detect 0xfd.  i don't 
know anything about slackware, but i just felt it worth commenting that "a 
standard kernel" is not really descriptive enough.

-dean


Re: XFS sunit/swidth for raid10

2007-03-25 Thread dean gaudet
On Fri, 23 Mar 2007, Peter Rabbitson wrote:

 dean gaudet wrote:
  On Thu, 22 Mar 2007, Peter Rabbitson wrote:
  
   dean gaudet wrote:
On Thu, 22 Mar 2007, Peter Rabbitson wrote:

 Hi,
 How does one determine the XFS sunit and swidth sizes for a software
 raid10
 with 3 copies?
mkfs.xfs uses the GET_ARRAY_INFO ioctl to get the data it needs from
software raid and select an appropriate sunit/swidth...

although i'm not sure i agree entirely with its choice for raid10:
   So do I, especially as it makes no checks for the amount of copies (3 in
   my
   case, not 2).
   
it probably doesn't matter.
   This was essentially my question. For an array -pf3 -c1024 I get swidth =
   4 *
   sunit = 4MiB. Is it about right and does it matter at all?
  
  how many drives?
  
 
 Sorry. 4 drives, 3 far copies (so any 2 drives can fail), 1M chunk.

my mind continues to be blown by linux raid10.

so that's like raid1 on 4 disks except the copies are offset by 1/4th of 
the disk?

i think swidth = 4*sunit is the right config then -- 'cause a read of 4MiB 
will stride all 4 disks...
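
i.e. if you wanted to spell it out by hand rather than trust the ioctl,
something along the lines of (values illustrative only):

    mkfs.xfs -d su=1024k,sw=4 /dev/md0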

-dean


Re: XFS sunit/swidth for raid10

2007-03-22 Thread dean gaudet
On Thu, 22 Mar 2007, Peter Rabbitson wrote:

 dean gaudet wrote:
  On Thu, 22 Mar 2007, Peter Rabbitson wrote:
  
   Hi,
   How does one determine the XFS sunit and swidth sizes for a software
   raid10
   with 3 copies?
  
  mkfs.xfs uses the GET_ARRAY_INFO ioctl to get the data it needs from
  software raid and select an appropriate sunit/swidth...
  
  although i'm not sure i agree entirely with its choice for raid10:
 
 So do I, especially as it makes no checks for the amount of copies (3 in my
 case, not 2).
 
  it probably doesn't matter.
 
 This was essentially my question. For an array -pf3 -c1024 I get swidth = 4 *
 sunit = 4MiB. Is it about right and does it matter at all?

how many drives?

-dean


Re: mdadm: raid1 with ext3 - filesystem size differs?

2007-03-20 Thread dean gaudet
it looks like you created the filesystem on the component device before 
creating the raid.

-dean

On Fri, 16 Mar 2007, Hanno Meyer-Thurow wrote:

 Hi all!
 Please CC me on answers since I am not subscribed to this list, thanks.
 
 When I try to build a raid1 system with mdadm 2.6.1 the filesystem size
 recorded in superblock differs from physical size of device.
 
 System:
 ana ~ # uname -a
 Linux ana 2.6.20-gentoo-r2 #4 SMP PREEMPT Sat Mar 10 16:25:46 CET 2007 x86_64 
 Intel(R) Core(TM)2 CPU  6600  @ 2.40GHz GenuineIntel GNU/Linux
 
 ana ~ # mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
 mdadm: /dev/sda1 appears to contain an ext2fs file system
 size=48152K  mtime=Thu Mar 15 17:27:07 2007
 mdadm: /dev/sda1 appears to be part of a raid array:
 level=raid1 devices=2 ctime=Thu Mar 15 17:25:52 2007
 mdadm: /dev/sdb1 appears to contain an ext2fs file system
 size=48152K  mtime=Thu Mar 15 17:27:07 2007
 mdadm: /dev/sdb1 appears to be part of a raid array:
 level=raid1 devices=2 ctime=Thu Mar 15 17:25:52 2007
 Continue creating array? y
 
 mdadm: array /dev/md1 started.
 ana ~ # cat /proc/mdstat
 md1 : active raid1 sdb1[1] sda1[0]
   48064 blocks [2/2] [UU]
 
 ana ~ # mdadm --misc --detail /dev/md1
 /dev/md1:
 Version : 00.90.03
   Creation Time : Thu Mar 15 17:37:35 2007
  Raid Level : raid1
  Array Size : 48064 (46.95 MiB 49.22 MB)
   Used Dev Size : 48064 (46.95 MiB 49.22 MB)
Raid Devices : 2
   Total Devices : 2
 Preferred Minor : 1
 Persistence : Superblock is persistent
 
 Update Time : Thu Mar 15 17:38:27 2007
   State : clean
  Active Devices : 2
 Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0
 
UUID : cf0478ee:7e60a40e:20a5e204:cc7bc2c9
  Events : 0.4
 
 Number   Major   Minor   RaidDevice State
0   810  active sync   /dev/sda1
1   8   171  active sync   /dev/sdb1
 
 ana ~ # LC_ALL=C fsck.ext3 /dev/md1
 e2fsck 1.39 (29-May-2006)
 The filesystem size (according to the superblock) is 48152 blocks
 The physical size of the device is 48064 blocks
 Either the superblock or the partition table is likely to be corrupt!
 Abort<y>? yes
 
 
 
 Any ideas what could be wrong? Thank you in advance for help!
 
 
 Regards,
 Hanno


Re: Replace drive in RAID5 without losing redundancy?

2007-03-05 Thread dean gaudet


On Tue, 6 Mar 2007, Neil Brown wrote:

 On Monday March 5, [EMAIL PROTECTED] wrote:
  
  Is it possible to mark a disk as to be replaced by an existing spare,
  then migrate to the spare disk and kick the old disk _after_ migration
  has been done? Or not even kick - but mark as new spare.
 
 No, this is not possible yet.
 You can get nearly all the way there by:
 
   - add an internal bitmap.
   - fail one drive
   - --build a raid1 with that drive (and the other missing)
   - re-add the raid1 into the raid5
   - add the new drive to the raid1
   - wait for resync

i have an example at 
http://arctic.org/~dean/proactive-raid5-disk-replacement.txt... plus 
discussion as to why this isn't the best solution.
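
(very roughly, the steps neil lists map onto mdadm invocations like the
following -- device names made up, and check mdadm(8) for the exact syntax
before trying it on real data:

    mdadm --grow /dev/md0 --bitmap=internal
    mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
    mdadm --build /dev/md9 --level=1 --raid-devices=2 /dev/sdc1 missing
    mdadm /dev/md0 --re-add /dev/md9
    mdadm /dev/md9 --add /dev/sdf1

then wait for the raid1 resync to finish.)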

-dean


Re: Linux Software RAID Bitmap Question

2007-02-28 Thread dean gaudet
On Mon, 26 Feb 2007, Neil Brown wrote:

 On Sunday February 25, [EMAIL PROTECTED] wrote:
  I believe Neil stated that using bitmaps does incur a 10% performance 
  penalty.  If one's box never (or rarely) crashes, is a bitmap needed?
 
 I think I said it can incur such a penalty.  The actual cost is very
 dependant on work-load.

i did a crude benchmark recently... to get some data for a common setup
i use (external journals and bitmaps on raid1, xfs fs on raid5).

emphasis on crude:

time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'

xfs journal    raid5 bitmap    times
internal       none            0.18s user 2.14s system 2% cpu 1:27.95 total
internal       internal        0.16s user 2.16s system 1% cpu 2:01.12 total
raid1          none            0.07s user 2.02s system 2% cpu 1:20.62 total
raid1          internal        0.14s user 2.01s system 1% cpu 1:55.18 total
raid1          raid1           0.14s user 2.03s system 2% cpu 1:20.61 total


raid5:
- 4x seagate 7200.10 400GB on marvell MV88SX6081
- mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1

raid1:
- 2x maxtor 6Y200P0 on 3ware 7504
- two 128MiB partitions starting at cyl 1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 
/dev/sd[fg]1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 
/dev/sd[fg]2
- md1 is used for external xfs journal
- md2 has an ext3 filesystem for the external md4 bitmap

xfs:
- mkfs.xfs issued before each run using the defaults (aside from -l 
logdev=/dev/md1)
- mount -o noatime,nodiratime[,logdev=/dev/md1] 

system:
- dual opteron 848 (2.2ghz), 8GiB ddr 266
- tyan s2882
- 2.6.20

-dean


Re: Reshaping raid0/10

2007-02-21 Thread dean gaudet
On Thu, 22 Feb 2007, Neil Brown wrote:

 On Wednesday February 21, [EMAIL PROTECTED] wrote:
  Hello,
  
  
  
  are there any plans to support reshaping
  on raid0 and raid10?
  
 
 No concrete plans.  It largely depends on time and motivation.
 I expect that the various flavours of raid5/raid6 reshape will come
 first.
 Then probably converting raid0-raid5.
 
 I really haven't given any thought to how you might reshape a
 raid10...

i've got a 4x250 near2 i want to turn into a 4x750 near2.  i was 
considering doing straight dd from each of the 250 to the respective 750 
then doing an mdadm --create on the 750s (in the same ordering as the 
original array)... so i'd end up with a new array with more stripes.  it 
seems like this should work.

the same thing should work for all nearN with a multiple of N disks... and 
offsetN should work as well right?  but farN sounds like a nightmare.

if we had a generic proactive disk replacement method it could handle 
the 4x250-4x750 step.  (i haven't decided yet if i want to try my hacky 
bitmap method of doing proactive replacement... i'm not sure what'll 
happen if i add a 750GB disk back into an array with 250s... i suppose 
it'll work... i'll have to experiment.)
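
an untested sketch of that idea, with made-up device names (old 250s are
sd[abcd]1, new 750s are sd[efgh]1):

    mdadm --stop /dev/md3
    dd if=/dev/sda1 of=/dev/sde1 bs=1M    # ...likewise b->f, c->g, d->h
    mdadm --create /dev/md3 --level=10 --layout=n2 --raid-devices=4 \
        /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1

keeping the device ordering identical to the original array.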

-dean


Re: md autodetect only detects one disk in raid1

2007-01-27 Thread dean gaudet
take a look at your mdadm.conf ... both on your root fs and in your 
initrd... look for a DEVICE line and make sure it says "DEVICE 
partitions"... anything else is likely to cause problems like below.

also make sure each array is specified by UUID rather than device.

and then rebuild your initrd.  (dpkg-reconfigure linux-image-`uname -r` on 
debuntu).
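
i.e. something like this (UUIDs made up for illustration):

    DEVICE partitions
    ARRAY /dev/md0 UUID=f1a2b3c4:d5e6f708:90a1b2c3:d4e5f607
    ARRAY /dev/md1 UUID=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9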

that "something else in the system claim use of the device" problem makes 
me guess you're on ubuntu pre-edgy... where for whatever reason they 
included evms in the default install and for whatever inane reason evms 
steals every damn device in the system when it starts up.  
uninstall/deactivate evms if you're not using it.

-dean

On Sat, 27 Jan 2007, kenneth johansson wrote:

 I run raid1 on my root partition /dev/md0. Now I had a bad disk so I had
 to replace it but did not notice until I got home that I got a SATA
 instead of a PATA. Since I had a free sata interface I just put in in
 that. I had no problem adding the disk to the raid1 device that is until
 I rebooted the computer. 
 
 both the PATA disk and the SATA disk are detected before md start up the
 raid but only the PATA disk is activated. So the raid device is always
 booting in degraded mode. since this is the root disk I use the
 autodetect feature with partition type fd.
 
 Also Something else in the system claim use of the device since I can
 not add the SATA disk after the system has done a complete boot. I guess
 it has something to do with device mapper and LVM that I also run on the
 data disks but I'm not sure. any tip on what it can be??
 
 If I add the SATA disk to md0 early enough in the boot it works but why
 is it not autodetected ?
 
 
 


Re: bad performance on RAID 5

2007-01-18 Thread dean gaudet
On Wed, 17 Jan 2007, Sevrin Robstad wrote:

 I'm suffering from bad performance on my RAID5.
 
 a echo check > /sys/block/md0/md/sync_action
 
 gives a speed at only about 5000K/sec , and HIGH load average :
 
 # uptime
 20:03:55 up 8 days, 19:55,  1 user,  load average: 11.70, 4.04, 1.52

iostat -kx /dev/sd? 10  ... and sum up the total IO... 

also try increasing sync_speed_min/max

and a loadavg jump like that suggests to me you have other things 
competing for the disk at the same time as the check.

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread dean gaudet
On Mon, 15 Jan 2007, Robin Bowes wrote:

 I'm running RAID6 instead of RAID5+1 - I've had a couple of instances
 where a drive has failed in a RAID5+1 array and a second has failed
 during the rebuild after the hot-spare had kicked in.

if the failures were read errors without losing the entire disk (the 
typical case) then new kernels are much better -- on read error md will 
reconstruct the sectors from the other disks and attempt to write it back.

you can also run monthly checks...

echo check > /sys/block/mdX/md/sync_action

it'll read the entire array (parity included) and correct read errors as 
they're discovered.

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread dean gaudet
On Mon, 15 Jan 2007, berk walker wrote:

 dean gaudet wrote:
  echo check > /sys/block/mdX/md/sync_action
  
  it'll read the entire array (parity included) and correct read errors as
  they're discovered.

 
 Could I get a pointer as to how I can do this check in my FC5 [BLAG] system?
 I can find no appropriate check, nor md available to me.  It would be a
 good thing if I were able to find potentially weak spots, rewrite them to
 good, and know that it might be time for a new drive.
 
 All of my arrays have drives of approx the same mfg date, so the possibility
 of more than one showing bad at the same time can not be ignored.

it should just be:

echo check > /sys/block/mdX/md/sync_action

if you don't have a /sys/block/mdX/md/sync_action file then your kernel is 
too old... or you don't have /sys mounted... (or you didn't replace X with 
the raid number :)

iirc there were kernel versions which had the sync_action file but didn't 
yet support the check action (i think possibly even as recent as 2.6.17 
had a small bug initiating one of the sync_actions but i forget which 
one).  if you can upgrade to 2.6.18.x it should work.

debian unstable (and i presume etch) will do this for all your arrays 
automatically once a month.

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread dean gaudet
On Mon, 15 Jan 2007, Mr. James W. Laferriere wrote:

   Hello Dean ,
 
 On Mon, 15 Jan 2007, dean gaudet wrote:
 ...snip...
  it should just be:
  
  echo check > /sys/block/mdX/md/sync_action
  
  if you don't have a /sys/block/mdX/md/sync_action file then your kernel is
  too old... or you don't have /sys mounted... (or you didn't replace X with
  the raid number :)
  
  iirc there were kernel versions which had the sync_action file but didn't
  yet support the check action (i think possibly even as recent as 2.6.17
  had a small bug initiating one of the sync_actions but i forget which
  one).  if you can upgrade to 2.6.18.x it should work.
  
  debian unstable (and i presume etch) will do this for all your arrays
  automatically once a month.
  
  -dean
 
   Being able to run a 'check' is a good thing (tm) .  But without a
 method to acquire statii & data back from the check ,  Seems rather bland .
 Is there a tool/file to poll/... where data & statii can be acquired ?

i'm not 100% certain what you mean, but i generally just monitor dmesg for 
the md read error message (mind you the message pre-2.6.19 or .20 isn't 
very informative but it's obvious enough).

there is also a file mismatch_cnt in the same directory as sync_action ... 
the Documentation/md.txt (in 2.6.18) refers to it incorrectly as 
mismatch_count... but anyhow why don't i just repaste the relevant portion 
of md.txt.

-dean

...

Active md devices for levels that support data redundancy (1,4,5,6)
also have

   sync_action
 a text file that can be used to monitor and control the rebuild
 process.  It contains one word which can be one of:
   resync    - redundancy is being recalculated after unclean
   shutdown or creation
   recover   - a hot spare is being built to replace a
   failed/missing device
   idle  - nothing is happening
   check - A full check of redundancy was requested and is
   happening.  This reads all block and checks
   them. A repair may also happen for some raid
   levels.
   repair- A full check and repair is happening.  This is
   similar to 'resync', but was requested by the
   user, and the write-intent bitmap is NOT used to
   optimise the process.

  This file is writable, and each of the strings that could be
  read are meaningful for writing.

   'idle' will stop an active resync/recovery etc.  There is no
   guarantee that another resync/recovery may not be automatically
   started again, though some event will be needed to trigger
   this.
'resync' or 'recovery' can be used to restart the
   corresponding operation if it was stopped with 'idle'.
'check' and 'repair' will start the appropriate process
   providing the current state is 'idle'.

   mismatch_count
  When performing 'check' and 'repair', and possibly when
  performing 'resync', md will count the number of errors that are
  found.  The count in 'mismatch_cnt' is the number of sectors
  that were re-written, or (for 'check') would have been
  re-written.  As most raid levels work in units of pages rather
  than sectors, this my be larger than the number of actual errors
  by a factor of the number of sectors in a page.
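
so in practice polling the status is just a matter of reading a couple of
sysfs files, e.g.:

    cat /sys/block/md0/md/sync_action    # "check" while running, "idle" when done
    cat /proc/mdstat                     # shows progress during the check
    cat /sys/block/md0/md/mismatch_cnt   # sectors rewritten (or that would have been)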



Re: raid5 software vs hardware: parity calculations?

2007-01-13 Thread dean gaudet
On Sat, 13 Jan 2007, Robin Bowes wrote:

 Bill Davidsen wrote:
 
  There have been several recent threads on the list regarding software
  RAID-5 performance. The reference might be updated to reflect the poor
  write performance of RAID-5 until/unless significant tuning is done.
  Read that as tuning obscure parameters and throwing a lot of memory into
  stripe cache. The reasons for hardware RAID should include performance
  of RAID-5 writes is usually much better than software RAID-5 with
  default tuning.
 
 Could you point me at a source of documentation describing how to
 perform such tuning?
 
 Specifically, I have 8x500GB WD STAT drives on a Supermicro PCI-X 8-port
 SATA card configured as a single RAID6 array (~3TB available space)

linux sw raid6 small write performance is bad because it reads the entire 
stripe, merges the small write, and writes back the changed disks.  
unlike raid5 where a small write can get away with a partial stripe read 
(i.e. the smallest raid5 write will read the target disk, read the parity, 
write the target, and write the updated parity)... afaik this optimization 
hasn't been implemented in raid6 yet.
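
(to put rough numbers on that: a single-chunk write on raid5 costs about 4
I/Os -- read old data, read old parity, write new data, write new parity --
whereas raid6 without that optimization reads all the other data chunks in
the stripe and then writes data, P and Q, so on an 8-drive raid6 the same
small write turns into roughly 5 reads plus 3 writes.)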

depending on your use model you might want to go with raid5+spare.  
benchmark if you're not sure.

for raid5/6 i always recommend experimenting with moving your fs journal 
to a raid1 device instead (on separate spindles -- such as your root 
disks).

if this is for a database or fs requiring lots of small writes then 
raid5/6 are generally a mistake... raid10 is the only way to get 
performance.  (hw raid5/6 with nvram support can help a bit in this area, 
but you just can't beat raid10 if you need lots of writes/s.)

beyond those config choices you'll want to become friendly with /sys/block 
and all the myriad of subdirectories and options under there.

in particular:

/sys/block/*/queue/scheduler
/sys/block/*/queue/read_ahead_kb
/sys/block/*/queue/nr_requests
/sys/block/mdX/md/stripe_cache_size

for * = any of the component disks or the mdX itself...

some systems have an /etc/sysfs.conf you can place these settings in to 
have them take effect on reboot.  (sysfsutils package on debuntu)
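
e.g. in /etc/sysfs.conf, paths are relative to /sys and the values here are
purely illustrative:

    block/sda/queue/scheduler = deadline
    block/sda/queue/read_ahead_kb = 512
    block/sda/queue/nr_requests = 256
    block/md3/md/stripe_cache_size = 1024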

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-12 Thread dean gaudet
On Thu, 11 Jan 2007, James Ralston wrote:

 I'm having a discussion with a coworker concerning the cost of md's
 raid5 implementation versus hardware raid5 implementations.
 
 Specifically, he states:
 
  The performance [of raid5 in hardware] is so much better with the
  write-back caching on the card and the offload of the parity, it
  seems to me that the minor increase in work of having to upgrade the
  firmware if there's a buggy one is a highly acceptable trade-off to
  the increased performance.  The md driver still commits you to
  longer run queues since IO calls to disk, parity calculator and the
  subsequent kflushd operations are non-interruptible in the CPU.  A
  RAID card with write-back cache releases the IO operation virtually
  instantaneously.
 
 It would seem that his comments have merit, as there appears to be
 work underway to move stripe operations outside of the spinlock:
 
 http://lwn.net/Articles/184102/
 
 What I'm curious about is this: for real-world situations, how much
 does this matter?  In other words, how hard do you have to push md
 raid5 before doing dedicated hardware raid5 becomes a real win?

hardware with battery backed write cache is going to beat the software at 
small write traffic latency essentially all the time but it's got nothing 
to do with the parity computation.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a RAID1--superblock problems

2006-12-12 Thread dean gaudet
On Tue, 12 Dec 2006, Jonathan Terhorst wrote:

 I need to shrink a RAID1 array and am having trouble with the
 persistent superblock; namely, mdadm --grow doesn't seem to relocate
 it. If I downsize the array and then shrink the corresponding
 partitions, the array fails since the superblock (which is normally
 located near the end of the device) now lays outside of the
 partitions. Is there any easier way to deal with this than digging
 into the mdadm source, manually calculating the superblock offset and
 dd'ing it to the right spot?

i'd think it'd be easier to recreate the array using --assume-clean after 
the shrink.  for raid1 it's extra easy because you don't need to get the 
disk ordering correct.

in fact with raid1 you don't even need to use mdadm --grow... you could do 
something like the following (assuming you've already shrunk the 
filesystem):

mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sda1
mdadm --zero-superblock /dev/sdb1
fdisk /dev/sda  ... shrink partition
fdisk /dev/sdb  ... shrink partition
mdadm --create --assume-clean --level=1 -n2 /dev/md0 /dev/sd[ab]1

heck that same technique works for raid0/4/5/6 and raid10 near and 
offset layouts as well, doesn't it?  raid10 far layout definitely 
needs blocks rearranged to shrink.  in these other modes you'd need to be 
careful about recreating the array with the correct ordering of disks.

the zero-superblock step is an important defense against future problems 
with "assemble every array i find"-type initrds that are unfortunately 
becoming common (i.e. debian and ubuntu).

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Observations of a failing disk

2006-11-27 Thread dean gaudet
On Tue, 28 Nov 2006, Richard Scobie wrote:

 Anyway, my biggest concern is why
 
 echo repair > /sys/block/md5/md/sync_action
 
 appeared to have no effect at all, when I understand that it should re-write
 unreadable sectors?

i've had the same thing happen on a seagate 7200.8 pata 400GB... and went 
through the same sequence of operations you described, and the dd fixed 
it.

one theory was that i lucked out and the pending sectors were in the unused 
part of the disk near the md superblock... but since that's in general only 
about 90KB of disk i was kind of skeptical.  it's certainly possible, but 
seems unlikely.

another theory is that a pending sector doesn't always result in a read 
error -- i.e. depending on temperature?  but the question is, why wouldn't 
the disk try rewriting it if it does get a successful read.
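fwiw an easy way to keep an eye on that theory is to watch the counters 
directly (needs smartmontools; the device name is just an example):

smartctl -A /dev/sdc | egrep 'Current_Pending_Sector|Reallocated'

if a pending sector does get rewritten successfully you'd expect the 
pending count to drop, with the realloc count either unchanged or bumped 
by one.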

i wish hard drives were a little less voodoo.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid 1 (non) performance

2006-11-19 Thread dean gaudet
On Wed, 15 Nov 2006, Magnus Naeslund(k) wrote:

 # cat /proc/mdstat
 Personalities : [raid1]
 md2 : active raid1 sda3[0] sdb3[1]
   236725696 blocks [2/2] [UU]
 
 md1 : active raid1 sda2[0] sdb2[1]
   4192896 blocks [2/2] [UU]
 
 md0 : active raid1 sda1[0] sdb1[1]
   4192832 blocks [2/2] [UU]

i see you have split /var and / on the same spindle... if your /home is on 
/ then you're causing extra seek action by having two active filesystems 
on the same spindles.  another option to consider is to make / small and 
mostly read-only and move /home to /var/home (and use a symlink or mount 
--bind to place it at /home).

or just put everything in one big / filesystem.

hopefully your swap isn't being used much anyhow.

try "iostat -kx /dev/sd* 5" and see if the split is causing you trouble 
-- i.e. i/o activity on more than one partition at once.


 I've tried to modify the queuing by doing this, to disable the write cache 
 and enable CFQ. The CFQ choice is rather random.
 
 for disk in sda sdb; do
   blktool /dev/$disk wcache off
   hdparm -q -W 0 /dev/$disk

turning off write caching is a recipe for disastrous performance on most 
ata disks... unfortunately.  better to buy a UPS and set up nut or apcupsd 
or something to handle shutdown.  or just take your chances.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: safest way to swap in a new physical disk

2006-11-18 Thread dean gaudet
On Tue, 14 Nov 2006, Will Sheffler wrote:

 Hi.
 
 What is the safest way to switch out a disk in a software raid array created
 with mdadm? I'm not talking about replacing a failed disk, I want to take a
 healthy disk in the array and swap it for another physical disk. Specifically,
 I have an array made up of 10 250gb software-raid partitions on 8 300gb disks
 and 2 250gb disks, plus a hot spare. I want to switch the 250s to new 300gb
 disks so everything matches. Is there a way to do this without risking a
 rebuild? I can't back everything up, so I want to be as risk-free as possible.
 
 I guess what I want is to do something like this:
 
 (1) Unmount the array
 (2) Un-create the array
 (3) Somehow exactly duplicate partition X to a partition Y on a new disk
 (4) Re-create array with X gone and Y in it's place
 (5) Check if the array is OK without changing/activating it
 (6) If there is a problem, switch from Y back to X and have it as though
 nothing changed
 
 The part I'm worried about is (3), as I've tried duplicating partition images
 before and it never works right. Is there a way to do this with mdadm?

if you have a recent enough kernel (2.6.15 i think) and recent enough 
mdadm (2.2.x i think) you can do this all online without losing redundancy 
for more than a few seconds... i placed a copy of instructions and further 
discussions of what types of problems this method has here:

http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

it's actually perfect for your situation.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-11-15 Thread dean gaudet
and i haven't seen it either... neil do you think your latest patch was 
hiding the bug?  'cause there was an iteration of an earlier patch which 
didn't produce much spam in dmesg but the bug was still there, then there 
is the version below which spams dmesg a fair amount but i didn't see the 
bug in ~30 days.

btw i've upgraded that box to 2.6.18.2 without the patch (it had some 
conflicts)... haven't seen the bug yet though (~10 days so far).

hmm i wonder if i could reproduce it more rapidly if i lowered 
/sys/block/mdX/md/stripe_cache_size.  i'll give that a go.
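i.e. something along these lines (md4 and the value are just examples -- 
small enough that it ought to hurt):

cat /sys/block/md4/md/stripe_cache_size         # note the current value
echo 64 > /sys/block/md4/md/stripe_cache_size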

-dean


On Tue, 14 Nov 2006, Chris Allen wrote:

 You probably guessed that no matter what I did, I never, ever saw the problem
 when your
 trace was installed. I'd guess at some obscure timing-related problem. I can
 still trigger it
 consistently with a vanilla 2.6.17_SMP though, but again only when bitmaps are
 turned on.
 
 
 
 Neil Brown wrote:
  On Tuesday October 10, [EMAIL PROTECTED] wrote:

   Very happy to. Let me know what you'd like me to do.
   
  
  Cool thanks.
  
  At the end is a patch against 2.6.17.11, though it should apply against
  any later 2.6.17 kernel.
  Apply this and reboot.
  
  Then run
  
 while true
 do cat /sys/block/mdX/md/stripe_cache_active
    sleep 10
 done > /dev/null
  
  (maybe write a little script or whatever).  Leave this running. It
   affects the check for "has raid5 hung".  Make sure to change mdX to
  whatever is appropriate.
  
  Occasionally look in the kernel logs for
 plug problem:
  
  if you find that, send me the surrounding text - there should be about
  a dozen lines following this one.
  
  Hopefully this will let me know which is last thing to happen: a plug
  or an unplug.
  If the last is a plug, then the timer really should still be
  pending, but isn't (this is impossible).  So I'll look more closely at
  that option.
  If the last is an unplug, then the 'Plugged' flag should really be
  clear but it isn't (this is impossible).  So I'll look more closely at
  that option.
  
  Dean is running this, but he only gets the hang every couple of
  weeks.  If you get it more often, that would help me a lot.
  
  Thanks,
  NeilBrown
  
  
   diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c
   --- ./.patches/orig/block/ll_rw_blk.c   2006-08-21 09:52:46.0 +1000
   +++ ./block/ll_rw_blk.c 2006-10-05 11:33:32.0 +1000
   @@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_
     * This is called with interrupts off and no requests on the queue and
     * with the queue lock held.
     */
   +static atomic_t seq = ATOMIC_INIT(0);
    void blk_plug_device(request_queue_t *q)
    {
           WARN_ON(!irqs_disabled());
   @@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q)
                   return;
           if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
   +               q->last_plug = jiffies;
   +               q->plug_seq = atomic_read(&seq);
   +               atomic_inc(&seq);
                   mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
                   blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
   -       }
   +       } else
   +               q->last_plug_skip = jiffies;
   +       if (!timer_pending(&q->unplug_timer) &&
   +           !q->unplug_work.pending)
   +               printk("Neither Timer or work are pending\n");
    }
    EXPORT_SYMBOL(blk_plug_device);
   @@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q)
    {
           WARN_ON(!irqs_disabled());
   -       if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
   +       if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
   +               q->last_unplug_skip = jiffies;
                   return 0;
   +       }
           del_timer(&q->unplug_timer);
   +       q->last_unplug = jiffies;
   +       q->unplug_seq = atomic_read(&seq);
   +       atomic_inc(&seq);
   +       if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
   +               printk("queue still (or again) plugged\n");
           return 1;
    }
   @@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc
    static void blk_unplug_work(void *data)
    {
           request_queue_t *q = data;
   -
   +       q->last_unplug_work = jiffies;
           blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
                           q->rq.count[READ] + q->rq.count[WRITE]);
   @@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned
           blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
                           q->rq.count[READ] + q->rq.count[WRITE]);
   +       q->last_unplug_timeout = jiffies;
           kblockd_schedule_work(&q->unplug_work);
    }
   
   diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c
   --- ./.patches/orig/drivers/md/raid1.c  2006-08-10 17:28:01.0 +1000
   +++ ./drivers/md/raid1.c 2006-09-04 21:58:31.0 +1000
   @@ -1486,7 +1486,6 @@ static void raid1d(mddev_t *mddev)
                   d = conf->raid_disks;
                   d--;
                   rdev = 

Re: RAID5 array showing as degraded after motherboard replacement

2006-11-07 Thread dean gaudet
On Wed, 8 Nov 2006, James Lee wrote:

  However I'm still seeing the error messages in my dmesg (the ones I
  posted earlier), and they suggest that there is some kind of hardware
  fault (based on a quick Google of the error codes).  So I'm a little
  confused.

the fact that the error is in a geometry command really makes me wonder...

did you compare the number of blocks on the device vs. what seems to be 
available when it's on the weird raid card?

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is my RAID broken?

2006-11-06 Thread dean gaudet
On Mon, 6 Nov 2006, Mikael Abrahamsson wrote:

 On Mon, 6 Nov 2006, Neil Brown wrote:
 
  So it looks like you machine recently crashed (power failure?) and it is
  restarting.
 
 Or upgrade some part of the OS and now it'll do resync every week or so (I
 think this is debian default nowadays, don't know the interval though).

it should be only once a month... and it's just a check -- it reads 
everything and corrects errors.

i think it's a great thing actually... way more useful than smart long 
self-tests because md can reconstruct read errors immediately -- before 
you lose redundancy in that stripe.
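you can also kick one off by hand instead of waiting for the cron job 
below -- roughly, assuming a 2.6.16 or later kernel and substituting your 
own array for md0:

echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat                 # progress shows up like a resync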

-dean

% cat /etc/cron.d/mdadm
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft [EMAIL PROTECTED]
# distributed under the terms of the Artistic Licence 2.0
#
# $Id: mdadm.cron.d 147 2006-08-30 09:26:11Z madduck $
#

# By default, run at 01:06 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
6 1 * * 0 root [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ] && /usr/share/mdadm/checkarray --cron --all --quiet

Re: mdadm 2.5.5 external bitmap assemble problem

2006-11-06 Thread dean gaudet
On Mon, 6 Nov 2006, Neil Brown wrote:

  hey i have another related question... external bitmaps seem to pose a bit 
  of a chicken-and-egg problem.  all of my filesystems are md devices. with 
  an external bitmap i need at least one of the arrays to start, then have 
  filesystems mounted, then have more arrays start... it just happens to 
  work OK if i let debian unstale initramfs try to start all my arrays, 
  it'll fail for the ones needing bitmap.  then later /etc/init.d/mdadm-raid 
  should start the array.  (well it would if the bitmap= in mdadm.conf 
  worked :)
  
  is it possible to put bitmaps on devices instead of files?  mdadm seems to 
  want a --force for that (because the device node exists already) and i 
  haven't tried forcing it.  although i suppose a 200KB partition would be 
  kind of tiny but i could place the bitmap right beside the external 
  transaction log for the filesystem on the raid5.
 
 Create the root filesystem with --bitmap=internal, and store all the
 other bitmaps on that filesystem maybe?

yeah i only have the one external bitmap (it's for a large raid5)... so 
things will work fine once i apply your patch.  thanks.

 I don't know if it would work to have a bitmap on a device, but you
 can always mkfs the device, mount it, and put a bitmap on a file
 there??

yeah this was the first thing i tried after i found mdadm -b /dev/foo 
wasn't accepted...

without modifying startup scripts there's no way to use any filesystem 
other than root... it's just due to ordering of init scripts:

# ls /etc/rcS.d | grep -i 'mount\|raid'
S02mountkernfs.sh
S04mountdevsubfs.sh
S25mdadm-raid
S35mountall.sh
S36mountall-bootclean.sh
S45mountnfs.sh
S46mountnfs-bootclean.sh

i'd need to run another mdadm-raid after the S35mountall, and then another 
mountall.

anyhow, i don't think you need to change anything (except maybe a note in 
the docs somewhere), i'm just bringing it up as part of the experience of 
trying external bitmap.  i suspect that in the wild and crazy direction 
debian and ubuntu are heading (ditching sysvinit for event-driven systems) 
it'll be easy to express the boot dependencies.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 array showing as degraded after motherboard replacement

2006-11-06 Thread dean gaudet


On Mon, 6 Nov 2006, James Lee wrote:

 Thanks for the reply Dean.  I looked through dmesg output from the
 boot up, to check whether this was just an ordering issue during the
 system start up (since both evms and mdadm attempt to activate the
 array, which could cause things to go wrong...).
 
 Looking through the dmesg output though, it looks like the 'missing'
 disk is being detected before the array is assembled, but that the
 disk is throwing up errors.  I've attached the full output of dmesg;
 grepping it for hde gives the following:
 
 [17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings:
 hde:DMA, hdf:DMA
 [17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
 [17179575.312000] hde: max request size: 512KiB
 [17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
 [17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady
 SeekComplete Error }
 [17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
 [17179575.312000] hde: cache flushes supported

is it possible that the NetCell SyncRAID implementation is stealing some 
of the sectors (even though it's marked JBOD)?  anyhow it could be the 
disk is bad, but i'd still be tempted to see if the problem stays with the 
controller if you swap the disk with another in the array.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.

2006-11-06 Thread dean gaudet
On Mon, 6 Nov 2006, Neil Brown wrote:

 This creates a deep disconnect between udev and md.
 udev expects a device to appear first, then it created the
 device-special-file in /dev.
 md expect the device-special-file to exist first, and then created the
 device on the first open.

could you create a special /dev/mdx device which is used to 
assemble/create arrays only?  i mean literally mdx not mdX where X is 
a number.  mdx would always be there if md module is loaded... so udev 
would see the driver appear and then create the /dev/mdx.  then mdadm 
would use /dev/mdx to do assemble/creates/whatever and cause other devices 
to appear/disappear in a manner which udev is happy with.

(much like how /dev/ptmx is used to create /dev/pts/N entries.)

doesn't help legacy mdadm binaries... but seems like it fits the New World 
Order.

or hm i suppose the New World Order is to eschew binary interfaces and 
suggest a /sys/class/md/ hierarchy with a bunch of files you have to splat 
ascii data into to cause an array to be created/assembled.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Checking individual drive state

2006-11-05 Thread dean gaudet
On Sun, 5 Nov 2006, Bradshaw wrote:

 I've recently built a smallish RAID5 box as a storage area for my home
 network, using mdadm. However, one of the drives will not remain in the array
 for longer that around two days before it is removed. Readding it to the array
 does not throw any errors, leading me to believe that it's probably a problem
 with the controller, which is an add-in SATA card, as well as the other drive
 connected to it failing once.
 
 I don't know how to scan the one disk for bad sectors, stopping the array and
 doing an fsck or similar throws errors, so I need help in determining whether
 the disc itself is faulty.

try swapping the cable first.  after that swap ports with another disk and 
see if the problem follows the port or the disk.

you can see if smartctl -a (from smartmontools) tells you anything 
interesting.  (it can be quite difficult, to impossible, to understand 
smartctl -a output though.  but if you've got errors in the SMART error 
log that's a good place to start.)


 If the controller is to be replaced, how would I go about migrating the two
 discs to the new controller whilst maintaining the array?

it depends on which method you're using to assemble the array at boot 
time.  in most cases if these aren't your root disks then a swap of two 
disks won't result in any troubles reassembling the array.  other device 
renames may cause problems depending on your distribution though -- but 
generally when two devices swap names within an array you should be fine.

you'll want to do the disk swap with the array offline (either shutdown 
the box or mdadm --stop the array).

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 array showing as degraded after motherboard replacement

2006-11-05 Thread dean gaudet
On Sun, 5 Nov 2006, James Lee wrote:

 Hi there,
 
 I'm running a 5-drive software RAID5 array across two controllers.
 The motherboard in that PC recently died - I sent the board back for
 RMA.  When I refitted the motherboard, connected up all the drives,
 and booted up I found that the array was being reported as degraded
 (though all the data on it is intact).  I have 4 drives on the on
 board controller and 1 drive on an XFX Revo 64 SATA controller card.
 The drive which is being reported as not being in the array is the one
 connected to the XFX controller.
 
 The OS can see that drive fine, and mdadm --examine on that drive
 shows that it is part of the array and that there are 5 active devices
 in the array.  Doing mdadm --examine on one of the other four drives
 shows that the array has 4 active drives and one failed.  mdadm
 --detail for the array also shows 4 active and one failed.

that means the array was assembled without the 5th disk and is currently 
degraded.


 Now I haven't lost any data here and I know I can just force a resync
 of the array which is fine.  However I'm concerned about how this has
 happened.  One worry is that the XFX SATA controller is doing
 something funny to the drive.  I've noticed that it's BIOS has
 defaulted to RAID0 mode (even though there's only one drive on it) - I
 can't see how this would cause any particular problems here though.  I
 guess it's possible that some data on the drive got corrupted when the
 motherboard failed...

no it's more likely the devices were renamed or the 5th device didn't come 
up before the array was assembled.

it's possible that a different bios setting lead to the device using a 
different driver than is in your initrd... but i'm just guessing.

 Any ideas what could cause mdadm to report as I've described above
 (I've attached the output of these three commands)?  I'm running
 Ubuntu Edgy, which is a 2.17.x kernel, and mdadm 2.4.1.  In case it's
 relevant here, I created the array using EVMS...

i've never created an array with evms... but my guess is that it may have 
used mapped device names instead of the normal device names.  take a 
look at /proc/mdstat and see what devices are in the array and use those 
as a template to find the name of the missing device.  below i'll use 
/dev/sde1 as the example missing device and /dev/md0 as the example array.

first thing i'd try is something like this:

mdadm /dev/md0 -a /dev/sde1

which hotadds the device into the array... which will start a resync.

when the resync is done (cat /proc/mdstat) do this.

mdadm -Gb internal /dev/md0

which will add write-intent bitmaps to your device... which will avoid 
another long wait for a resync after the next reboot if the fix below 
doesn't help.

then do this:

dpkg-reconfigure linux-image-`uname -r`

which will rebuild the initrd for your kernel ... and if it was a driver 
change this should include the new driver into the initrd.

then reboot and see if it comes up fine.  if it doesn't, you can repeat 
the -a /dev/sde1 command above... the resync will be quick this time due 
to the bitmap... and we'll have to investigate further.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mdadm 2.5.5 external bitmap assemble problem

2006-11-04 Thread dean gaudet
i think i've got my mdadm.conf set properly for an external bitmap -- but 
it doesn't seem to work.  i can assemble from the command-line fine 
though:

# grep md4 /etc/mdadm/mdadm.conf
ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc

# mdadm -A /dev/md4
mdadm: Could not open bitmap file

# mdadm -A --uuid=dbc3be0b:b5853930:a02e038c:13ba8cdc --bitmap=/bitmap.md4 
/dev/md4
mdadm: /dev/md4 has been started with 5 drives and 1 spare.

# mdadm --version
mdadm - v2.5.5 - 23 October 2006

(this is on debian unstale)

btw -- mdadm seems to create the bitmap file with world readable perms.
i doubt it matters, but 600 would seem like a better mode.

hey i have another related question... external bitmaps seem to pose a bit 
of a chicken-and-egg problem.  all of my filesystems are md devices. with 
an external bitmap i need at least one of the arrays to start, then have 
filesystems mounted, then have more arrays start... it just happens to 
work OK if i let debian unstale initramfs try to start all my arrays, 
it'll fail for the ones needing bitmap.  then later /etc/init.d/mdadm-raid 
should start the array.  (well it would if the bitmap= in mdadm.conf 
worked :)

is it possible to put bitmaps on devices instead of files?  mdadm seems to 
want a --force for that (because the device node exists already) and i 
haven't tried forcing it.  although i suppose a 200KB partition would be 
kind of tiny but i could place the bitmap right beside the external 
transaction log for the filesystem on the raid5.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5/10 chunk size and ext2/3 stride parameter

2006-11-04 Thread dean gaudet
On Sat, 4 Nov 2006, martin f krafft wrote:

 also sprach dean gaudet [EMAIL PROTECTED] [2006.11.03.2019 +0100]:
   I cannot find authoritative information about the relation between
   the RAID chunk size and the correct stride parameter to use when
   creating an ext2/3 filesystem.
  
  you know, it's interesting -- mkfs.xfs somehow gets the right sunit/swidth 
  automatically from the underlying md device.
 
 i don't know enough about xfs to be able to agree or disagree with
 you on that.
 
  # mdadm --create --level=5 --raid-devices=4 --assume-clean --auto=yes 
  /dev/md0 /dev/sd[abcd]1
  mdadm: array /dev/md0 started.
 
 with 64k chunks i assume...

yup.


  # mkfs.xfs /dev/md0
  meta-data=/dev/md0   isize=256agcount=32, agsize=9157232 
  blks
   =   sectsz=4096  attr=0
  data =   bsize=4096   blocks=293031424, imaxpct=25
   =   sunit=16 swidth=48 blks, unwritten=1
 
 sunit seems like the stride width i determined (64k chunks / 4k
  bsize), but what is swidth? Is it 64 * 3/4 because of the four
 device RAID5?

yup.

and for a raid6 mkfs.xfs correctly gets sunit=16 swidth=32.


  # mdadm --create --level=10 --layout=f2 --raid-devices=4 --assume-clean 
  --auto=yes /dev/md0 /dev/sd[abcd]1
  mdadm: array /dev/md0 started.
  # mkfs.xfs -f /dev/md0
  meta-data=/dev/md0   isize=256agcount=32, agsize=6104816 
  blks
   =   sectsz=512   attr=0
  data =   bsize=4096   blocks=195354112, imaxpct=25
   =   sunit=16 swidth=64 blks, unwritten=1
 
 okay, so as before, 16 stride size and 64 stripe width, because
 we're now dealing with mirrors.
 
  # mdadm --create --level=10 --layout=n2 --raid-devices=4 --assume-clean 
  --auto=yes /dev/md0 /dev/sd[abcd]1
  mdadm: array /dev/md0 started.
  # mkfs.xfs -f /dev/md0
  meta-data=/dev/md0   isize=256agcount=32, agsize=6104816 
  blks
   =   sectsz=512   attr=0
  data =   bsize=4096   blocks=195354112, imaxpct=25
   =   sunit=16 swidth=64 blks, unwritten=1
 
 why not? in this case, -n2 and -f2 aren't any different, are they?

they're different in that with f2 you get essentially 4 disk raid0 read 
performance because the copies of each byte are half a disk away... so it 
looks like a raid0 on the first half of the disks, and another raid0 on 
the second half.

in n2 the two copies are at the same offset... so it looks more like a 2 
disk raid0 for reading and writing.

i'm not 100% certain what xfs uses them for -- you can actually change the 
values at mount time.  so it probably uses them for either read scheduling 
or write layout or both.
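for example xfs will take overrides at mount time -- a sketch only, using 
the 64k-chunk f2 array above; sunit/swidth here are counted in 512-byte 
sectors, so the 64KiB chunk is 128 sectors and the 256KiB stripe is 512:

mount -o sunit=128,swidth=512 /dev/md0 /mnt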

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5/10 chunk size and ext2/3 stride parameter

2006-11-03 Thread dean gaudet
On Tue, 24 Oct 2006, martin f krafft wrote:

 Hi,
 
 I cannot find authoritative information about the relation between
 the RAID chunk size and the correct stride parameter to use when
 creating an ext2/3 filesystem.

you know, it's interesting -- mkfs.xfs somehow gets the right sunit/swidth 
automatically from the underlying md device.

for example, on a box i'm testing:

# mdadm --create --level=5 --raid-devices=4 --assume-clean --auto=yes /dev/md0 
/dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs /dev/md0
meta-data=/dev/md0   isize=256agcount=32, agsize=9157232 
blks
 =   sectsz=4096  attr=0
data =   bsize=4096   blocks=293031424, imaxpct=25
 =   sunit=16 swidth=48 blks, unwritten=1
naming   =version 2  bsize=4096
log  =internal log   bsize=4096   blocks=32768, version=2
 =   sectsz=4096  sunit=1 blks
realtime =none   extsz=196608 blocks=0, rtextents=0

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --zero-superblock /dev/sd[abcd]1
# mdadm --create --level=10 --layout=f2 --raid-devices=4 --assume-clean 
--auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0   isize=256agcount=32, agsize=6104816 blks
 =   sectsz=512   attr=0
data =   bsize=4096   blocks=195354112, imaxpct=25
 =   sunit=16 swidth=64 blks, unwritten=1
naming   =version 2  bsize=4096
log  =internal log   bsize=4096   blocks=32768, version=1
 =   sectsz=512   sunit=0 blks
realtime =none   extsz=262144 blocks=0, rtextents=0


i wonder if the code could be copied into mkfs.ext3?

although hmm, i don't think it gets raid10 n2 correct:

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --zero-superblock /dev/sd[abcd]1
# mdadm --create --level=10 --layout=n2 --raid-devices=4 --assume-clean 
--auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0   isize=256agcount=32, agsize=6104816 blks
 =   sectsz=512   attr=0
data =   bsize=4096   blocks=195354112, imaxpct=25
 =   sunit=16 swidth=64 blks, unwritten=1
naming   =version 2  bsize=4096
log  =internal log   bsize=4096   blocks=32768, version=1
 =   sectsz=512   sunit=0 blks
realtime =none   extsz=262144 blocks=0, rtextents=0


in a near 2 layout i would expect sunit=16, swidth=32 ...  but swidth=64
probably doesn't hurt.
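if it bothered you, mkfs.xfs will take an explicit override -- a sketch, 
assuming the same 4-disk 64k-chunk near-2 array (su = chunk size, sw = 
number of data-bearing units per stripe):

mkfs.xfs -f -d su=64k,sw=2 /dev/md0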


 My understanding is that (block * stride) == (chunk). So if I create
 a default RAID5/10 with 64k chunks, and create a filesystem with 4k
 blocks on it, I should choose stride 64k/4k = 16.

that's how i think it works -- i don't think ext[23] have a concept of stripe
width like xfs does.  they just want to know how to avoid putting all the
critical data on one disk (which needs only the chunk size).  but you should
probably ask on the linux-ext4 mailing list.
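so for the 64k chunk / 4k block case the mke2fs line would presumably look 
something like this (recent e2fsprogs -- older versions spelled it 
"-R stride=16" -- and i haven't measured how much it actually buys you):

mke2fs -j -b 4096 -E stride=16 /dev/md0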

 Is the chunk size of an array equal to the stripe size? Or is it
 (n-1)*chunk size for RAID5 and (n/2)*chunk size for a plain near=2
 RAID10?

 Also, I understand that it makes no sense to use stride for RAID1 as
 there are no stripes in that sense. But for RAID10 it makes sense,
 right?

yep.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md array numbering is messed up

2006-10-30 Thread dean gaudet
On Mon, 30 Oct 2006, Brad Campbell wrote:

 Michael Tokarev wrote:
  My guess is that it's using mdrun shell script - the same as on Debian.
  It's a long story, the thing is quite ugly and messy and does messy things
  too, but they says it's compatibility stuff and continue shipping it.
...
 
 I'd suggest you are probably correct. By default on Ubuntu 6.06
 
 [EMAIL PROTECTED]:~$ cat /etc/init.d/mdadm-raid
 #!/bin/sh
 #
 # Start any arrays which are described in /etc/mdadm/mdadm.conf and which are
 # not running already.
 #
 # Copyright (c) 2001-2004 Mario Jou/3en [EMAIL PROTECTED]
 # Distributable under the terms of the GNU GPL version 2.
 
 MDADM=/sbin/mdadm
 MDRUN=/sbin/mdrun

fwiw mdrun is finally on its way out.  the debian unstable mdadm package 
is full of new goodness (initramfs goodness, 2.5.x mdadm featurefulness, 
monthly full array check goodness).  ubuntu folks should copy it again 
before they finalize edgy.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why partition arrays?

2006-10-24 Thread dean gaudet
On Tue, 24 Oct 2006, Bill Davidsen wrote:

 My read on LVM is that (a) it's one more thing for the admin to learn, (b)
 because it's seldom used the admin will be working from documentation if it
 has a problem, and (c) there is no bug-free software, therefore the use of LVM
 on top of RAID will be less reliable than a RAID-only solution. I can't
 quantify that, the net effect may be too small to measure. However, the cost
 and chance of a finger check from (a) and (b) are significant.

this is essentially why i gave up on LVM as well.

add in the following tidbits:

- snapshots stopped working in 2.6.  may be fixed by now, but i gave up 
hope and this was the biggest feature i desired from LVM.

- it's way better for performance to have only one active filesystem on a 
group of spindles

- you can emulate pvmove with md superblockless raid1 sufficiently well 
for most purposes (although as we've discussed here it would be nice if md 
directly supported proactive replacement)

and more i'm forgetting.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB and raid... Device names change

2006-09-18 Thread dean gaudet
On Tue, 19 Sep 2006, Eduardo Jacob wrote:

 DEVICE /dev/raid111 /dev/raid121
 ARRAY /dev/md0 level=raid1 num-devices=2 
 UUID=1369e13f:eb4fa45c:6d4b9c2a:8196aa1b

try using DEVICE partitions... then mdadm -As /dev/md0 will scan all 
available partitions for raid components with 
UUID=1369e13f:eb4fa45c:6d4b9c2a:8196aa1b.  so it won't matter which sdX 
they are.
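i.e. the whole mdadm.conf could be as small as this (using the UUID from 
your ARRAY line above -- just a sketch, keep whatever else you need):

DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=1369e13f:eb4fa45c:6d4b9c2a:8196aa1b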

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: access *existing* array from knoppix

2006-09-12 Thread dean gaudet
On Tue, 12 Sep 2006, Dexter Filmore wrote:

 Am Dienstag, 12. September 2006 16:08 schrieb Justin Piszcz:
  /dev/MAKEDEV /dev/md0
 
  also make sure the SW raid modules etc are loaded if necessary.
 
 Won't work, MAKEDEV doesn't know how to create [/dev/]md0.

echo 'DEVICE partitions' > /tmp/mdadm.conf
mdadm --detail --scan --config=/tmp/mdadm.conf >> /tmp/mdadm.conf

take a look in /tmp/mdadm.conf ... your root array should be listed.

mdadm --assemble --config=/tmp/mdadm.conf --auto=yes /dev/md0

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: UUID's

2006-09-09 Thread dean gaudet
On Sat, 9 Sep 2006, Richard Scobie wrote:

 To remove all doubt about what is assembled where, I though going to:
 
 DEVICE partitions
 MAILADDR root
 ARRAY /dev/md3 UUID=xyz etc.
 
 would be more secure.
 
 Is this correct thinking on my part?

yup.

mdadm can generate it all for you... there's an example on the man page.  
basically you just want to paste the output of mdadm --detail --scan 
--config=partitions into your mdadm.conf.
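i.e. something like this (the path assumes a debian-ish layout -- back up 
the old file first):

mdadm --detail --scan --config=partitions >> /etc/mdadm/mdadm.conf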

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Care and feeding of RAID?

2006-09-09 Thread dean gaudet
On Tue, 5 Sep 2006, Paul Waldo wrote:

 What about bitmaps?  Nobody has mentioned them.  It is my understanding that
 you just turn them on with mdadm /dev/mdX -b internal.  Any caveats for
 this?

bitmaps have been working great for me on a raid5 and raid1.  it makes it 
that much more tolerable when i accidentally crash the box and don't have 
to wait forever for a resync.
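(for the record, the form i've been using to turn them on is grow mode, 
which works on the running array -- substitute your own array name:

mdadm -Gb internal /dev/mdX
)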

i don't notice the extra write traffic all that much... under heavy 
traffic i see about 3 writes/s to the spare disk in the raid5 -- i assume 
those are all due to the bitmap in the superblock on the spare.

i've considered using an external bitmap, i forget why i didn't do that 
initially.  the filesystem on the raid5 already has an external journal on 
raid1.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: proactive-raid-disk-replacement

2006-09-08 Thread dean gaudet
On Fri, 8 Sep 2006, Michael Tokarev wrote:

 Recently Dean Gaudet, in thread titled 'Feature
 Request/Suggestion - Drive Linking', mentioned his
 document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
 
 I've read it, and have some umm.. concerns.  Here's why:
 
 
  mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
  mdadm /dev/md4 -r /dev/sdh1
  mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
  mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
  mdadm /dev/md4 --re-add /dev/md5
  mdadm /dev/md5 -a /dev/sdh1
 
  ... wait a few hours for md5 resync...
 
 And here's the problem.  While new disk, sdh1, are resynced from
 old, probably failing disk sde1, chances are high that there will
 be an unreadable block on sde1.  And this means the whole thing
 will not work -- md5 initially contained one working drive (sde1)
 and one spare (sdh1) which is being converted (resynced) to working
 disk.  But after read error on sde1, md5 will contain one failed
 drive and one spare -- for raid1 it's fatal combination.
 
 While at the same time, it's perfectly easy to reconstruct this
 failing block from other component devices of md4.

this statement is an argument for native support for this type of activity 
in md itself.

 That to say: this way of replacing disk in a software raid array
 isn't much better than just removing old drive and adding new one.

hmm... i'm not sure i agree.  in your proposal you're guaranteed to have 
no redundancy while you wait for the new disk to sync in the raid5.

in my proposal the probability that you'll retain redundancy through the 
entire process is non-zero.  we can debate how non-zero it is, but 
non-zero is greater than zero.

i'll admit it depends a heck of a lot on how long you wait to replace your 
disks, but i prefer to replace mine well before they get to the point 
where just reading the entire disk is guaranteed to result in problems.


 And if the drive you're replacing is failing (according to SMART
 for example), this method is more likely to fail.

my practice is to run regular SMART long self tests, which tend to find 
Current_Pending_Sectors (which are generally read errors waiting to 
happen) and then launch a repair sync action... that generally drops the 
Current_Pending_Sector back to zero.  either through a realloc or just 
simply rewriting the block.  if it's a realloc then i consider if there's 
enough of them to warrant replacing the disk...
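concretely, the sort of thing i mean (device name and schedule are only 
examples):

smartctl -t long /dev/sdc        # kick off a long self test by hand

or let smartd do the scheduling -- iirc the smartd.conf line is roughly:

/dev/sdc -a -s L/../../7/03      # long self test every sunday at 3am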

so for me the chances of a read error while doing the raid1 thing aren't 
as high as they could be...

but yeah you've convinced me this solution isn't good enough.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: UUID's

2006-09-08 Thread dean gaudet


On Sat, 9 Sep 2006, Richard Scobie wrote:

 If I have specified an array in mdadm.conf using UUID's:
 
 ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371
 
 and I replace a failed drive in the array, will the new drive be given the
 previous UUID, or do I need to upate the mdadm.conf entry?

once you do the mdadm /dev/mdX -a /dev/newdrive the new drive will have 
the UUID.  no need to update the mdadm.conf for the UUID...

however if you're using DEVICE foo where foo is not partitions then 
you should make sure foo includes the new drive.  (DEVICE partitions is 
recommended.)

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Feature Request/Suggestion - Drive Linking

2006-09-05 Thread dean gaudet
On Mon, 4 Sep 2006, Bill Davidsen wrote:

 But I think most of the logic exists, the hardest part would be deciding what
 to do. The existing code looks as if it could be hooked to do this far more
 easily than writing new. In fact, several suggested recovery schemes involve
 stopping the RAID5, replacing the failing drive with a created RAID1, etc. So
 the method is valid, it would just be nice to have it happen without human
 intervention.

you don't actually have to stop the raid5 if you're using bitmaps... you 
can just remove the disk, create a (superblockless) raid1 and put the 
raid1 back in place.

the whole process could be handled a lot like mdadm handles spare groups 
already... there isn't a lot more kernel support required.

the largest problem is if a power failure occurs before the process 
finishes.  i'm 95% certain that even during a reconstruction, raid1 writes 
go to all copies even if the write is beyond the current sync position[1] 
-- so the raid5 superblock would definitely have been written to the 
partial disk... so that means on a reboot there'll be two disks which look 
like they're both the same (valid) component of the raid5, and one of them 
definitely isn't.

maybe there's some trick to handle this situation -- aside from ensuring 
the array won't come up automatically on reboot until after the process 
has finished.

one way to handle it would be to have an option for raid1 resync which 
suppresses writes which are beyond the resync position... then you could 
zero the new disk superblock to start with, and then start up the resync 
-- then it won't have a valid superblock until the entire disk is copied.

-dean

[1] there's normally a really good reason for raid1 to mirror all writes 
even if they're beyond the resync point... consider the case where you 
have a system crash and have 2 essentially idential mirrors which then 
need a resync... and the source disk dies during the resync.

if all writes have been mirrored then the other disk is already useable 
(in fact it's essentially arbitrary which of the mirrors was used for the 
resync source after the crash -- they're all equally (un)likely to have 
the most current data)... without bitmaps this sort of thing is a common 
scenario and certainly saved my data more than once.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID-5 recovery

2006-09-03 Thread dean gaudet
On Sun, 3 Sep 2006, Clive Messer wrote:

 This leads me to a question. I understand from reading the linux-raid 
 archives 
 that the current behaviour when rebuilding with a single badblock on another 
 disk is for that disk to also be kicked from the array.

that's not quite the current behaviour.  since 2.6.14 or .15 or so md will 
reconstruct bad blocks from other disks and try writing them.  it's only 
when this fails repeatedly that it knocks the disk out of the array.

-dean

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize on dirty array?

2006-08-30 Thread dean gaudet
On Sun, 13 Aug 2006, dean gaudet wrote:

 On Fri, 11 Aug 2006, David Rees wrote:
 
  On 8/11/06, dean gaudet [EMAIL PROTECTED] wrote:
   On Fri, 11 Aug 2006, David Rees wrote:
   
On 8/10/06, dean gaudet [EMAIL PROTECTED] wrote:
 - set up smartd to run long self tests once a month.   (stagger it 
 every
   few days so that your disks aren't doing self-tests at the same 
 time)
   
I personally prefer to do a long self-test once a week, a month seems
like a lot of time for something to go wrong.
   
   unfortunately i found some drives (seagate 400 pata) had a rather negative
   effect on performance while doing self-test.
  
  Interesting that you noted negative performance, but I typically
  schedule the tests for off-hours anyway where performance isn't
  critical.
  
  How much of a performance hit did you notice?
 
 i never benchmarked it explicitly.  iirc the problem was generally 
 metadata performance... and became less of an issue when i moved the 
 filesystem log off the raid5 onto a raid1.  unfortunately there aren't 
 really any off hours for this system.

the problem reappeared... so i can provide some data.  one of the 400GB 
seagates has been stuck at 20% of a SMART long self test for over 2 days 
now, and the self-test itself has been going for about 4.5 days total.

a typical iostat -x /dev/sd[cdfgh] 30 sample looks like this:

Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdc       90.94  137.52  14.70  25.76   841.32  1360.35    54.43     0.94  23.30  10.30  41.68
sdd       93.67  140.52  14.96  22.06   863.98  1354.75    59.93     0.91  24.50  12.17  45.05
sdf       92.84  136.85  15.36  26.39   857.85  1360.35    53.13     0.88  21.04  10.59  44.21
sdg       87.74  137.82  14.23  24.86   807.73  1355.55    55.35     0.85  21.86  11.25  43.99
sdh       87.20  134.56  14.96  28.29   810.13  1356.88    50.10     1.90  43.72  20.02  86.60

those 5 are in a raid5, so their io should be relatively even... notice 
the await, svctm and %util of sdh compared to the other 4.  sdh is the one 
with the exceptionally slow going SMART long self-test.  i assume it's 
still making progress because the effect is measurable in iostat.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Feature Request/Suggestion - Drive Linking

2006-08-29 Thread dean gaudet
On Wed, 30 Aug 2006, Neil Bortnak wrote:

 Hi Everybody,
 
 I had this major recovery last week after a hardware failure monkeyed
 things up pretty badly. About half way though I had a couple of ideas
 and I thought I'd suggest/ask them.
 
 1) Drive Linking: So let's say I have a 6 disk RAID5 array and I have
 reason to believe one of the drives will fail (funny noises, SMART
 warnings or it's *really* slow compared to the other drives, etc). It
 would be nice to put in a new drive, link it to the failing disk so that
 it copies all of the data to the new one and mirrors new writes as they
 happen.

http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

works for any raid level actually.


 2) This sort of brings up a subject I'm getting increasingly paranoid
 about. It seems to me that if disk 1 develops a unrecoverable error at
 block 500 and disk 4 develops one at 55,000 I'm going to get a double
 disk failure as soon as one of the bad blocks is read (or some other
 system problem -makes it look like- some random block is
 unrecoverable). Such an error should not bring the whole thing to a
 crashing halt. I know I can recover from that sort of error manually,
 but yuk.

Neil made some improvements in this area as of 2.6.15... when md gets a 
read error it won't knock the entire drive out immediately -- it first 
attempts to reconstruct the sectors from the other drives and write them 
back.  this covers a lot of the failure cases because the drive will 
either successfully complete the write in-place, or use its reallocation 
pool.  the kernel logs when it makes such a correction (but the log wasn't 
very informative until 2.6.18ish i think).

if you watch SMART data (either through smartd logging changes for you, or 
if you diff the output regularly) you can see this activity happen as 
well.

you can also use the check/repair sync_actions to force this to happen 
when you know a disk has a Current_Pending_Sector (i.e. pending read 
error).
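i.e. something like this (2.6.16 or later, substituting the array that 
holds the suspect disk):

echo repair > /sys/block/md0/md/sync_action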

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is mdadm --create safe for existing arrays ?

2006-08-16 Thread dean gaudet
On Wed, 16 Aug 2006, Peter Greis wrote:

 So, how do I change / and /boot to make the super
 blocks persistent ? Is it safe to run mdadm --create
 /dev/md0 --raid-devices=2 --level=1 /dev/sda1
 /dev/sdb1 without loosing any data ?

boot a rescue disk

shrink the filesystems by a few MB to accomodate the superblock

mdadm --create /dev/md0 --raid-devices=2 --level=1 /dev/sda1 missing
mdadm /dev/md0 -a /dev/sdb1

grow the filesystem

you could probably get away with an --assume-clean and no resync if you 
know the array is clean... just don't forget to shrink/grow the 
filesystem.
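the shrink/grow steps would look roughly like this for ext3 -- the size is 
a placeholder, leave yourself a comfortable margin below the partition 
size minus the superblock:

e2fsck -f /dev/sda1                          # required before shrinking
resize2fs /dev/sda1 <size-minus-a-few-MB>
... create the degraded array and add sdb1 as above ...
resize2fs /dev/md0                           # grow back to fill the array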

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize on dirty array?

2006-08-13 Thread dean gaudet
On Fri, 11 Aug 2006, David Rees wrote:

 On 8/11/06, dean gaudet [EMAIL PROTECTED] wrote:
  On Fri, 11 Aug 2006, David Rees wrote:
  
   On 8/10/06, dean gaudet [EMAIL PROTECTED] wrote:
- set up smartd to run long self tests once a month.   (stagger it every
  few days so that your disks aren't doing self-tests at the same time)
  
   I personally prefer to do a long self-test once a week, a month seems
   like a lot of time for something to go wrong.
  
  unfortunately i found some drives (seagate 400 pata) had a rather negative
  effect on performance while doing self-test.
 
 Interesting that you noted negative performance, but I typically
 schedule the tests for off-hours anyway where performance isn't
 critical.
 
 How much of a performance hit did you notice?

i never benchmarked it explicitly.  iirc the problem was generally 
metadata performance... and became less of an issue when i moved the 
filesystem log off the raid5 onto a raid1.  unfortunately there aren't 
really any off hours for this system.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize on dirty array?

2006-08-10 Thread dean gaudet
suggestions:

- set up smartd to run long self tests once a month.   (stagger it every 
  few days so that your disks aren't doing self-tests at the same time)

- run 2.6.15 or later so md supports repairing read errors from the other 
  drives...

- run 2.6.16 or later so you get the check and repair sync_actions in
  /sys/block/mdX/md/sync_action (i think 2.6.16.x still has a bug where
  you have to echo a random word other than repair to sync_action to get
  a repair to start... wrong sense on a strcmp, fixed in 2.6.17).

- run nightly diffs of smartctl -a output on all your drives so you see 
  when one of them reports problems in the smart self test or otherwise
  has a Current_Pending_Sectors or Realloc event... then launch a
  repair sync_action.

- proactively replace your disks every couple years (i prefer to replace 
  busy disks before 3 years).

-dean

On Wed, 9 Aug 2006, James Peverill wrote:

 
 In this case the raid WAS the backup... however it seems it turned out to be
 less reliable than the single disks it was supporting.  In the future I think
 I'll make sure my disks have varying ages so they don't fail all at once.
 
 James
 
   RAID is no excuse for backups.
 PS: ctrlpgup
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Still can't get md arrays that were started from an initrd to shutdown

2006-07-17 Thread dean gaudet
On Mon, 17 Jul 2006, Christian Pernegger wrote:

 The problem seems to affect only arrays that are started via an
 initrd, even if they do not have the root filesystem on them.
 That's all arrays if they're either managed by EVMS or the
 ramdisk-creator is initramfs-tools. For yaird-generated initrds only
 the array with root on it is affected.

with lvm you have to stop lvm before you can stop the arrays... i wouldn't 
be surprised if evms has the same issue... of course this *should* happen 
cleanly on shutdown assuming evms is also being shutdown... but maybe that 
gives you something to look for.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: proactive raid5 disk replacement success (using bitmap + raid1)

2006-06-22 Thread dean gaudet
well that part is optional... i wasn't replacing the disk right away 
anyhow -- it had just exhibited its first surface error during SMART and i 
thought i'd try moving the data elsewhere just for the experience of it.

-dean

On Thu, 22 Jun 2006, Ming Zhang wrote:

 Hi Dean
 
 Thanks a lot for sharing this.
 
 I am not quite understand about these 2 commands. Why we want to add a
 pre-failing disk back to md4?
 
 mdadm --zero-superblock /dev/sde1
 mdadm /dev/md4 -a /dev/sde1
 
 Ming
 
 
 On Sun, 2006-04-23 at 18:40 -0700, dean gaudet wrote:
  i had a disk in a raid5 which i wanted to clone onto the hot spare... 
  without going offline and without long periods without redundancy.  a few 
  folks have discussed using bitmaps and temporary (superblockless) raid1 
  mappings to do this... i'm not sure anyone has tried / reported success 
  though.  this is my success report.
  
  setup info:
  
  - kernel version 2.6.16.9 (as packaged by debian)
  - mdadm version 2.4.1
  - /dev/md4 is the raid5
  - /dev/sde1 is the disk in md4 i want to clone from
  - /dev/sdh1 is the hot spare from md4, and is the clone target
  - /dev/md5 is an unused md device name
  
  here are the exact commands i issued:
  
  mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
  mdadm /dev/md4 -r /dev/sdh1
  mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
  mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
  mdadm /dev/md4 --re-add /dev/md5
  mdadm /dev/md5 -a /dev/sdh1
  
  ... wait a few hours for md5 resync...
  
  mdadm /dev/md4 -f /dev/md5 -r /dev/md5
  mdadm --stop /dev/md5
  mdadm /dev/md4 --re-add /dev/sdh1
  mdadm --zero-superblock /dev/sde1
  mdadm /dev/md4 -a /dev/sde1
  
  this sort of thing shouldn't be hard to script :)
  
  the only times i was without full redundancy was briefly between the -r 
  and --re-add commands... and with bitmap support the raid5 resync for 
  each of those --re-adds was essentially zero.
  
  thanks Neil (and others)!
  
  -dean
  
  p.s. it's absolutely necessary to use --build for the temporary raid1 
  ... if you use --create mdadm will rightfully tell you it's already a raid 
  component and if you --force it then you'll trash the raid5 superblock and 
  it won't fit into the raid5 any more...
  -
  To unsubscribe from this list: send the line unsubscribe linux-raid in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-30 Thread dean gaudet
On Wed, 31 May 2006, Neil Brown wrote:

 On Tuesday May 30, [EMAIL PROTECTED] wrote:
  
  actually i think the rate is higher... i'm not sure why, but klogd doesn't 
  seem to keep up with it:
  
  [EMAIL PROTECTED]:~# grep -c kblockd_schedule_work /var/log/messages
  31
  [EMAIL PROTECTED]:~# dmesg | grep -c kblockd_schedule_work
  8192
 
 # grep 'last message repeated' /var/log/messages
 ??

um hi, of course :)  the paste below is approximately correct.

-dean

[EMAIL PROTECTED]:~# egrep 'kblockd_schedule_work|last message repeated' 
/var/log/messages
May 30 17:05:09 localhost kernel: kblockd_schedule_work failed
May 30 17:05:59 localhost kernel: kblockd_schedule_work failed
May 30 17:08:16 localhost kernel: kblockd_schedule_work failed
May 30 17:10:51 localhost kernel: kblockd_schedule_work failed
May 30 17:11:51 localhost kernel: kblockd_schedule_work failed
May 30 17:12:46 localhost kernel: kblockd_schedule_work failed
May 30 17:12:56 localhost last message repeated 22 times
May 30 17:14:14 localhost kernel: kblockd_schedule_work failed
May 30 17:16:57 localhost kernel: kblockd_schedule_work failed
May 30 17:17:00 localhost last message repeated 83 times
May 30 17:17:02 localhost kernel: kblockd_schedule_work failed
May 30 17:17:33 localhost last message repeated 950 times
May 30 17:18:34 localhost last message repeated 2218 times
May 30 17:19:35 localhost last message repeated 1581 times
May 30 17:20:01 localhost last message repeated 579 times
May 30 17:20:02 localhost kernel: kblockd_schedule_work failed
May 30 17:20:02 localhost kernel: kblockd_schedule_work failed
May 30 17:20:02 localhost kernel: kblockd_schedule_work failed
May 30 17:20:02 localhost last message repeated 23 times
May 30 17:20:03 localhost kernel: kblockd_schedule_work failed
May 30 17:20:34 localhost last message repeated 1058 times
May 30 17:21:35 localhost last message repeated 2171 times
May 30 17:22:36 localhost last message repeated 2305 times
May 30 17:23:37 localhost last message repeated 2311 times
May 30 17:24:38 localhost last message repeated 1993 times
May 30 17:25:01 localhost last message repeated 702 times
May 30 17:25:02 localhost kernel: kblockd_schedule_work failed
May 30 17:25:02 localhost last message repeated 15 times
May 30 17:25:02 localhost kernel: kblockd_schedule_work failed
May 30 17:25:02 localhost last message repeated 12 times
May 30 17:25:03 localhost kernel: kblockd_schedule_work failed
May 30 17:25:34 localhost last message repeated 1061 times
May 30 17:26:35 localhost last message repeated 2009 times
May 30 17:27:36 localhost last message repeated 1941 times
May 30 17:28:37 localhost last message repeated 2345 times
May 30 17:29:38 localhost last message repeated 2367 times
May 30 17:30:01 localhost last message repeated 870 times
May 30 17:30:01 localhost kernel: kblockd_schedule_work failed
May 30 17:30:01 localhost last message repeated 45 times
May 30 17:30:02 localhost kernel: kblockd_schedule_work failed
May 30 17:30:33 localhost last message repeated 1180 times
May 30 17:31:34 localhost last message repeated 2062 times
May 30 17:32:34 localhost last message repeated 2277 times
May 30 17:32:36 localhost kernel: kblockd_schedule_work failed
May 30 17:33:07 localhost last message repeated 1114 times
May 30 17:34:08 localhost last message repeated 2308 times
May 30 17:35:01 localhost last message repeated 1941 times
May 30 17:35:01 localhost kernel: kblockd_schedule_work failed
May 30 17:35:02 localhost last message repeated 20 times
May 30 17:35:02 localhost kernel: kblockd_schedule_work failed
May 30 17:35:33 localhost last message repeated 1051 times
May 30 17:36:34 localhost last message repeated 2002 times
May 30 17:37:35 localhost last message repeated 1644 times
May 30 17:38:36 localhost last message repeated 1731 times
May 30 17:39:37 localhost last message repeated 1844 times
May 30 17:40:01 localhost last message repeated 817 times
May 30 17:40:02 localhost kernel: kblockd_schedule_work failed
May 30 17:40:02 localhost last message repeated 39 times
May 30 17:40:02 localhost kernel: kblockd_schedule_work failed
May 30 17:40:02 localhost last message repeated 12 times
May 30 17:40:03 localhost kernel: kblockd_schedule_work failed
May 30 17:40:34 localhost last message repeated 1051 times
May 30 17:41:35 localhost last message repeated 1576 times
May 30 17:42:36 localhost last message repeated 2000 times
May 30 17:43:37 localhost last message repeated 2058 times
May 30 17:44:15 localhost last message repeated 1337 times
May 30 17:44:15 localhost kernel: kblockd_schedule_work failed
May 30 17:44:46 localhost last message repeated 1016 times
May 30 17:45:01 localhost last message repeated 432 times
May 30 17:45:02 localhost kernel: kblockd_schedule_work failed
May 30 17:45:02 localhost kernel: kblockd_schedule_work failed
May 30 17:45:33 localhost last message repeated 1229 times
May 30 17:46:34 localhost last message repeated 2552 times
May 30 17:47:36 localhost last message repeated 

Re: raid5 hang on get_active_stripe

2006-05-29 Thread dean gaudet
On Sun, 28 May 2006, Neil Brown wrote:

 The following patch adds some more tracing to raid5, and might fix a
 subtle bug in ll_rw_blk, though it is an incredibly long shot that
 this could be affecting raid5 (if it is, I'll have to assume there is
 another bug somewhere).   It certainly doesn't break ll_rw_blk.
 Whether it actually fixes something I'm not sure.
 
 If you could try with these on top of the previous patches I'd really
 appreciate it.
 
 When you read from /stripe_cache_active, it should trigger a
 (cryptic) kernel message within the next 15 seconds.  If I could get
 the contents of that file and the kernel messages, that should help.

got the hang again... attached is the dmesg with the cryptic messages.  i 
didn't think to grab the task dump this time though.

hope there's a clue in this one :)  but send me another patch if you need 
more data.

-dean

neemlark:/sys/block/md4/md# cat stripe_cache_size 
256
neemlark:/sys/block/md4/md# cat stripe_cache_active 
251
0 preread
plugged
bitlist=0 delaylist=251
neemlark:/sys/block/md4/md# cat stripe_cache_active 
251
0 preread
plugged
bitlist=0 delaylist=251
neemlark:/sys/block/md4/md# echo 512 > stripe_cache_size 
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
292 preread
not plugged
bitlist=0 delaylist=32
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
292 preread
not plugged
bitlist=0 delaylist=32
neemlark:/sys/block/md4/md# cat stripe_cache_active
445
0 preread
not plugged
bitlist=0 delaylist=73
neemlark:/sys/block/md4/md# cat stripe_cache_active
480
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
413
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
13
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
493
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
487
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
405
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
1 preread
not plugged
bitlist=0 delaylist=28
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
84 preread
not plugged
bitlist=0 delaylist=69
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
69 preread
not plugged
bitlist=0 delaylist=56
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
41 preread
not plugged
bitlist=0 delaylist=38
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
10 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
453
3 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
480
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
14 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
477
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
476
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
486
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
480
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
384
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
387
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
462
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
480
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
448
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
501
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
476
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
416
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
386
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
434
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
406
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
447
0 preread
not plugged
bitlist=0 

Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)

2006-05-28 Thread dean gaudet
On Sun, 28 May 2006, Luca Berra wrote:

 - mdadm-2.5-rand.patch
 Posix dictates rand() versus bsd random() function, and dietlibc
 deprecated random(), so switch to srand()/rand() and make everybody
 happy.

fwiw... lots of rand()s tend to suck... and RAND_MAX may not be large 
enough for this use.  glibc rand() is the same as random().  do you know 
if dietlibc's rand() is good enough?

-dean


Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)

2006-05-28 Thread dean gaudet
On Sun, 28 May 2006, Luca Berra wrote:

 dietlibc rand() and random() are the same function.
 but random will throw a warning saying it is deprecated.

that's terribly obnoxious... it's never going to be deprecated, there are 
only approximately a bazillion programs using random().

-dean


Re: raid5 hang on get_active_stripe

2006-05-26 Thread dean gaudet
On Tue, 23 May 2006, Neil Brown wrote:

 I've spent all morning looking at this and while I cannot see what is
 happening I did find a couple of small bugs, so that is good...
 
 I've attached three patches.  The first fix two small bugs (I think).
 The last adds some extra information to
   /sys/block/mdX/md/stripe_cache_active
 
 They are against 2.6.16.11.
 
 If you could apply them and if the problem recurs, report the content
 of stripe_cache_active several times before and after changing it,
 just like you did last time, that might help throw some light on the
 situation.

i applied them against 2.6.16.18 and two days later i got my first hang... 
below is the stripe_cache foo.

thanks
-dean

neemlark:~# cd /sys/block/md4/md/
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_size 
256
neemlark:/sys/block/md4/md# echo 512 > stripe_cache_size
neemlark:/sys/block/md4/md# cat stripe_cache_active
474
187 preread
bitlist=0 delaylist=222
neemlark:/sys/block/md4/md# cat stripe_cache_active
438
222 preread
bitlist=0 delaylist=72
neemlark:/sys/block/md4/md# cat stripe_cache_active
438
222 preread
bitlist=0 delaylist=72
neemlark:/sys/block/md4/md# cat stripe_cache_active
469
222 preread
bitlist=0 delaylist=72
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
72 preread
bitlist=160 delaylist=103
neemlark:/sys/block/md4/md# cat stripe_cache_active
1
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
2
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
0
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
2
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# 

md4 : active raid5 sdd1[0] sde1[5](S) sdh1[4] sdg1[3] sdf1[2] sdc1[1]
  1562834944 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU]
  bitmap: 10/187 pages [40KB], 1024KB chunk


Re: raid5 hang on get_active_stripe

2006-05-26 Thread dean gaudet
On Sat, 27 May 2006, Neil Brown wrote:

 On Friday May 26, [EMAIL PROTECTED] wrote:
  On Tue, 23 May 2006, Neil Brown wrote:
  
  i applied them against 2.6.16.18 and two days later i got my first hang... 
  below is the stripe_cache foo.
  
  thanks
  -dean
  
  neemlark:~# cd /sys/block/md4/md/
  neemlark:/sys/block/md4/md# cat stripe_cache_active 
  255
  0 preread
  bitlist=0 delaylist=255
  neemlark:/sys/block/md4/md# cat stripe_cache_active 
  255
  0 preread
  bitlist=0 delaylist=255
  neemlark:/sys/block/md4/md# cat stripe_cache_active 
  255
  0 preread
  bitlist=0 delaylist=255
 
 Thanks.  This narrows it down quite a bit... too much in fact:  I can
 now say for sure that this cannot possibly happen :-)

heheh.  fwiw the box has traditionally been rock solid.. it's ancient 
though... dual p3 750 w/440bx chipset and pc100 ecc memory... 3ware 7508 
w/seagate 400GB disks... i really don't suspect the hardware all that much 
because the freeze seems to be rather consistent as to time of day 
(overnight while i've got 3x rdiff-backup, plus bittorrent, plus updatedb 
going).  unfortunately it doesn't happen every time... but every time i've 
unstuck the box i've noticed those processes going.

other tidbits... md4 is a lvm2 PV ... there are two LVs, one with ext3
and one with xfs.


 Two things that might be helpful:
   1/ Do you have any other patches on 2.6.16.18 other than the 3 I
 sent you?  If you do I'd like to see them, just in case.

it was just 2.6.16.18 plus the 3 you sent... i attached the .config
(it's rather full -- based off debian kernel .config).

maybe there's a compiler bug:

gcc version 4.0.4 20060507 (prerelease) (Debian 4.0.3-3)


   2/ The message.gz you sent earlier with the
   echo t > /proc/sysrq-trigger
  trace in it didn't contain information about md4_raid5 - the 
  controlling thread for that array.  It must have missed out
  due to a buffer overflowing.  Next time it happens, could you
  try to get this trace again and see if you can find out what
  md4_raid5 is doing.  Maybe do the 'echo t' several times.
  I think that you need a kernel recompile to make the dmesg
  buffer larger.

ok i'll set CONFIG_LOG_BUF_SHIFT=18 and rebuild ...

note that i'm going to include two more patches in this next kernel:

http://lkml.org/lkml/2006/5/23/42
http://arctic.org/~dean/patches/linux-2.6.16.5-no-treason.patch

the first was the Jens Axboe patch you mentioned here recently (for
accounting with i/o barriers)... and the second gets rid of the tcp
treason uncloaked messages.


 Thanks for your patience - this must be very frustrating for you.

fortunately i'm the primary user of this box... and the bug doesn't
corrupt anything... and i can unstick it easily :)  so it's not all that
frustrating actually.

-dean

config.gz
Description: Binary data


Re: raid5 hang on get_active_stripe

2006-05-17 Thread dean gaudet
On Thu, 11 May 2006, dean gaudet wrote:

 On Tue, 14 Mar 2006, Neil Brown wrote:
 
  On Monday March 13, [EMAIL PROTECTED] wrote:
   I just experienced some kind of lockup accessing my 8-drive raid5
   (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but
   now processes that try to read the md device hang. ps tells me they are
   all sleeping in get_active_stripe. There is nothing in the syslog, and I
   can read from the individual drives fine with dd. mdadm says the state
   is active.
...
 
 i seem to be running into this as well... it has happened several times 
 in the past three weeks.  i attached the kernel log output...

it happened again...  same system as before...


  You could try increasing the size of the stripe cache
echo 512 > /sys/block/mdX/md/stripe_cache_size
  (choose an appropriate 'X').
 
 yeah that got things going again -- it took a minute or so maybe, i
 wasn't paying attention as to how fast things cleared up.

i tried 768 this time and it wasn't enough... 1024 did it again...

 
  Maybe check the content of
   /sys/block/mdX/md/stripe_cache_active
  as well.
 
 next time i'll check this before i increase stripe_cache_size... it's
 0 now, but the raid5 is working again...

here's a sequence of things i did... not sure if it helps:

# cat /sys/block/md4/md/stripe_cache_active
435
# cat /sys/block/md4/md/stripe_cache_size
512
# echo 768 > /sys/block/md4/md/stripe_cache_size
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# echo 1024 > /sys/block/md4/md/stripe_cache_size
# cat /sys/block/md4/md/stripe_cache_active
927
# cat /sys/block/md4/md/stripe_cache_active
151
# cat /sys/block/md4/md/stripe_cache_active
66
# cat /sys/block/md4/md/stripe_cache_active
2
# cat /sys/block/md4/md/stripe_cache_active
1
# cat /sys/block/md4/md/stripe_cache_active
0
# cat /sys/block/md4/md/stripe_cache_active
3

and it's OK again... except i'm going to lower the stripe_cache_size to
256 again because i'm not sure i want to keep having to double it each
freeze :)

let me know if you want the task dump output from this one too.

-dean


proactive raid5 disk replacement success (using bitmap + raid1)

2006-04-23 Thread dean gaudet
i had a disk in a raid5 which i wanted to clone onto the hot spare... 
without going offline and without long periods without redundancy.  a few 
folks have discussed using bitmaps and temporary (superblockless) raid1 
mappings to do this... i'm not sure anyone has tried / reported success 
though.  this is my success report.

setup info:

- kernel version 2.6.16.9 (as packaged by debian)
- mdadm version 2.4.1
- /dev/md4 is the raid5
- /dev/sde1 is the disk in md4 i want to clone from
- /dev/sdh1 is the hot spare from md4, and is the clone target
- /dev/md5 is an unused md device name

here are the exact commands i issued:

mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
mdadm /dev/md4 -r /dev/sdh1
mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
mdadm /dev/md4 --re-add /dev/md5
mdadm /dev/md5 -a /dev/sdh1

... wait a few hours for md5 resync...

mdadm /dev/md4 -f /dev/md5 -r /dev/md5
mdadm --stop /dev/md5
mdadm /dev/md4 --re-add /dev/sdh1
mdadm --zero-superblock /dev/sde1
mdadm /dev/md4 -a /dev/sde1

this sort of thing shouldn't be hard to script :)
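
for example, a rough (untested) sketch of such a script -- the argument
handling and device names here are hypothetical, so sanity-check every
step before running anything like it:

#!/bin/sh
# usage: clone-component.sh <array> <source-disk> <spare-disk> <tmp-md>
# e.g.:  clone-component.sh /dev/md4 /dev/sde1 /dev/sdh1 /dev/md5
ARRAY=$1; SRC=$2; SPARE=$3; TMP=$4
mdadm -Gb internal --bitmap-chunk=1024 $ARRAY
mdadm $ARRAY -r $SPARE
mdadm $ARRAY -f $SRC -r $SRC
mdadm --build $TMP -ayes --level=1 --raid-devices=2 $SRC missing
mdadm $ARRAY --re-add $TMP
mdadm $TMP -a $SPARE
# crude wait for the raid1 rebuild to finish (matches any array rebuilding)
while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 60; done
mdadm $ARRAY -f $TMP -r $TMP
mdadm --stop $TMP
mdadm $ARRAY --re-add $SPARE
mdadm --zero-superblock $SRC
mdadm $ARRAY -a $SRC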

the only times i was without full redundancy was briefly between the -r 
and --re-add commands... and with bitmap support the raid5 resync for 
each of those --re-adds was essentially zero.

thanks Neil (and others)!

-dean

p.s. it's absolutely necessary to use --build for the temporary raid1 
... if you use --create mdadm will rightfully tell you it's already a raid 
component and if you --force it then you'll trash the raid5 superblock and 
it won't fit into the raid5 any more...


Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11

2006-04-11 Thread dean gaudet


On Mon, 10 Apr 2006, Marc L. de Bruin wrote:

 dean gaudet wrote:
  On Mon, 10 Apr 2006, Marc L. de Bruin wrote:
  
   However, all preferred minors are correct, meaning that the output is in
   sync with what I expected it to be from /etc/mdadm/mdadm.conf.
   
   Any other ideas? Just adding /etc/mdadm/mdadm.conf to the initrd does not
   seem
   to work, since mdrun seems to ignore it?!
 
  it seems to me mdrun /dev is about the worst thing possible to use in an
  initrd.
 
 :-)
 
 I guess I'll have to change to yaird asap then. I can't think of any other
 solid solution...

yeah i've been using yaird... it's not perfect -- take a look at 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=351183 for a patch i 
use to improve the ability of a yaird initrd booting when you've moved 
devices or a device has failed.

-dean


forcing a read on a known bad block

2006-04-11 Thread dean gaudet
hey Neil...

i've been wanting to test out the reconstruct-on-read-error code... and 
i've had two chances to do so, but haven't been able to force md to read the 
appropriate block to trigger the code.

i had two disks with SMART Current_Pending_Sector > 0 (which indicates 
pending read error) and i did SMART long self-tests to find out where the 
bad block was (it should show the LBA in the SMART error log)...

one disk was in a raid1 -- and so it was kind of random which of the two 
disks would be read from if i tried to seek to that LBA and read... in 
theory with O_DIRECT i should have been able to randomly get the right 
disk, but that seems a bit clunky.  unfortunately i didn't think of the 
O_DIRECT trick until after i'd given up and decided to just resync the 
whole disk proactively.

the other disk was in a raid5 ... 5 disk raid5, so 20% chance of the bad 
block being in parity.  i copied the kernel code to be sure, and sure 
enough the bad block was in parity... just bad luck :)  so i can't force a 
read there any way that i know of...

anyhow this made me wonder if there's some other existing trick to force 
such reads/reconstructions to occur... or perhaps this might be a useful 
future feature.

on the raid5 disk i actually tried reading the LBA directly from the 
component device and it didn't trigger the read error, so now i'm a bit 
skeptical of the SMART log and/or my computation of the seek offset in the 
partition... but the above question is still interesting.
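
(reading the LBA directly from the component amounts to something like
this -- the skip value is just a placeholder, and the SMART LBA is
relative to the whole disk, so the partition's start sector has to be
subtracted first:

dd if=/dev/sde1 of=/dev/null bs=512 skip=123456789 count=8
)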

-dean


Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11

2006-04-10 Thread dean gaudet
On Mon, 10 Apr 2006, Marc L. de Bruin wrote:

 dean gaudet wrote:
 
  initramfs-tools generates an mdrun /dev which starts all the raids it can
  find... but does not include the mdadm.conf in the initrd so i'm not sure it
  will necessarily start them in the right minor devices.  try doing an mdadm
  --examine /dev/xxx on some of your partitions to see if the preferred
  minor is what you expect it to be...
   
 [EMAIL PROTECTED]:~# sudo mdadm --examine /dev/md[01234]

try running it on /dev/sda1 or whatever the component devices are for your 
array... not on the array devices.

-dean


Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11

2006-04-10 Thread dean gaudet
On Mon, 10 Apr 2006, Marc L. de Bruin wrote:

 However, all preferred minors are correct, meaning that the output is in
 sync with what I expected it to be from /etc/mdadm/mdadm.conf.
 
 Any other ideas? Just adding /etc/mdadm/mdadm.conf to the initrd does not seem
 to work, since mdrun seems to ignore it?!

yeah it looks like mdrun /dev just seems to assign things in the order 
they're discovered without consulting the preferred minor.

it seems to me mdrun /dev is about the worst thing possible to use in an 
initrd.

i opened a bug yesterday 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=361674 ... it seems 
really they should stop using mdrun entirely... when i get a chance i'll 
try updating the bug (or go ahead and add your own experiences to it).

oh hey take a look at this bug for debian mdadm package 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=354705 ... he intends 
to deprecate mdrun.

-dean


Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11

2006-04-09 Thread dean gaudet
On Sun, 9 Apr 2006, Marc L. de Bruin wrote:

...
 Okay, just pressing Control-D continues the boot process and AFAIK the root
 filesystemen actually isn't corrupt. Running e2fsck returns no errors and
 booting 2.6.11 works just fine, but I have no clue why it picked the wrong
 partitions to build md[01234].
 
 What could have happened here?

i didn't know sarge had 2.6.11 or 2.6.15 packages... but i'm going to 
assume you've installed one of initramfs-tools or yaird in order to use 
the unstable 2.6.11 or 2.6.15 packages... so my comments might not apply.

initramfs-tools generates an mdrun /dev which starts all the raids it 
can find... but does not include the mdadm.conf in the initrd so i'm not 
sure it will necessarily start them in the right minor devices.  try doing 
an mdadm --examine /dev/xxx on some of your partitions to see if the 
preferred minor is what you expect it to be...

if the preferred minors are wrong there's some mdadm incantation to update 
them... see the man page.

or switch to yaird (you'll have to install yaird and purge 
initramfs-tools) and dpkg-reconfigure your kernel packages to cause the 
initrds to be rebuilt.  yaird starts only the raid required for the root 
filesystem, and specifies the correct minor for it.  then later after the 
initrd /etc/init.d/mdadm-raid will start the rest of your raids using your 
mdadm.conf.

-dean


Re: raid5 high cpu usage during reads - oprofile results

2006-04-01 Thread dean gaudet
On Sat, 1 Apr 2006, Alex Izvorski wrote:

 Dean - I think I see what you mean, you're looking at this line in the
 assembly?
 
 65830 16.8830 : c1f:   cmp    %rcx,0x28(%rax)

yup that's the one... that's probably a fair number of cache (or tlb) 
misses going on right there.


 I looked at the hash stuff, I think the problem is not that the hash
 function is poor, but rather that the number of entries in all buckets
 gets to be pretty high.

yeah... your analysis seems more likely.

i suppose increasing the number of buckets is the only option.  it looks 
to me like you'd just need to change NR_HASH and the kzalloc in run() in 
order to increase the number of buckets.

i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause 
otherwise it'd be cool to run with STRIPE_SIZE the same as your raid 
chunksize... which would decrease the number of entries -- much more 
desirable than increasing the number of buckets.

-dean


Re: raid5 that used parity for reads only when degraded

2006-03-24 Thread dean gaudet
On Thu, 23 Mar 2006, Alex Izvorski wrote:

 Also the cpu load is measured with Andrew Morton's cyclesoak
 tool which I believe to be quite accurate.

there's something cyclesoak does which i'm not sure i agree with: 
the cyclesoak process dirties an array of about 1MB... so what you're 
really getting is some sort of composite measurement of memory system 
utilisation and cpu cycle availability.

i think that 1MB number was chosen before 1MiB caches were common... and 
what you get during calibration is a L2 cache-hot loop, but i'm not sure 
that's an important number.

i'd look at what happens if you increase cyclesoak.c busyloop_size to 8MB 
... and decrease it to 128.  the two extremes are going to weight the cpu 
load towards measuring available memory system bandwidth and available 
cpu cycles.

also for calibration consider using a larger -p n ... especially if 
you've got any cpufreq/powernowd setup which is varying your clock 
rates... you want to be sure that it's calibrated (and measured) at a 
fixed clock rate.

-dean


Re: naming of md devices

2006-03-22 Thread dean gaudet
On Thu, 23 Mar 2006, Nix wrote:

 Last I heard the Debian initramfs constructs RAID arrays by explicitly
 specifying the devices that make them up. This is, um, a bad idea:
 the first time a disk fails or your kernel renumbers them you're
 in *trouble*.

yaird seems to dtrt ... at least in unstable.  if you install yaird 
instead of initramfs-tools you get stuff like this in the initrd /init:

mknod /dev/md3 b 9 3
mdadm -Ac partitions /dev/md3 --uuid 2b3a5b77:c7b4ab81:a2b8322a:db5c4e88

initramfs-tools also appears to do something which should work... but i 
haven't tested it... it basically runs mdrun /dev without specifying a 
minor/uuid for the root, so it'll start all arrays... i'm afraid that 
might mess up for one of my arrays which is auto=mdp... and has the 
annoying property of starting arrays on disks you've moved from other 
systems.

so anyhow i lean towards yaird at the moment... (and i should submit some 
bug reports i guess).

the above is on unstable... i don't use stable (and stable definitely does 
the wrong thing -- 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=338200).

-dean


Re: how to clone a disk

2006-03-11 Thread dean gaudet
On Sat, 11 Mar 2006, Ming Zhang wrote:

 On Sat, 2006-03-11 at 06:53 -0500, Paul M. wrote:
  Since its raid5 you would be fine just pulling the disk out and
  letting the raid driver rebuild the array. If you have a hot spare
 
 yes, rebuilding is the simplest way. but rebuild will need to read all
 other disks and write to the new disk. when serving some io at same
 time, the rebuilding speed is not much,
 
 but if i do a dd clone and plug it back. the total traffic is copy one
 disk which can be done very fast as a fully sequential workload. with
 that bitmap feature, the rsync work after plugging back is minor.
 
 so the one disk fail window is pretty small here. right?

you're planning to do this while the array is online?  that's not safe... 
unless it's a read-only array...

if you've got a bitmap then one thing you *could* do is stop the array 
temporarily, and copy the bitmap first, then restart the array... then 
copy the rest of the disk minus the bitmap.

you basically need an atomic copy of the bitmap from before you start the 
ddrescue... and you need to use that copy of the bitmap when you 
reassemble the array with the new disk.

or you could stop the raid5, and make a raid1 (legacy style, without raid 
superblock) of the dying disk and the new disk... then reassemble the 
raid5 using the raid1 for the one component... then restart the raid5.

regardless of which method you use you're going to need to take the array 
offline at least once to reassemble it with the duplicated disk in place 
of the dying disk...

i think i'd be tempted to do the raid1 method ... because that one 
requires you go offline at most once -- after the raid1 syncs you can fail 
out the dying drive and leave the raid1 around degraded until some 
future system maintenance event where you can reassemble without it.  (a 
reboot would automagically make it disappear too -- because it wouldn't 
have a raid1 superblock anyhow).
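
a minimal sketch of that raid1 trick, with invented device names
(/dev/md0 = the raid5, /dev/sdd1 = the dying disk, /dev/sde1 = its
clone target, /dev/md9 = the temporary superblockless raid1):

mdadm --stop /dev/md0
mdadm --build /dev/md9 -ayes --level=1 --raid-devices=2 /dev/sdd1 missing
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/md9
mdadm /dev/md9 -a /dev/sde1
# ...once the raid1 finishes syncing, drop the dying disk:
mdadm /dev/md9 -f /dev/sdd1 -r /dev/sdd1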

-dean


Re: how to clone a disk

2006-03-11 Thread dean gaudet
On Sat, 11 Mar 2006, Ming Zhang wrote:

 On Sat, 2006-03-11 at 16:15 -0800, dean gaudet wrote:
 
  you're planning to do this while the array is online?  that's not safe... 
  unless it's a read-only array...
 
 what i plan to do is to pull out the disk (which is ok now but going to
 die), so raid5 will degrade with 1 disk fail and no spare disk here,
 then do ddresue to a new disk which will have same uuid and everything,
 then put it back, then bitmap will shine here right?
 
 so raid5 is still online while that disk is not part of raid5 now. and
 no diskio on it at all. so do not think i need an atomic operation here.

if you fail the disk from the array, or boot without the failing disk, 
then the event counter in the other superblocks will be updated... and the 
removed/failed disk will no longer be considered an up to date 
component... so after doing the ddrescue you'd need to reassemble the 
raid5.  i'm not sure you can convince md to use the bitmap in this case -- 
i'm just not familiar enough with it.
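
(you can see the counters i'm talking about with something like the
following -- device names are just examples:

mdadm --examine /dev/sdb1 | grep Events
mdadm --examine /dev/sde1 | grep Events

if they differ, md won't treat the copied disk as current without a
reassemble.)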

 this raid5 over raid1 way sounds interesting. worthy trying.

let us know how it goes :)  i've considered doing this a few times 
myself... but i've been too conservative and just taken the system down to 
single user to do the ddrescue with the raid offline entirely.

-dean


Re: how to clone a disk

2006-03-11 Thread dean gaudet
On Sat, 11 Mar 2006, Ming Zhang wrote:

 On Sat, 2006-03-11 at 16:31 -0800, dean gaudet wrote:
  if you fail the disk from the array, or boot without the failing disk, 
  then the event counter in the other superblocks will be updated... and the 
  removed/failed disk will no longer be considered an up to date 
  component... so after doing the ddrescue you'd need to reassemble the 
  raid5.  i'm not sure you can convince md to use the bitmap in this case -- 
  i'm just not familiar enough with it.
 
 i am little confused here. then what the purpose of that bitmap for? is
 not that bitmap is for a component temporarily out of place and thus out
 of sync a bit?

hmm... yeah i suppose that is the purpose of the bitmap... i haven't used 
bitmaps yet though... so i don't know which types of events they protect 
against.  in theory what you want to do sounds like it should work though, 
but i'd experiment somewhere safe first.

-dean


Re: Auto-assembling arrays using mdadm

2006-03-09 Thread dean gaudet
On Thu, 9 Mar 2006, Sean Puttergill wrote:

 This is the kind of functionality provided by kernel
 RAID autodetect.  You don't have to have any config
 information provided in advance.  The kernel finds and
 assembles all arrays on disks with RAID autodetect
 partition type.  I want to do the same thing, but with
 mdadm.

you know i've been bitten by that several times... the kernel sees the 
autodetect partition type of a disk which used to belong in another box 
(which for various reasons i've not yet been able to zero the old raid 
superblock)... and brings up a raid minor which conflicts with another 
array already present in the system... which causes device renaming to 
occur and can mess up the boot.  (although if i'm using UUIDs or labels 
for mounting the filesystems it can almost work -- there are still a few 
cases where those don't help... such as xfs external log partition.)

i suppose what i suggested doesn't do what you want... but i prefer not 
using kernel autoassembly these days because of the above problem.

the 1.12 man page has more examples which can help you...

echo DEVICE partitions > tmp.mdadm.conf
mdadm --detail --scan --config=tmp.mdadm.conf >> tmp.mdadm.conf
mdadm --assemble --scan --config=tmp.mdadm.conf

i think that'll do what you want...

-dean


Re: NVRAM support

2006-02-10 Thread dean gaudet
On Fri, 10 Feb 2006, Bill Davidsen wrote:

 Erik Mouw wrote:
 
  You could use it for an external journal, or you could use it as a swap
  device.
   
 
 Let me concur, I used external journal on SSD a decade ago with jfs (AIX). If
 you do a lot of operations which generate journal entries, file create,
 delete, etc, then it will double your performance in some cases. Otherwise it
 really doesn't help much, use as a swap device might be more helpful depending
 on your config.

it doesn't seem to make any sense at all to use a non-volatile external 
memory for swap... swap has no purpose past a power outage.

-dean


Re: Raid5 Debian Yaird Woes

2006-02-06 Thread dean gaudet
On Sun, 5 Feb 2006, Lewis Shobbrook wrote:

 On Saturday 04 February 2006 11:22 am, you wrote:
  On Sat, 4 Feb 2006, Lewis Shobbrook wrote:
   Is there any way to avoid this requirement for input, so that the system
   skips the missing drive as the raid/initrd system did previously?
 
  what boot errors are you getting before it drops you to the root password
  prompt?
 
 Basically it just states waiting X seconds for /dev/sdx3 (corresponding to 
 the 
 missing raid5 member). Where X cycles from 2,4,8,16 and then drops you into a 
 recovery console, no root pwd prompt.
 It will only occur if the partition is completely missing, such as a 
 replacement disk with a blank partition table, or a completely missing/failed 
 drive.
  is it trying to fsck some filesystem it doesn't have access to?
 
 No fsck seen for bad extX partitions etc.

try something like this...

cd /tmp
mkdir t
cd t
zcat /boot/initrd.img-`uname -r` | cpio -i
grep -r sd.3 .

that should show us what script is directly accessing /dev/sdx3 ... maybe 
there's something more we can do about it.

i did find a possible deficiency with the patch i posted... looking more 
closely at my yaird /init i see this:

mkbdev '/dev/sdb' 'sdb'
mkbdev '/dev/sdb4' 'sdb/sdb4'
mkbdev '/dev/sda' 'sda'
mkbdev '/dev/sda4' 'sda/sda4'

and i think that means that mdadm -Ac partitions will fail if one of my 
root disks ends up somewhere other than sda or sdb... because the device 
nodes won't exist.

i suspect i should update the patch to use mdrun instead of mdadm -Ac 
partitions... because mdrun will create temporary device nodes for 
everything in /proc/partitions in order to find all the possible raid 
pieces.

-dean


Re: Raid5 Debian Yaird Woes

2006-02-03 Thread dean gaudet
On Sat, 4 Feb 2006, Lewis Shobbrook wrote:

 Is there any way to avoid this requirement for input, so that the system 
 skips 
 the missing drive as the raid/initrd system did previously?  

what boot errors are you getting before it drops you to the root password 
prompt?

is it trying to fsck some filesystem it doesn't have access to?

-dean


Re: Raid5 Debian Yaird Woes

2006-02-02 Thread dean gaudet
i've never looked at yaird in detail -- but you can probably use 
initramfs-tools instead of yaird... the deb 2.6.14 and later kernels will 
use whichever one of those is installed.  i know that initramfs-tools uses 
mdrun to start the root partition based on its UUID -- and so it should 
work fine (to get root mounted) even without dorking around with 
mdadm.conf.

but if you want to stick with yaird:

On Fri, 3 Feb 2006, Lewis Shobbrook wrote:

 My mdadm.conf (I never needed to use at all previous to the yaird system) is
 as follows...
 ARRAY /dev/md0 level=raid1 num-devices=3 devices=/dev/sda2,/dev/sdb2,/dev/sdc2
 auto=yes
 ARRAY /dev/md1 level=raid5 num-devices=3 auto=yes
 UUID=a3452240:a1578a31:737679af:58f53690
 DEVICE partitions

some wrapping occurred there i'm guessing...

you might be a lot happier if your /dev/md0 also specified the UUID rather 
than the individual devices.  this is probably the source of your 
troubles.

you can get the UUID by doing mdadm --examine /dev/sda2.

or you can try:  mdadm --examine --scan --brief ... just prepend DEVICE 
partitions in front of that and you should be happy.
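
for example the whole file could end up looking something like this
(the md0 UUID below is a placeholder -- use whatever --examine reports):

DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=3 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md1 level=raid5 num-devices=3 UUID=a3452240:a1578a31:737679af:58f53690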

-dean


Re: Raid5 Debian Yaird Woes

2006-02-02 Thread dean gaudet
On Thu, 2 Feb 2006, dean gaudet wrote:

 i've never looked at yaird in detail -- but you can probably use 
 initramfs-tools instead of yaird... 

i take it all back... i just tried initramfs-tools and it failed to boot 
my system properly... whereas yaird almost got everything right.

the main thing i'd say yaird is doing wrong is that it is specifying the 
root raid devices explicitly rather than allowing mdadm to scan the 
partitions list and assemble by UUID...

maybe try the patch below on your yaird configuration and then run:

dpkg-reconfigure linux-image-`uname -r`

which will rebuild your initrd with this change... then see if it survives 
your boot testing.

-dean

p.s. this patch has been submitted to debian bugdb...

--- /etc/yaird/Templates.cfg    2006/02/03 02:44:49    1.1
+++ /etc/yaird/Templates.cfg    2006/02/03 02:46:15
@@ -299,8 +299,7 @@
     SCRIPT /init
         BEGIN
         !mknod <TMPL_VAR NAME=target> b <TMPL_VAR NAME=major> <TMPL_VAR NAME=minor>
-        !mdadm --assemble <TMPL_VAR NAME=target> --uuid <TMPL_VAR NAME=uuid> \
-        !       <TMPL_LOOP NAME=components> <TMPL_VAR NAME=dev></TMPL_LOOP>
+        !mdadm -Ac partitions <TMPL_VAR NAME=target> --uuid <TMPL_VAR NAME=uuid>
         END SCRIPT
     END TEMPLATE
 


Re: Configuring combination of RAID-1 RAID-5

2006-02-01 Thread dean gaudet
On Tue, 31 Jan 2006, Enrique Garcia Briones wrote:

 I have read the setting-up for the raid-5 and 1, but I would like to ask you 
 if I can set-up a combined RAID configuration as mentioned above, since all 
 the examples I found up to now just talk of one RAID configuration

you can have more than one /dev/mdN device no problem...

if you've got a raid1 root disk setup on a box with raid5/6, and you want 
to use a journalling filesystem on the raid5/6 you should seriously 
consider saving space on the root disk for an external journal for the 
raid5/6 filesystem.  it really helps metadata-heavy operations to offload 
the journal writes to the raid1 spindles.
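
a sketch of what that looks like, with invented device names (a small
raid1 /dev/md1 set aside for the journal, the filesystem itself on the
raid5 /dev/md2) -- the journal device and filesystem block sizes have
to match:

mke2fs -b 4096 -O journal_dev /dev/md1
mkfs.ext3 -b 4096 -J device=/dev/md1 /dev/md2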

-dean


Re: Updating superblock to reflect new disc locations

2006-01-11 Thread dean gaudet
On Thu, 12 Jan 2006, Neil Brown wrote:

 On Wednesday January 11, [EMAIL PROTECTED] wrote:
  
  
  Any suggestions would be greatly appreciated. The system's new and not 
  yet in production, so I can reinstall it if I have to, but I'd prefer to 
  be able to fix something as simple as this.
 
 Debian's installer - the mkinitrd part in particular - is broken.
 If you look in the initrd (it is just a compressed CPIO file) you
 will find the mdadm command used to start the root array explicitly
 lists the devices to use.  As soon as you change the devices, it stops
 working :-(  Someone should tell them about uuids.

actually i did tell them 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=338200 ...

however (a) i sent the bug report well after the freeze point for the last 
stable release and (b) i think initrd-tools is on its way out... folks in 
the unstable branch are forced to use yaird or initramfs-tools for kernels 
2.6.14 and beyond.  so i'm guessing the bug report will go nowhere -- but 
i haven't looked at either of those tools yet to see if they handle md 
root well.


 I think you can probably fix you situation by:
   Booting up and having a degraded array
   hot-add the missing device and wait for it to rebuild
   rerun mkinitrd
 
 The last bit might require some learning on your part.  I don't know
 if Debian's mkinitrd requires any command line args, or where it puts
 the result, or whether you have to tell lilo/grub about the new files.

basically run something like this:

dpkg-reconfigure linux-image-`uname -r`

(or possibly kernel-image-`uname -r` if you're on unstable branch prior to 
the renaming of the kernel packages.)

you really want to do that for all of your linux-image-* or kernel-image-* 
packages so they all get the new root locations.

another option without the rebuild is to boot a live-cd like knoppix and 
then hand-edit and repack the cramfs image from /boot... there's a script 
in the root or otherwise easy to find which has the mdadm command for 
assembling your root partition.  i'll skip the details though... i'd 
rather not try to get them all correct in a quick email.

-dean


Re: Journal-guided Resynchronization for Software RAID

2005-12-08 Thread dean gaudet
On Mon, 5 Dec 2005, Neil Brown wrote:

 One of these with built in xor and raid6 would be nice, but I'm not
 sure I could guarantee a big enough market for them to try convincing
 them to make one...

i wonder if the areca cards http://www.areca.com.tw/ are 
re-programmable... they seem to have all the hardware you're looking for.

-dean


Re: Journal-guided Resynchronization for Software RAID

2005-12-02 Thread dean gaudet
On Thu, 1 Dec 2005, Neil Brown wrote:

 What I would really like is a cheap (Well, not too expensive) board
 that had at least 100Meg of NVRAM which was addressable on the PCI
 buss, and an XOR and RAID-6 engine connected to the DMA engine.

there's the mythical giga-byte i-ram ... i say mythical because i've seen
lots of reviews but haven't been able to find it for sale:

http://www.giga-byte.com/Peripherals/Products/Products_GC-RAMDISK%20(Rev%201.1).htm

the only problem with the i-ram is the lack of ecc (it could be 
implemented in a software layer though).

umem.com have what look like excellent boards but they seem unwilling to 
sell in small quantities...

http://umem.com/Umem_NVRAM_Cards.html

-dean


Re: Still Need Help on mdadm and udev

2005-11-10 Thread dean gaudet
On Thu, 10 Nov 2005, Bill Davidsen wrote:

 I haven't had a good use for a partitionable device

i've used it to have root, swap, and some external xfs/ext3 logs on a 
single raid1... (the xfs/ext3 logs are for filesystems on another raid5) 
rather than managing 4 or 5 separate raid1s on the same 2 disks.

-dean


Re: s/w raid and bios renumbering HDs

2005-10-31 Thread dean gaudet
On Mon, 31 Oct 2005, Hari Bhaskaran wrote:

 Hi,
 
 I am trying to setup a RAID-1 setup for the boot/root partition. I got
 the setup working, except what I see with some of my tests leave me
 less convinced that it is actually working. My system is debian 3.1
 and I am not using the raid-setup options in the debian-installer,
 I am trying to add raid-1 to an existing system (followed
 http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html  -- 7.4 method 2)

fyi there's a debian-specific doc at 
/usr/share/doc/mdadm/rootraiddoc.97.html which i've always found useful.


 I have /dev/hda (master on primary) and /dev/hdc (master on secondary)
 setup as mirrors. I also have a cdrom on /dev/hdd. Now if I disconnect
 hda and reboot, everything seems work - except what used to be
 /dev/hdc comes up as /dev/hda. I know this since I the bios does
 complain that primary disk 0 is missing and I would have expected a
 missing hda, not a missing hdc.

huh i wonder if the bios has tweaked the ide controller to swap the 
primary/secondary somehow -- probably cuts down on support calls for 
people who plug things in wrong.  there could be a bios option to stop 
this swapping.


 Anyways, the software seems to
 recognize the failed-disk fine if I connect the real hda back. Is
 this the way it is supposed to work? Can I rely on this? Also what
 happens when I move on to fancier setups like raid5?.

the md superblock (at the end of the partition) contains reconstruction
information and UUIDs... the device names they end up on are mostly
irrelevant if you've got things configured properly.  i've moved disks
between /dev/hd* and /dev/sd* going from pata controllers to 3ware
controllers with no problem.

for raids other than the root raid you pretty much want to edit
/etc/mdadm/mdadm.conf and make sure it has DEVICE partitions and has
ARRAY entries for each of your arrays listing the UUID.  you can generate
these entries with mdadm --detail --scan (see examples on man page).
you can plug the non-root disks in any way you want and things will still
work if you've configured this.

the root is the only one which you need to be careful with -- when debian 
installs your kernel it constructs an initrd which lists the minimum 
places it will search for the root raid components... for example on one 
of my boxes:

# mkdir /mnt/cramfs
# mount -o ro,loop /boot/initrd.img-2.6.13-1-686-smp /mnt/cramfs
# cat /mnt/cramfs/script
ROOT=/dev/md3
mdadm -A /dev/md3 -R -u 2b3a5b77:c7b4ab81:a2b8322a:db5c4e88 /dev/sdb4 /dev/sda4
# umount /mnt/cramfs

it's only expecting to look for the root raid components in those two
partitions... seems kind of unfortunate really 'cause the script could
be configured to look in any partition.

in theory you can hand-edit the initrd if you plan to move root disks to
another position... you can't mount a cramfs rw, so you need to mount,
copy, edit, and run mkcramfs ... and i suggest not deleting your original
initrd, and i suggest copypasting the /boot/grub/menu.lst entries to give
you the option of booting the old initrd or your new made-by-hand one.
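
the edit-and-repack dance looks roughly like this (paths hypothetical,
and the initrd name obviously depends on your kernel):

mkdir /mnt/cram /tmp/newrd
mount -o ro,loop /boot/initrd.img-2.6.13-1-686-smp /mnt/cram
cp -a /mnt/cram/. /tmp/newrd/
umount /mnt/cram
vi /tmp/newrd/script      # fix the mdadm line for the new device names
mkcramfs /tmp/newrd /boot/initrd.img-2.6.13-1-686-smp.new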


 title   Debian GNU/Linux, kernel 2.6.13.3-vs2.1.0-rc4-RAID-hda
 root(hd0,0)
 kernel  /boot/vmlinuz-2.6.13.3-vs2.1.0-rc4 root=/dev/md0 ro
 initrd  /boot/initrd.img-2.6.13.3-vs2.1.0-rc4.md0
 savedefault
 boot
 
 title   Debian GNU/Linux, kernel 2.6.13.3-vs2.1.0-rc4-RAID-hdc
 root(hd1,0)
 kernel  /boot/vmlinuz-2.6.13.3-vs2.1.0-rc4 root=/dev/md0 ro
 initrd  /boot/initrd.img-2.6.13.3-vs2.1.0-rc4.md0
 savedefault
 boot

i don't think you need both.  when your first disk is dead the bios
shifts the second disk forward... and hd0 / hd1 refer to bios ordering.
i don't have both in my configs, but then i haven't bothered testing
booting off the second disk in a long time.  (i always have live-cds
such as knoppix handy for fixing boot problems.)

-dean


Re: s/w raid and bios renumbering HDs

2005-10-31 Thread dean gaudet


On Mon, 31 Oct 2005, Hari Bhaskaran wrote:

 So that DEVICE partitions line was really supposed to be there? Hehe... I
 thought it was just a
 help message and replaced it with DEVICE /dev/hda1 /dev/hdc1 :)

you can use DEVICE /dev/hda1 /dev/hdc1 ... but then mdadm scans will 
only consider those two partitions... if you use DEVICE partitions it'll 
look at all detected partitions for the components.  it makes it easy when 
you move disks around to new controllers and their location changes, 
things will continue to jfw.


 If I ever end up in a situation with a non-root raid down (say I did --stop),
 how do I start it back up? (--run seems
 to give me some errors). Anyways, more rtfm to do.

you want --assemble ...


  the root is the only one which you need to be careful with -- when debian
  installs your kernel it constructs an initrd which lists the minimum places
  it will search for the root raid components... for example on one of my
  boxes:
  
  # mkdir /mnt/cramfs
  # mount -o ro,loop /boot/initrd.img-2.6.13-1-686-smp /mnt/cramfs
  # cat /mnt/cramfs/script
  ROOT=/dev/md3
  mdadm -A /dev/md3 -R -u 2b3a5b77:c7b4ab81:a2b8322a:db5c4e88 /dev/sdb4
  /dev/sda4
  # umount /mnt/cramfs

 Did u install yours with raid options in the debian installer? I dont think my
 initrd image would have all these ( I dont have
 access to the machine now to check) - but I wouldn't think the mkinitrd that I
 used to created the initrd image would
 know that I am using raid or not ( I am talking about the mdadm references in
 your script). Or are you saying you
 added these yourself?

there's really no reason to avoid the debian installer's raid support if 
you know you want raid, but i haven't used it much.

you only need to do initrd edits by hand once if you're converting root to 
a raid. there's a few steps in the debian doc about this (see Part II. 
RAID using initrd and grub) /usr/share/doc/mdadm/rootraiddoc.97.html.

after that initial change, and you've managed to boot with root on raid 
then subsequent mkinitrd *should* fill in the details automatically... 
i.e. every time you upgrade your kernel you get a new initrd, and it 
should automatically include the root raid setup.

-dean


Re: split RAID1 during backups?

2005-10-24 Thread dean gaudet
On Mon, 24 Oct 2005, Jeff Breidenbach wrote:

 First of all, if the data is mostly static, rsync might work faster.
 
 Any operation that stats the individual files - even to just look at
 timestamps - takes about two weeks. Therefore it is hard for me to see
 rsync as a viable solution, even though the data is mostly
 static. About 400,000 files change between weekly backups.

taking a long time to stat individual files makes me wonder if you're
suffering from atime updates and O(n) directory lookups... have you tried
this:

- mount -o noatime,nodiratime
- tune2fs -O dir_index  (and e2fsck -D)
  (you need recentish e2fsprogs for this, and i'm pretty sure you want
  a 2.6.x kernel)

a big hint you're suffering from atime updates is write traffic when your
fs is mounted rw, and your static webserver is the only thing running (and
your logs go elsewhere)... atime updates are probably the only writes
then.  try iostat -x 5.

a big hint you're suffering from O(n) directory lookups is heaps of system
time... (vmstat or top).
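
concretely, something along these lines (the /web mountpoint and /dev/md0
are borrowed from elsewhere in the thread, and the e2fsck step needs the
filesystem unmounted):

  # stop atime writes; this can be done live:
  mount -o remount,noatime,nodiratime /web
  # enable hashed directories, then rebuild the existing ones offline:
  tune2fs -O dir_index /dev/md0
  e2fsck -fD /dev/md0
  # watch for write traffic and system time while the server is busy:
  iostat -x 5
  vmstat 5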


On Mon, 24 Oct 2005, Brad Campbell wrote:

 mount -o remount,ro /dev/md0 /web
 mdadm --fail /dev/md0 /dev/sdd1
 mdadm --remove /dev/md0 /dev/sdd1
 mount -o ro /dev/sdd1 /target
 
 do backup here
 
 umount /target
 mdadm --add /dev/md0 /dev/sdd1
 mount -o remount,rw /dev/md0 /web

the md event counts would be out of sync, and unless you're using bitmapped
intent logging this would cause a full resync when you re-add the disk.  if
the raid wasn't online you could probably use one of the mdadm options to
force the two devices to be a sync'd raid1 ... but i'm guessing you wouldn't
be able to do it online.

other 2.6.x bleeding-edge options are to mark one drive as write-mostly
so that you have no read-traffic competition while doing a backup... or
just use the bitmap intent logging and an nbd to add a third, networked
copy of the drive on another machine.
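
roughly like so, i think (needs a recent 2.6 kernel and an mdadm with
bitmap support; /dev/nbd0 here is a made-up nbd device you'd have set up
with nbd-client first):

  # write-intent bitmap, so a member that drops out can be re-added cheaply:
  mdadm --grow /dev/md0 --bitmap=internal
  # add a networked third mirror that is only used for writes:
  mdadm /dev/md0 --add --write-mostly /dev/nbd0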

-dean


Re: split RAID1 during backups?

2005-10-24 Thread dean gaudet
On Mon, 24 Oct 2005, Jeff Breidenbach wrote:

 Dean, the comment about write-mostly is confusing to me.  Let's say
 I somehow marked one of the component drives write-mostly to quiet it
 down. How do I get at it? Linux will not let me mount the component
 partition if md0 is also mounted. Do you think write-mostly or
 write-behind are likely enough to be magic bullets that I should
 learn all about them?

if one drive is write-mostly, and you remount the filesystem read-only... 
then no writes should be occurring... and you could dd from the component 
drive directly and get a consistent fs image.  (i'm assuming you can 
remount the filesystem read-only for the duration of the backup because it 
sounds like that's how you do it now; and i'm assuming you're happy enough 
with your dd_rescue image...)
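
i.e. something like this (device and mount names are as in the earlier
messages, and /backup/web.img is just a made-up destination):

  mount -o remount,ro /web
  # the array stays assembled; just read one component directly:
  dd if=/dev/sdd1 of=/backup/web.img bs=1M
  mount -o remount,rw /web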

myself i've been considering a related problem... i don't trust LVM/DM 
snapshots in 2.6.x yet, and i've been holding back a 2.4.x box waiting for 
them to stabilise... but that seems to be taking a long time.  the box 
happens to have a 3-way raid1 anyhow, and 2.6.x bitmapped intent logging 
would give me a great snapshot backup option:  just break off one disk 
during the backup and put it back in the mirror when done.

there's probably one problem with this 3-way approach... i'll need some 
way to get the fs (ext3) to reach a safe point where no log recovery 
would be required on the disk i break out of the mirror... because under 
no circumstances do you want to write on the disk while it's outside the 
mirror.  (LVM snapshotting in 2.4.x requires a VFS lock patch which does 
exactly this when you create a snapshot.)
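
one way to reach that safe point might be the remount-read-only trick from
the backup sequence earlier in the thread (just a sketch -- the md device,
partition and mountpoint names are made up, and it assumes a write-intent
bitmap is already on the array so the re-add is cheap):

  mount -o remount,ro /data        # should flush the ext3 journal
  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
  mount -o remount,rw /data
  # ... back up by reading /dev/sdc1 directly, never writing to it ...
  mdadm /dev/md0 --re-add /dev/sdc1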


 John, I'm using 4KB blocks in reiserfs with tail packing.

i didn't realise you were using reiserfs... i'd suggest disabling tail 
packing... but then i've never used reiser, and i've only ever seen 
reports of tail packing having serious performance impact.  you're really 
only saving yourself an average of half a block per inode... maybe try a 
smaller block size if the disk space is an issue due to lots of inodes.
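
(if you want to try it, notail is just a mount option; existing packed
tails stay packed until the files are rewritten.  /web here is the
mountpoint used earlier in the thread:)

  mount -o remount,notail /web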

-dean


Re: RAID6 Query

2005-08-16 Thread dean gaudet
On Tue, 16 Aug 2005, Colonel Hell wrote:

 I just went thru a couple of papers describing RAID6.
 I dunno how relevant this discussion group is for the query... but here I go :)
 ...
 I couldn't figure out why the P+Q configuration is better than P+q' where
 q' == P. What I mean is: instead of calculating a new checksum (thru a
 lot of GF theory etc) just store the parity block (P) again. In this
 case as well we have the same amount of fault tolerance, or not?
 :-s  ...

this is no better than raid5 at surviving a two-disk failure.  i.e.
consider the case where two data blocks are missing -- you can't
reconstruct them if all you have is parity (even two copies of it).
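
to make it concrete, a toy example with just two data blocks (numbers
made up):

  D0 = 0x0a, D1 = 0x06
  P  = D0 xor D1 = 0x0c
  q' = P         = 0x0c     (the proposed duplicate)

lose D0 and D1 and all you know is D0 xor D1 = 0x0c, which 256 different
byte pairs satisfy... the duplicate q' just repeats the same equation, so
it adds nothing.  the reed-solomon Q is constructed to be independent of
P, which gives you the second equation you need to solve for two unknowns.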

-dean


Re: SOLVED: forcing boot ordering of multilevel RAID arrays

2005-08-07 Thread dean gaudet
On Sun, 7 Aug 2005, Trevor Cordes wrote:

 Any array that is a superset of other arrays (a multilevel array) must
 be set to non-autodetect.  Use fdisk to change the partition type to 83
 (standard linux), NOT fd (linux raid autodetect).

you know, i'd be worried that setting it to 0x83 will cause trouble now and
then with tools assuming it's really a filesystem... personally i've used
0xDA (Non-FS data) for such things in the past.  i didn't really see any
type more appropriate... i'm hoping no tool assumes anything about 0xDA.
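
e.g. (the partition number is whichever one holds the md component, and
the sfdisk syntax is the old util-linux style):

  # interactively:  fdisk /dev/sda  then  t, <partition number>, da, w
  # or in one shot:
  sfdisk --change-id /dev/sda 3 da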

-dean


2.4.30: bug in file md.c, line 2473

2005-04-20 Thread dean gaudet
i got the following bug from 2.4.30 while trying to hot-add a device 
tonight...

i was trying to replace a disk in a 3-way raid1 -- the existing disks are 
sda and sdb, and i was replacing sdc.  each of these disks has 3 partitions, 
each in its own raid1.

due to an improper shutdown the raids were being sync'd... specifically:

Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 sdb1[2] sda1[0]
  7823552 blocks [3/2] [U_U]
resync=DELAYED
md1 : active raid1 sdb2[2] sda2[0]
  3911744 blocks [3/2] [U_U]
resync=DELAYED
md2 : active raid1 sdb3[1] sda3[0]
  108318144 blocks [3/2] [UU_]
  [==..]  resync = 13.3% (14415248/108318144) 
finish=59.3min speed=26366K/sec
md3 : active raid1 sde1[1] sdd1[0]
  199141632 blocks [3/2] [UU_]
  [=...]  resync =  6.0% (12045888/199141632) 
finish=145.1min speed=21480K/sec

and then i tried:

# mdadm /dev/md1 -a /dev/sdc2
mdadm: hot add failed for /dev/sdc2: Invalid argument

and the dmesg bug output is pasted below.

let me know if there's more info you'd like ... or if i did something dumb 
:)

thanks
-dean

md: trying to hot-add sdc2 to md1 ... 
md: bind<sdc2,3>
md: bug in file md.c, line 2473

md: **
md: * COMPLETE RAID STATE PRINTOUT *
md: **
md0: sdb1sda1 array superblock:
md:  SB: (V:0.90.0) ID:54a41317.8d4ba606.7dd1ac2d.43c65883 CT:42621381
md: L1 S07823552 ND:4 RD:3 md0 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:d1f282a0 E:0024
 D  0:  DISKN:0,sda1(8,1),R:0,S:6
 D  1:  DISKN:1,[dev 00:00](0,0),R:1,S:9
 D  2:  DISKN:2,sdb1(8,17),R:2,S:6
 D  3:  DISKN:3,[dev 00:00](0,0),R:3,S:1
 D  4:  DISKN:4,[dev 00:00](0,0),R:4,S:1
md: THIS:  DISKN:2,sdb1(8,17),R:2,S:6
md: rdev sdb1: O:sdb1, SZ:07823552 F:0 DN:2 <6>md: rdev superblock:
md:  SB: (V:0.90.0) ID:54a41317.8d4ba606.7dd1ac2d.43c65883 CT:42621381
md: L1 S07823552 ND:4 RD:3 md0 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:d1f28438 E:0024
 D  0:  DISKN:0,sda1(8,1),R:0,S:6
 D  1:  DISKN:1,[dev 00:00](0,0),R:1,S:9
 D  2:  DISKN:2,sdb1(8,17),R:2,S:6
 D  3:  DISKN:3,[dev 00:00](0,0),R:3,S:1
 D  4:  DISKN:4,[dev 00:00](0,0),R:4,S:1
md: THIS:  DISKN:2,sdb1(8,17),R:2,S:6
md: rdev sda1: O:sda1, SZ:07823552 F:0 DN:0 <6>md: rdev superblock:
md:  SB: (V:0.90.0) ID:54a41317.8d4ba606.7dd1ac2d.43c65883 CT:42621381
md: L1 S07823552 ND:4 RD:3 md0 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:d1f28424 E:0024
 D  0:  DISKN:0,sda1(8,1),R:0,S:6
 D  1:  DISKN:1,[dev 00:00](0,0),R:1,S:9
 D  2:  DISKN:2,sdb1(8,17),R:2,S:6
 D  3:  DISKN:3,[dev 00:00](0,0),R:3,S:1
 D  4:  DISKN:4,[dev 00:00](0,0),R:4,S:1
md: THIS:  DISKN:0,sda1(8,1),R:0,S:6
md1: sdc2sdb2sda2 array superblock:
md:  SB: (V:0.90.0) ID:94d6f5e3.ce487a3a.fb358d6f.375e923c CT:4262138d
md: L1 S03911744 ND:4 RD:3 md1 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:c3e2a424 E:001c
 D  0:  DISKN:0,sda2(8,2),R:0,S:6
 D  1:  DISKN:1,[dev 00:00](0,0),R:1,S:9
 D  2:  DISKN:2,sdb2(8,18),R:2,S:6
 D  3:  DISKN:3,sdc2(8,34),R:3,S:1
 D  4:  DISKN:4,[dev 00:00](0,0),R:4,S:1
md: THIS:  DISKN:2,sdb2(8,18),R:2,S:6
md: rdev sdc2: O:sdc2, SZ:03911744 F:0 DN:-1 <6>md: rdev superblock:
md:  SB: (V:1.-150761216.0) ID:.f6b49250.f6ab2d80. CT:
md: L-156056320 S-156594432 ND:0 RD:-156573696 md0 LO:0 CS:0
md: UT:f6ab2a80 ST:0 AD:-156573696 WD:0 FD:0 SD:-155937712 CSUM: 
E:
 D 20:  DISKN:483,[dev e3:e3](4194787,8389091),R:12583395,S:16777699
 D 21:  DISKN:134218211,[dev 
e3:e3](138412515,142606819),R:146801123,S:150995427
 D 22:  DISKN:268435939,[dev 
e3:e3](272630243,276824547),R:281018851,S:285213155
 D 23:  DISKN:402653667,[dev 
e3:e3](406847971,411042275),R:415236579,S:419430883
 D 24:  DISKN:536871395,[dev 
e3:e3](541065699,545260003),R:549454307,S:553648611
 D 25:  DISKN:671089123,[dev 
e3:e3](675283427,679477731),R:683672035,S:687866339
 D 26:  DISKN:805306851,[dev 
e3:e3](809501155,813695459),R:817889763,S:822084067
md: THIS:  DISKN:0,[dev 20:67](0,935010407),R:921989223,S:0
md: rdev sdb2: O:sdb2, SZ:03911744 F:0 DN:2 <6>md: rdev superblock:
md:  SB: (V:0.90.0) ID:94d6f5e3.ce487a3a.fb358d6f.375e923c CT:4262138d
md: L1 S03911744 ND:4 RD:3 md1 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:c3e2a5bc E:001c
 D  0:  DISKN:0,sda2(8,2),R:0,S:6
 D  1:  DISKN:1,[dev 00:00](0,0),R:1,S:9
 D  2:  DISKN:2,sdb2(8,18),R:2,S:6
 D  3:  DISKN:3,[dev 00:00](0,0),R:3,S:1
 D  4:  DISKN:4,[dev 00:00](0,0),R:4,S:1
md: THIS:  DISKN:2,sdb2(8,18),R:2,S:6
md: rdev sda2: O:sda2, SZ:03911744 F:0 DN:0 <6>md: rdev superblock:
md:  SB: (V:0.90.0) ID:94d6f5e3.ce487a3a.fb358d6f.375e923c CT:4262138d
md: L1 S03911744 ND:4 RD:3 md1 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 
