Re: Feature Request/Suggestion - Drive Linking

2006-09-05 Thread dean gaudet
On Mon, 4 Sep 2006, Bill Davidsen wrote:

 But I think most of the logic exists, the hardest part would be deciding what
 to do. The existing code looks as if it could be hooked to do this far more
 easily than writing new. In fact, several suggested recovery schemes involve
 stopping the RAID5, replacing the failing drive with a created RAID1, etc. So
 the method is valid, it would just be nice to have it happen without human
 intervention.

you don't actually have to stop the raid5 if you're using bitmaps... you 
can just remove the disk, create a (superblockless) raid1 and put the 
raid1 back in place.
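
for illustration, the sequence would look something like this (device 
names are made up and the exact flags are a sketch rather than a recipe 
-- see http://arctic.org/~dean/proactive-raid5-disk-replacement.txt for 
the full procedure):

    # /dev/sdc1 = failing member of /dev/md0, /dev/sdd1 = its replacement
    mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
    # wrap the old member in a raid1 with no superblock of its own
    # ('missing' is assumed to be accepted by --build as it is by --create)
    mdadm --build /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 missing
    # put the raid1 back into the raid5; the write-intent bitmap keeps
    # this resync short
    mdadm /dev/md0 --re-add /dev/md1
    # now mirror the failing disk onto the new one
    mdadm /dev/md1 --add /dev/sdd1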

the whole process could be handled a lot like mdadm handles spare groups 
already... there isn't a lot more kernel support required.
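
(for reference, spare groups are just a tag in mdadm.conf; something like 
the fragment below lets mdadm --monitor move a hot spare between arrays, 
and drive linking could hang off the same machinery -- UUIDs omitted, 
names illustrative only:)

    ARRAY /dev/md0 UUID=<uuid-of-md0> spare-group=shelf1
    ARRAY /dev/md1 UUID=<uuid-of-md1> spare-group=shelf1
    # mdadm --monitor --scan will shuffle a spare from one array to the
    # other when one of them degrades and the other has a spare free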

the largest problem is if a power failure occurs before the process 
finishes.  i'm 95% certain that even during a reconstruction, raid1 writes 
go to all copies even if the write is beyond the current sync position[1] 
-- so the raid5 superblock would definitely have been written to the 
partial disk... so that means on a reboot there'll be two disks which look 
like they're both the same (valid) component of the raid5, and one of them 
definitely isn't.

maybe there's some trick to handle this situation -- aside from ensuring 
the array won't come up automatically on reboot until after the process 
has finished.

one way to handle it would be to have an option for raid1 resync which 
suppresses writes which are beyond the resync position... then you could 
zero the new disk superblock to start with, and then start up the resync 
-- then it won't have a valid superblock until the entire disk is copied.
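
(the suppress-writes-past-the-resync-cursor option above is hypothetical, 
but the zeroing step it relies on already exists; /dev/sdd1 is a 
placeholder for the incoming disk:)

    # make sure the new disk carries no stale md superblock before the
    # copy starts, so it can't be mistaken for a valid raid5 member
    mdadm --zero-superblock /dev/sdd1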

-dean

[1] there's normally a really good reason for raid1 to mirror all writes 
even if they're beyond the resync point... consider the case where you 
have a system crash and have 2 essentially identical mirrors which then 
need a resync... and the source disk dies during the resync.

if all writes have been mirrored then the other disk is already useable 
(in fact it's essentially arbitrary which of the mirrors was used for the 
resync source after the crash -- they're all equally (un)likely to have 
the most current data)... without bitmaps this sort of thing is a common 
scenario, and it has certainly saved my data more than once.


Re: Feature Request/Suggestion - Drive Linking

2006-09-04 Thread Bill Davidsen

Michael Tokarev wrote:
 Tuomas Leikola wrote:
 []
  Here's an alternate description. On first 'unrecoverable' error, the
  disk is marked as FAILING, which means that a spare is immediately
  taken into use to replace the failing one. The disk is not kicked, and
  readable blocks can still be used to rebuild other blocks (from other
  FAILING disks).
 
  The rebuild can be more like a ddrescue type operation, which is
  probably a lot faster in the case of raid6, and the disk can be
  automatically kicked after the sync is done. If there is no read
  access to the FAILING disk, the rebuild will be faster just because
  seeks are avoided in a busy system.
 
 It's not that simple.  The issue is with writes.  If there's a failing
 disk, md code will need to keep track of up-to-date, or good, sectors
 of it vs obsolete ones.  Ie, when a write fails, the data in that block
 is either unreadable (but can become readable on the next try, say, after
 a temperature change or whatnot), or readable but contains old data, or
 is readable but contains some random garbage.  So at least those block(s)
 of the disk should not be copied to the spare during resync, and should
 not be read at all, to avoid returning wrong data to userspace.  In short,
 if the array isn't stopped (or changed to read-only), we should watch for
 writes and remember which ones failed.  Which is a non-trivial
 change.  Yes, bitmaps help somewhat here.

It would seem that much of the code needed is already there. When doing 
the recovery the spare can be treated as a RAID1 copy of the failing 
drive, with all sectors out of date. Then the sectors from the failing 
drive can be copied, using reconstruction if needed, until there is a 
valid copy on the new drive.


There are several decision points during this process:
- do writes get tried to the failing drive, or just the spare?
- do you mark the failing drive as failed after the good copy is created?

But I think most of the logic exists, the hardest part would be deciding 
what to do. The existing code looks as if it could be hooked to do this 
far more easily than writing new. In fact, several suggested recovery 
schemes involve stopping the RAID5, replacing the failing drive with a 
created RAID1, etc. So the method is valid, it would just be nice to 
have it happen without human intervention.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Feature Request/Suggestion - Drive Linking

2006-09-03 Thread Tuomas Leikola

This way I could get the replacement in and do the resync without
actually having to degrade the array first.


snip


2) This sort of brings up a subject I'm getting increasingly paranoid
about. It seems to me that if disk 1 develops an unrecoverable error at
block 500 and disk 4 develops one at 55,000 I'm going to get a double
disk failure as soon as one of the bad blocks is read


Here's an alternate description. On first 'unrecoverable' error, the
disk is marked as FAILING, which means that a spare is immediately
taken into use to replace the failing one. The disk is not kicked, and
readable blocks can still be used to rebuild other blocks (from other
FAILING disks).

The rebuild can be more like a ddrescue type operation, which is
probably a lot faster in the case of raid6, and the disk can be
automatically kicked after the sync is done. If there is no read
access to the FAILING disk, the rebuild will be faster just because
seeks are avoided in a busy system.

Personally I feel this is a good idea, count my vote in.

- Tuomas



Re: Feature Request/Suggestion - Drive Linking

2006-09-03 Thread Michael Tokarev
Tuomas Leikola wrote:
[]
 Here's an alternate description. On first 'unrecoverable' error, the
 disk is marked as FAILING, which means that a spare is immediately
 taken into use to replace the failing one. The disk is not kicked, and
 readable blocks can still be used to rebuild other blocks (from other
 FAILING disks).
 
 The rebuild can be more like a ddrescue type operation, which is
 probably a lot faster in the case of raid6, and the disk can be
 automatically kicked after the sync is done. If there is no read
 access to the FAILING disk, the rebuild will be faster just because
 seeks are avoided in a busy system.

It's not that simple.  The issue is with writes.  If there's a failing
disk, md code will need to keep track of up-to-date, or good, sectors
of it vs obsolete ones.  Ie, when a write fails, the data in that block
is either unreadable (but can become readable on the next try, say, after
a temperature change or whatnot), or readable but contains old data, or
is readable but contains some random garbage.  So at least those block(s)
of the disk should not be copied to the spare during resync, and should
not be read at all, to avoid returning wrong data to userspace.  In short,
if the array isn't stopped (or changed to read-only), we should watch for
writes and remember which ones failed.  Which is a non-trivial
change.  Yes, bitmaps help somewhat here.

/mjt



Feature Request/Suggestion - Drive Linking

2006-08-29 Thread Neil Bortnak
Hi Everybody,

I had this major recovery last week after a hardware failure monkeyed
things up pretty badly. About halfway through I had a couple of ideas
and I thought I'd suggest/ask them.

1) Drive Linking: So let's say I have a 6 disk RAID5 array and I have
reason to believe one of the drives will fail (funny noises, SMART
warnings or it's *really* slow compared to the other drives, etc). It
would be nice to put in a new drive, link it to the failing disk so that
it copies all of the data to the new one and mirrors new writes as they
happen.

This way I could get the replacement in and do the resync without
actually having to degrade the array first. When it's done, pulling out
the failing disk automatically breaks the link and everything goes back
to normal. Or, if you break the link in software, it removes the old
disk from the array and wipes out the superblock automatically.

Maybe there is a way to do this already and I just missed it, but I
don't think so. I'm not really keen on degrading the array just in case
the system finds an unrecoverable error on one of the other disks during
the resync and the whole thing comes crashing down in a dual disk
failure. In fact, I'm not keen on degrading the array period.


2) This sort of brings up a subject I'm getting increasingly paranoid
about. It seems to me that if disk 1 develops an unrecoverable error at
block 500 and disk 4 develops one at 55,000 I'm going to get a double
disk failure as soon as one of the bad blocks is read (or some other
system problem -makes it look like- some random block is
unrecoverable). Such an error should not bring the whole thing to a
crashing halt. I know I can recover from that sort of error manually,
but yuk.

It seems to me that as arrays get larger and larger, we need failure
mechanisms better than "wipe out 750G of mirror and put the array in
jeopardy because a single block is unrecoverable". Can bad
block redirection help us add a layer of defense, at least in the short
term? Granted, if the disk block is unrecoverable because all the spares
are used up, the chances are the drive will die off soon anyway, but I'd
rather get one last kick at doing a clean rebuild (maybe a la the disk
linking idea above) before ejecting the drive. The current methods
employed by RAID 1-6 seem a bit crude. Fine for 20 years ago, but
showing their age with today's increasingly massive data sets.

I'm quite thankful for all the MD work and this isn't a criticism. I'm
merely interested in the problem and wonder at other people's thoughts
on the matter. Maybe we can move from something that paints in large
strokes like RAID 1-6 and look towards an all-new RAID-OMG. I'm
basically thinking it's prudent to apply security's idea of defense in
depth to drive safety.


3) So this last rebuild I had to do was for a system with a double disk
failure and no backup (no, not my system; I would have had a backup, since
as we all know raid doesn't protect against a lot of threats). I managed to
get it done but I ended up writing a lot of offline, userspace
verification and resync tools in perl and C and editing the superblocks
with hexedit.

An extra tool to edit superblock fields would be very keen.

If no one is horrified by the fact I did the other recovery tools in
perl, I would be happy to clean them up and submit them. I wrote one to
verify a given disk's data vs. the other disks and report errors
(optionally fixing them). It also has a range feature so you don't have
to do the whole disk. The other is similar, but I built it for high
speed bulk resyncing from userspace (no need to have RAID in the
kernel).


4) And finally (for today at least), can mdadm do the equivalent of
NetApp's or 3Ware's disk scrubbing? I know I can check an array manually
with a /sys entry, but it would be cool to have mdadm optionally run
these checks and continually rerun them when they finish, for all
the arrays on the system. Just part of its monitoring duties really.
For someone like me, I only care about data integrity and uptime, not
speed. I heard something like that was going in, but I don't know its
status.
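
(For what it's worth, the manual version of that scrub is just a sysfs 
write per array; the loop below is only a sketch of what a cron job could 
do, using the same /sys entry mentioned above:)

    # kick off a 'check' pass on every md array on the box
    for action in /sys/block/md*/md/sync_action; do
        echo check > "$action"
    done
    # progress appears in /proc/mdstat, and any inconsistencies found are
    # counted in /sys/block/mdX/md/mismatch_cnt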

Thanks!

Neil



Re: Feature Request/Suggestion - Drive Linking

2006-08-29 Thread dean gaudet
On Wed, 30 Aug 2006, Neil Bortnak wrote:

 Hi Everybody,
 
 I had this major recovery last week after a hardware failure monkeyed
 things up pretty badly. About halfway through I had a couple of ideas
 and I thought I'd suggest/ask them.
 
 1) Drive Linking: So let's say I have a 6 disk RAID5 array and I have
 reason to believe one of the drives will fail (funny noises, SMART
 warnings or it's *really* slow compared to the other drives, etc). It
 would be nice to put in a new drive, link it to the failing disk so that
 it copies all of the data to the new one and mirrors new writes as they
 happen.

http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

works for any raid level actually.


 2) This sort of brings up a subject I'm getting increasingly paranoid
 about. It seems to me that if disk 1 develops an unrecoverable error at
 block 500 and disk 4 develops one at 55,000 I'm going to get a double
 disk failure as soon as one of the bad blocks is read (or some other
 system problem -makes it look like- some random block is
 unrecoverable). Such an error should not bring the whole thing to a
 crashing halt. I know I can recover from that sort of error manually,
 but yuk.

Neil made some improvements in this area as of 2.6.15... when md gets a 
read error it won't knock the entire drive out immediately -- it first 
attempts to reconstruct the sectors from the other drives and write them 
back.  this covers a lot of the failure cases because the drive will 
either successfully complete the write in-place, or use its reallocation 
pool.  the kernel logs when it makes such a correction (but the log wasn't 
very informative until 2.6.18ish i think).

if you watch SMART data (either through smartd logging changes for you, or 
if you diff the output regularly) you can see this activity happen as 
well.

you can also use the check/repair sync_actions to force this to happen 
when you know a disk has a Current_Pending_Sector (i.e. pending read 
error).
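
something along these lines (device names are placeholders; smartctl is 
from smartmontools):

    # see whether the drive is sitting on pending (unreadable) sectors
    smartctl -A /dev/sdc | grep -i -e Current_Pending_Sector -e Reallocated
    # force md to read every sector of the array and rewrite anything it
    # can't read using the other disks
    echo check > /sys/block/md0/md/sync_action
    # afterwards the pending-sector count should drop back toward zero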

-dean