Re: Raid1 - dangerous resync after power-failure?

Mike Black Thu, 30 Mar 2000 04:54:24 -0800
I've been bit by this fact.  I had one disk fail -- the spare kicked in --
then during the resync got ANOTHER bad sector in an area of the disk that
wasn't used much (tail end of the thing).  The whole RAID hosed then
(RAID5).  I was able to recover but it was down for a while.

RAID resync should at least say "I'm doing a resync -- therefore do NOT tag
any other disks bad because of the resync operation".   In fact, resync
should be able to try and retrieve the sector from the previously bad disk
to try and reconstruct, shouldn't it?  Of course this would only work if the
bad disk was still operational (which has happened to me about 4 times so
far -- haven't had a disk "just die" yet on my server).

Here's what I think should happen:
#1 - Disk gets bad sector -- dropped from array
#2 - Resync starts reconstruction
#3 - Resync gets bad sector on another disk -- stops resync with
error/warning
#4 - Sysadmin does bad block scan of BOTH disks remapping bad blocks -- we
now have a bad block on both disks but still have 2-out-of-3 blocks for each
bad block across all 3 disks.
#5 - Resync restarted with special flag - that says "map sector# on disk x
to disk y" and "map sector# on disk y to disk x".  Say you have sda, sdb,
and sdc.  sdc fails originally, then sdb.  sdb has bad sector#1000 and sdc
has bad sector#2000.  You do "raidhotadd /dev/md0 /dev/sdc
rebuild=sdb/1000/sdc" -- this will raid-hot-add "sdc" to the array -- but
when the resync gets to block#1000 it will use sdc block#1000 instead (which
in all likelihood is still good).
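
A rough sketch of step #5, as a toy model rather than md driver code.  It
assumes 3-disk RAID5 stripes where the missing block is the XOR of the other
two, uses None to mark an unreadable sector, and invents the names rebuild()
and overrides purely for illustration:

```python
# Toy model of the proposal above. Each disk is a list of integer blocks;
# in every stripe the block on the target equals the XOR of the two sources.
# None marks an unreadable (bad) sector. All names are hypothetical.

def rebuild(target, sources, overrides):
    """Rebuild every block of `target` from the two `sources` disks.

    `overrides` is a set of stripe numbers where a source sector is known
    to be bad, so the old copy still present on `target` is used instead.
    Raises IOError if a stripe has no usable copy at all.
    """
    rebuilt = []
    for n, (a, b) in enumerate(zip(*sources)):
        if a is not None and b is not None:
            rebuilt.append(a ^ b)        # normal parity reconstruction
        elif n in overrides and target[n] is not None:
            rebuilt.append(target[n])    # fall back to the old block on target
        else:
            raise IOError(f"stripe {n}: no usable copy")
    return rebuilt


sda = [1, 2, 3, 4]
sdb = [5, None, 7, 8]                # bad sector in stripe 1
sdc_old = [1 ^ 5, 2 ^ 6, None, 4 ^ 8]  # bad sector in stripe 2
sdc_new = rebuild(sdc_old, (sda, sdb), overrides={1})
```

Stripe 1 comes from the old sdc copy and stripe 2 from sda and sdb, so the
rebuild completes even though two disks each have a bad sector.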


________________________________________
Michael D. Black   Principal Engineer
[EMAIL PROTECTED]  321-676-2923,x203
http://www.csihq.com  Computer Science Innovations
http://www.csihq.com/~mike  My home page
FAX 321-676-2355
----- Original Message -----
From: "Sam" <[EMAIL PROTECTED]>
To: "Thomas Kotzian" <[EMAIL PROTECTED]>
Cc: "Linux-RAID" <[EMAIL PROTECTED]>
Sent: Wednesday, March 29, 2000 4:23 PM
Subject: Re: Raid1 - dangerous resync after power-failure?


OK, regardless of how the failure occurs,  my point is that a
resync is a potentially dangerous operation if you don't
know beforehand whether the source disk has bad sectors or not.
So I don't think a resync should be performed except when
absolutely necessary, or unless the source disk is known to
be absolutely free from errors.

Can someone answer my original question which was:

     Could the SB_CLEAN flag be eliminated to reduce the
     risk of a resync damaging good data?
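
To make the question concrete, here is a minimal sketch of the two
boot-time policies.  The Superblock class and both function names are
made up for illustration and are not the md driver's real structures; it
also assumes, as my original message does, that the event counter would
differ after an incomplete write:

```python
# Hypothetical illustration of the two policies discussed in this thread.
from dataclasses import dataclass


@dataclass
class Superblock:
    clean: bool   # stands in for the SB_CLEAN flag
    events: int   # stands in for the event counter


def needs_resync_current(mirrors):
    # Behaviour as described: any mirror not marked clean forces a resync.
    return any(not sb.clean for sb in mirrors)


def needs_resync_proposed(mirrors):
    # Proposal: resync only when the event counters disagree.
    return len({sb.events for sb in mirrors}) > 1


# After a power failure with no writes in flight: neither mirror is marked
# clean, but both carry the same event counter.
after_power_loss = [Superblock(clean=False, events=42),
                    Superblock(clean=False, events=42)]
```

With these superblocks the current policy asks for a resync and the
proposed one does not.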

> > I hope you're making a joke.
>
> that's not a joke - I found it in a document describing raid. I hope I can
> find it again, so I can send it to you!
>
> > The problem is that I'm trying to use the software in
> > the "real world" where the most likely need for raid-1 is
> > due to power problems.
>
> Therefore you should use a UPS, not raid. And even when you are using a
> journalling file-system it's possible to lose data.
>
> > At least that's been my experience over many years of
> > doing sysadmin - most disk failures seem to occur after
> > some sort of power outage.  Either the power goes out,
> > or someone accidentally pulls the plug, etc.
>
> I had a server running a raid5 with 3 disks and after some power failures
> the system didn't want to mount the disk again. The other, un-raided disks
> didn't have any problem. After reinstallation and adding a UPS all is fine
> (now for 8 months).
>
> >
> > Sam
> >
> > Thomas Kotzian wrote:
> >
> > > raid wasn't invented to survive a power failure but a disk-failure!
> > >
> > > Thomas
> > >
> > > ----- Original Message -----
> > > From: "Sam" <[EMAIL PROTECTED]>
> > > To: <[EMAIL PROTECTED]>
> > > Sent: Monday, March 27, 2000 1:00 PM
> > > Subject: Raid1 - dangerous resync after power-failure?
> > >
> > > > I'm setting up a web server with Raid-1, using raidtools 0.90-5
> > > > and linux kernel 2.2.12 (this is the Redhat 6.1 distr).  I want to
> > > > mirror all my data across two disks (hda and hdc).
> > > >
> > > > The problem I've noticed from testing is that if I shut off the
> > > > power and then reboot, the raidtools software will start re-syncing
> > > > the mirrors, even though there was no write activity at all when the
> > > > power went off and even though both parts of the mirror have the
> > > > exact same event counter.
> > > >
> > > > The problem I see with this is as follows:
> > > >
> > > >     - Assume a power outage hits and wipes out some sectors on the
> > > >       hda disk, but leaves the superblock alone.  I think this
> > > >       scenario is a fairly likely one.
> > > >
> > > >     - After the power outage, the system boots up and starts up a
> > > >       resync, copying data from hda to hdc
> > > >
> > > >     - The system tries to access the bad sectors on hda
> > > >
> > > > What would happen at this point?  I assume the data would be lost,
> > > > since hdc is undergoing a re-sync, and the sectors on hda are
> > > > already bad.  Even though at boot time hdc contained good copies of
> > > > these sectors, the raid software started re-syncing onto hdc and
> > > > lost that data.  If however the raid code had just left hdc alone it
> > > > could've recovered these sectors.
> > > >
> > > > I looked at the raidtools code, and it looks to me like what is
> > > > happening is that there is a SB_CLEAN flag in the superblock that is
> > > > set to false when raid is started on an md device.  This SB_CLEAN
> > > > flag is only set to true if a clean shutdown is performed.  So if a
> > > > power outage hits, this flag is always going to be false since no
> > > > clean shutdown is performed.  At boot time the md code then checks
> > > > the SB_CLEAN flag and if it is false a resync is performed.
> > > >
> > > > It seems to me that a resync should only be required if the system
> > > > is in the middle of a write where some data has been sent to one
> > > > disk, but not yet to another.  I think the event counter already
> > > > performs this function so I don't see why the SB_CLEAN flag is even
> > > > needed.
> > > >
> > > > What do you think?  Could this SB_CLEAN flag be eliminated to reduce
> > > > the risk of a resync damaging good data?
> > > >
> >