RE: errors on boot

Bruno Prior Mon, 22 Nov 1999 05:26:57 -0800
Michel,

Thanks for that. It makes things much clearer.

> Nov 16 13:36:28 korak kernel: sdb4's event counter: 00000016
> Nov 16 13:36:28 korak kernel: sda4's event counter: 00000017
> Nov 16 13:36:28 korak kernel: md: superblock update time inconsistency -- using the most recent one
> Nov 16 13:36:28 korak kernel: freshest: sda4
> Nov 16 13:36:28 korak kernel: md2: kicking faulty sdb4!

This is the crucial part. I think it speaks for itself. The mirrors are out of
sync, so the RAID code assumes that there is a problem and kicks the partition
that was updated less recently (sdb4) out of the array. Did you have an unclean
shutdown, or was there a problem with the RAID before the last shutdown such
that sdb4 was kicked out of the array? Something like this must have happened.
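
To make the comparison concrete, here is a rough Python sketch that reads the
event-counter lines quoted above and reports which partition is freshest and
which is behind. It only illustrates the idea, it is not the kernel's code. The
function names are my own, and I am assuming the counters are printed in
hexadecimal.

```python
import re

# Matches lines such as "... kernel: sdb4's event counter: 00000016".
COUNTER_RE = re.compile(r"(\w+)'s event counter: ([0-9a-fA-F]+)")


def parse_event_counters(log_text):
    """Return {device: counter} for every event-counter line in log_text."""
    counters = {}
    for device, value in COUNTER_RE.findall(log_text):
        # Assumption: the counter is printed in hexadecimal.
        counters[device] = int(value, 16)
    return counters


def freshest_and_stale(counters):
    """Return the device with the highest counter and the ones behind it."""
    freshest = max(counters, key=counters.get)
    stale = sorted(d for d, c in counters.items() if c < counters[freshest])
    return freshest, stale


log = """\
Nov 16 13:36:28 korak kernel: sdb4's event counter: 00000016
Nov 16 13:36:28 korak kernel: sda4's event counter: 00000017
"""
counters = parse_event_counters(log)
freshest, stale = freshest_and_stale(counters)
print(freshest, stale)  # prints: sda4 ['sdb4']
```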

Anyway, the solution is very simple. Just do "raidhotadd /dev/md2 /dev/sdb4".
You don't need to "raidhotremove /dev/md2 /dev/sdb4", because it has already
been kicked out of the array at startup. This will add /dev/sdb4 back into the
array. The RAID code should start resyncing it automatically once it has been
added back in. Have a look in /proc/mdstat and you should see how the resync
is going. Make sure you don't shut down before the resync has completed,
or you will be back to square one. But you can use the array quite happily while
it is resyncing.

It would be a good idea to try to figure out why the mirrors were out of sync,
in case this reveals a problem. If it was an unclean shutdown, then there's no
problem (apart from making sure you don't do it again). But if sdb4 had been
kicked out of the array, you need to know why to make sure it doesn't happen
again. If this was the case, you will need to check back through your syslog to
try to spot when it happened and what the reasons were. Or if you can't be
bothered to do this, at least keep an eye on /proc/mdstat from now on (maybe
using one of the monitoring scripts which are mentioned on this list from time
to time), to make sure that you know if it happens again.
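
If you want to roll your own check, something along these lines might do as a
starting point. It is a minimal Python sketch, not one of the scripts from the
list. It assumes each array's entry in /proc/mdstat starts with "mdN :" and
carries a status marker like [UU] or [U_]; the sample text is made up for
illustration, and the real layout varies between kernel versions, so check it
against your own /proc/mdstat first.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+)\s*:")   # start of an array's entry
STATUS_RE = re.compile(r"\[([U_]+)\]")   # status marker such as [U_]


def degraded_arrays(mdstat_text):
    """Return {array: status} for arrays whose status marker contains '_'."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
        status = STATUS_RE.search(line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


# Hypothetical sample text, for illustration only.
SAMPLE = """\
md1 : active raid1 [UU]
md2 : active raid1 [U_]
"""
print(degraded_arrays(SAMPLE))  # prints: {'md2': 'U_'}
```

In real use you would pass in the contents of /proc/mdstat instead of SAMPLE,
run it regularly (from cron, say) and mail yourself whenever the result is not
empty.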

As for the problems thrown up by the other suggestions:

> > Well, you should not try to remove /dev/sda as it's not part
> > of /dev/md2.
> >
> > Try /dev/sda4 ...
>
> I tried that also.  In fact, I tried all combinations, sorry I wasn't
> explicit about that.  I get an error that /dev/sdb4 is not in the array,
> and /dev/sda4 is busy.

This is exactly what should happen. sdb4 has been kicked out of the array (as we
have seen from dmesg), so you can't remove what isn't there. And sda4 is the
only partition in the array, so the RAID code can't let you remove it or the
whole array would fail (you can't have an array with no active partitions). So
the RAID code tells you sda4 is busy and carries on using it. It is protecting
you from yourself. As I said above, you don't need to raidhotremove sdb4,
because it has already been kicked out. Just raidhotadd it.

> > I don't know what
> > the [U_] means,
> > it means the second partition is gone
> >
> > U=ok
> > _=fubar
>
> Well I figured that.  So how do I find out if this is a logical error or
> physical?  The other two partitions on the same drive work dandy with
> their RAID, it looks like just this one partition on the drive is bad.

You've worked it out for yourself. If other partitions on the drive are OK, it's
unlikely that this is a physical problem. The full dmesg messages confirmed
that. But, as I say, you had better work out why the partition dropped out of
the array in the first place, just in case it is a physical problem or a
corruption problem.

> I guess here's the situation, I have two partitions, sda4 and sdb4, that
> comprise md2.  One of them is good and is still running, the other is,
> as someone put it, fubar.  How can I tell the system to rebuild the bad
> from the good?  Is it not that easy?

You don't tell the system to rebuild the bad from the good. You work out which
is bad (using /proc/mdstat and dmesg/syslog) and then, if it is not a physical
problem, you add it back into the array with raidhotadd. The array already
contains the good partition, so the RAID code takes care of rebuilding the bad
from the good, simply by adding it into the array.
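
As a rough illustration of the "work out which is bad" step, this Python sketch
looks for the "kicking faulty" message in the syslog text and prints the
matching raidhotadd command for you to review. It deliberately does not run
anything. The function name is my own, and it assumes the message looks exactly
like the one in your log.

```python
import re

# Matches messages such as "md2: kicking faulty sdb4!".
KICK_RE = re.compile(r"(md\d+): kicking faulty (\w+)!")


def suggest_hotadd(log_text):
    """Return one raidhotadd command line per kicked partition found."""
    return [
        f"raidhotadd /dev/{array} /dev/{partition}"
        for array, partition in KICK_RE.findall(log_text)
    ]


log = "Nov 16 13:36:28 korak kernel: md2: kicking faulty sdb4!\n"
for command in suggest_hotadd(log):
    print(command)  # prints: raidhotadd /dev/md2 /dev/sdb4
```

Only run the suggested command once you are satisfied that the problem is not
a physical one.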

Cheers,


Bruno Prior         [EMAIL PROTECTED]