mismatch_cnt != 0, member content mismatch, but md says the mirror is good

Ben Scott Sun, 01 Nov 2009 19:02:18 -0800

  CentOS 5.4.  Running kernel is 2.6.18-92.1.22.el5.  The system has
two disks, each with two partitions, making up two md mirror devices.
md0 is ~509 MB and holds /boot; md1 is ~69 GB (the rest of the disk)
and holds an LVM PV.  The following arrived in my mailbox today:

On Sun, Nov 1, 2009 at 4:22 AM, Cron Daemon <r...@liberty.gnhlug.org> wrote:
> /etc/cron.weekly/99-raid-check:
>
> WARNING: mismatch_cnt is not 0 on /dev/md0

  Investigation finds:

/proc/mdstat reports everything is peachy for both mirrors: "[2/2] [UU]".

Under /sys/block/md0/md/ I find the following:

        array_state: clean
        mismatch_cnt: 256
        rd{0,1}/errors: 0
        rd{0,1}/state: in_sync
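
For anyone who wants to look at their own arrays, the values above can be collected with a loop like the one below.  This is only a sketch: "md_report" is a made-up helper name, and its optional argument (a sysfs root, defaulting to /sys/block) exists only so the loop can be pointed at a scratch directory.  The attribute names are the ones listed above.

```shell
# Print array_state and mismatch_cnt for every md array under a sysfs root.
# md_report and its argument are illustrative, not part of mdadm.
md_report() {
    root="${1:-/sys/block}"
    for md in "$root"/md*/md; do
        [ -d "$md" ] || continue              # glob matched nothing: no arrays
        name=$(basename "$(dirname "$md")")   # e.g. md0
        state=$(cat "$md/array_state")
        count=$(cat "$md/mismatch_cnt")
        echo "$name: array_state=$state mismatch_cnt=$count"
    done
}
```

Calling "md_report" with no argument reads the live values under /sys/block.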

  Google finds lots of people reporting similar, but nothing
conclusive or particularly pertinent to this situation.  Lots of
people saying that swap can cause this (because swap can commit a
block to one member, then learn it won't ever re-read that block, and
so won't bother committing the other member), but this is the /boot
filesystem, not swap.  (swap is in an LV; the md device backing that
volume group's sole PV reports a mismatch_cnt of zero.)

  I did find some people saying this started happening after CentOS
5.3 -> 5.4.  I did do that recently.  One person said the "raid-check"
was added in 5.4.  So I presume this mismatch_cnt might have been
non-zero for ages, and I just never knew to look before now.
mdmonitor has been running, but it mainly reports if a RAID member
goes offline, and as noted, md is reporting all's quiet on the western
front.

  I tried unmounting the /boot filesystem and running some tests.
(Since it's a separate partition and md device, and outside of LVM, I
can poke at it without taking the system down.)

  "e2fsck -f -n" says /dev/md0 is okay.

  I tried stopping the RAID device with "mdadm --stop /dev/md0", then
sync'ing disks.  Then I ran "cmp /dev/sda1 /dev/sdb1".  The result:

        /dev/sda1 /dev/sdb1 differ: byte 331875867, line 215880
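
The byte number from cmp can be turned into a sector or filesystem block number with a little arithmetic, which helps when trying to work out what lives at that spot.  A sketch, with assumptions: "byte_to_block" is a made-up helper, cmp counts bytes from 1, and the array is assumed to use the old 0.90 metadata format (superblock at the end of the member, so member offsets and md0 offsets line up).  The filesystem block size is not stated here; 1024 is just an example value.

```shell
# Convert the 1-based byte number printed by cmp into a 0-based block number.
# byte_to_block is an illustrative helper; block size defaults to 512 bytes.
byte_to_block() {
    byte="$1"
    bs="${2:-512}"
    echo $(( (byte - 1) / bs ))
}

byte_to_block 331875867         # 512-byte sector index: 648195
byte_to_block 331875867 1024    # block index if the fs block size is 1 KiB: 324097
```

With the real block size (dumpe2fs -h reports it), the block number can be fed to debugfs "icheck" to find the owning inode, and "ncheck" to turn that inode into a path.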

  So the two mirror members are **NOT** identical.  That's usually bad.

  Running "e2fsck -f -n" on each member says no trouble found.  That
implies whatever the mismatch is, it is not in filesystem metadata.

  Running a "badblocks" read-only test on each member says no read errors.

  mdadm says the MD superblocks are okay, and comparing the two finds
most things are the same -- only the checksum and device relationships
differ (expected).

  One nice thing about simple mirrors is that you can mount the
members read-only and examine the contents without breaking the mirror
set.  So:

        liberty$ sudo mount -o ro -t ext2 /dev/sda1 /mnt/sda1
        liberty$ sudo mount -o ro -t ext2 /dev/sdb1 /mnt/sdb1
        liberty$ sudo diff -r sda1 sdb1
        Binary files sda1/grub/stage2 and sdb1/grub/stage2 differ
        liberty$

  (You have to mount as ext2 because ext3 will replay a journal even
if you said "read-only".)

  It may be normal for the GRUB stage2 to differ in this
configuration.  There may be device numbers encoded into them.  GRUB
was installed on each disk separately, by booting from floppy, so that
would do it.  Or it could be one disk has an undetected bad block and
the boot loader on that disk is shot.
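
  One way to tell those two explanations apart is to count how many bytes actually differ between the two copies.  A handful of differing bytes would fit the embedded-device-number theory; long runs of differences would look more like damage.  A minimal sketch, where "diff_bytes" is a made-up helper name built on "cmp -l":

```shell
# Count the bytes that differ between two files.
# diff_bytes is an illustrative helper; cmp -l prints one line per
# differing byte (offset, then the two byte values in octal).
diff_bytes() {
    cmp -l "$1" "$2" | wc -l | tr -d ' '
}
```

With both members mounted as above, that would be "diff_bytes /mnt/sda1/grub/stage2 /mnt/sdb1/grub/stage2".  Running "cmp -l" on its own shows the offsets, which says whether the differences are clustered in one place.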

  No other differences detected in file data, though.  So between fsck
and diff, it looks like most of the contents are intact.  Maybe all of
them.

  I'm unsure as to how to proceed.

  The general procedure for repairing a broken mirror is to resync
from the good member, assuming you can determine which is good.  My
problem is, I'm not sure which is the good member, or even if there
*is* a good member: If GRUB writes different device numbers into the
boot stage files, the two disks necessarily won't match.  Which, come
to think of it, is probably something to worry about, since a legit
mirror resync will scrogg that.

  "smartctl -a" reveals something that may be relevant.  sda reports
several non-zero values in the "Error counter log" section.  No
uncorrectable errors, but ECC has been used.  At the same time, sdb
reports all zeros for those same values.  Further, the counts for sda
have increased since the disks were installed.  (I saved the output of
"smartctl -a" back then.  Now you see why.)  Now, ECC usage is not an
automatic cause for alarm on a modern hard disk, but the fact that sda
is non-zero and increasing while sdb is zero and flat suggests sdb is
in better overall health.  However, this probably has nothing to do
with the mirror mismatch, since both disks report zero *uncorrectable*
errors.  Uncorrectable media defects would certainly cause a mirror
mismatch, but the drives think they've been able to handle everything
so far.

  There are newer kernels available; the system hasn't been rebooted
in 251 days.  But I'm somewhat loath to try rebooting with /boot in a
suspect state.

  The thing I find really confusing is why "mismatch_cnt" can be
non-zero while the rest of the in-kernel md monitoring stuff reports
everything is good.

  Anyone here have suggestions, ideas, knowledge, or even wild schemes?

-- Ben
_______________________________________________
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/
