CentOS 5.4. Running kernel is 2.6.18-92.1.22.el5. The system has two disks, each with two partitions, making up two md mirror devices. md0 is ~ 509 MB and holds /boot; md1 is ~ 69 GB (the rest of the disk) and holds an LVM PV. The following arrived in my mailbox today:

On Sun, Nov 1, 2009 at 4:22 AM, Cron Daemon <r...@liberty.gnhlug.org> wrote:
> /etc/cron.weekly/99-raid-check:
>
> WARNING: mismatch_cnt is not 0 on /dev/md0

Investigation finds:

/proc/mdstat reports everything is peachy for both mirrors: "[2/2] [UU]".

Under /sys/block/md0/md/ I find the following:

    array_state: clean
    mismatch_cnt: 256
    rd{0,1}/errors: 0
    rd{0,1}/state: in_sync

Google finds lots of people reporting similar, but nothing conclusive or particularly pertinent to this situation. Lots of people say that swap can cause this (because swap can commit a block to one member, then learn it won't ever re-read that block, and so won't bother committing it to the other member), but this is the /boot filesystem, not swap. (Swap is in an LV; the md device backing that LVM's sole PV reports a mismatch_cnt of zero.)

I did find some people saying this started happening after upgrading CentOS 5.3 -> 5.4. I did do that recently. One person said the "raid-check" script was added in 5.4. So I presume this mismatch_cnt might have been non-zero for ages, and I just never knew to look before now. mdmonitor has been running, but it mainly reports if a RAID member goes offline, and as noted, md is reporting all's quiet on the western front.

I tried unmounting the /boot filesystem and running some tests. (Since it's a separate partition and md device, and outside of LVM, I can poke at it without taking the system down.)

"e2fsck -f -n" says /dev/md0 is okay.

I tried stopping the RAID device with "mdadm --stop /dev/md0", then sync'ing disks. Then I ran "cmp /dev/sda1 /dev/sdb1". The result:

    /dev/sda1 /dev/sdb1 differ: byte 331875867, line 215880

So the two mirror members are **NOT** identical. That's usually bad.

Running "e2fsck -f -n" on each member says no trouble found. That implies whatever the mismatch is, it is not in filesystem metadata.

Running a "badblocks" read-only test on each member says no read errors.
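
cmp stops at the first differing byte, so it cannot say whether the mismatch is confined to one region or scattered across the partition. As a rough sketch (not part of the investigation described here), the following Python compares two members sector by sector and returns each run of differing sectors. The function name differing_ranges and its parameters are made up for illustration; it assumes Python 3, read access to both members, the array stopped as described above, and a 512-byte sector size.

```python
SECTOR = 512  # assumed sector size in bytes


def differing_ranges(path_a, path_b, chunk_sectors=2048):
    """Return (first_sector, last_sector) runs where the two files differ.

    Sector numbers are 0-based. Comparison stops at the end of the
    shorter file.
    """
    ranges = []
    run_start = None  # first sector of the run currently being extended
    sector = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_sectors * SECTOR)
            b = fb.read(chunk_sectors * SECTOR)
            n = min(len(a), len(b))
            if n == 0:
                break
            for off in range(0, n, SECTOR):
                end = min(off + SECTOR, n)
                if a[off:end] != b[off:end]:
                    if run_start is None:
                        run_start = sector  # a new differing run begins
                elif run_start is not None:
                    ranges.append((run_start, sector - 1))  # run just ended
                    run_start = None
                sector += 1
    if run_start is not None:
        # The files still differed when the data ran out.
        ranges.append((run_start, sector - 1))
    return ranges
```

Called as differing_ranges("/dev/sda1", "/dev/sdb1") by root, this would return a list of sector runs. The cmp offset reported above, byte 331875867, falls in 512-byte sector 648195 (331875867 / 512, rounded down), so the first run should begin there. A single contiguous run would point at one file; many scattered runs would suggest something else is going on.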

mdadm says the MD superblocks are okay, and comparing the two finds most things are the same; only the checksum and device relationships differ (expected).

One nice thing about simple mirrors is that you can mount the members read-only and examine the contents without breaking the mirror set. So:

    liberty$ sudo mount -o ro -t ext2 /dev/sda1 /mnt/sda1
    liberty$ sudo mount -o ro -t ext2 /dev/sdb1 /mnt/sdb1
    liberty$ sudo diff -r sda1 sdb1
    Binary files sda1/grub/stage2 and sdb1/grub/stage2 differ
    liberty$

(You have to mount as ext2 because ext3 will replay a journal even if you said "read-only".)

It may be normal for the GRUB stage2 files to differ in this configuration. There may be device numbers encoded into them. GRUB was installed on each disk separately, by booting from floppy, so that would do it. Or it could be that one disk has an undetected bad block and the boot loader on that disk is shot.

No other differences detected in file data, though. So between fsck and diff, it looks like most of the contents are intact. Maybe all of them.

I'm unsure as to how to proceed. The general procedure for repairing a broken mirror is to resync from the good member, assuming you can determine which is good. My problem is, I'm not sure which is the good member, or even if there *is* a good member: If GRUB writes different device numbers into the boot stage files, the two disks necessarily won't match. Which, come to think of it, is probably something to worry about, since a legit mirror resync will scrogg that.

"smartctl -a" reveals something that may be relevant. sda reports several non-zero values in the "Error counter log" section. No uncorrectable errors, but ECC has been used. At the same time, sdb reports all zeros for those same values. Further, the counts for sda have increased since the disks were installed. (I saved the output of "smartctl -a" back then. Now you see why.)
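
Comparing a saved "smartctl -a" report against a fresh one can be done by eye, but a small script makes the growth easier to track over time. The following is only a sketch under assumptions: it assumes the "Error counter log" section prints rows beginning with "read:", "write:", and "verify:" followed by numeric columns, and it makes no claim about what each column means. The function names parse_error_counters and counter_deltas are hypothetical.

```python
def parse_error_counters(text):
    """Collect the numeric columns of the read/write/verify rows.

    Assumes each row looks like "read:  <number> <number> ...".
    Returns a dict such as {"read": [..], "write": [..]}, columns in
    the order they were printed.
    """
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if sep and label in ("read", "write", "verify"):
            try:
                rows[label] = [float(tok) for tok in rest.split()]
            except ValueError:
                continue  # not a counter row after all; skip it
    return rows


def counter_deltas(old_text, new_text):
    """Return per-column increases (new minus old) for rows in both reports."""
    old = parse_error_counters(old_text)
    new = parse_error_counters(new_text)
    deltas = {}
    for label, new_row in new.items():
        old_row = old.get(label)
        if old_row is not None and len(old_row) == len(new_row):
            deltas[label] = [n - o for o, n in zip(old_row, new_row)]
    return deltas
```

Feeding it the saved report and a current one for the same disk gives, per row, how much each column has grown. A last column (uncorrected errors, if the layout is as assumed) that stays at zero while the earlier columns climb would match what is described above for sda.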

Now, ECC usage is not an automatic cause for alarm on a modern hard disk, but the fact that sda is non-zero and increasing while sdb is zero and flat suggests sdb is in better overall health. However, this probably has nothing to do with the mirror mismatch, since both disks report zero *uncorrectable* errors. Uncorrectable media defects would certainly cause a mirror mismatch, but the drives think they've been able to handle everything so far.

There are newer kernels available; the system hasn't been rebooted in 251 days. But I'm somewhat loath to try rebooting with /boot in a suspect state.

The thing I find really confusing is why "mismatch_cnt" can be non-zero while the rest of the in-kernel md monitoring stuff reports everything is good.

Anyone here have suggestions, ideas, knowledge, or even wild schemes?

-- Ben