Re: nonzero mismatch_cnt with no earlier error
Sorry to hijack the thread a little, but I just noticed that the mismatch_cnt for my mirror is at 256.

I'd always thought the monthly check done by the mdadm Debian package does repair as well; apparently it doesn't. So I guess I should run repair, but I'm wondering:

- is it safe / bug-free considering my oldish software (mdadm 2.5.2 + linux 2.6.17.4)?
- is there any way to check which files (if any) have been corrupted?
- I have grub installed by hand on both mirror components, but that shouldn't show up as a mismatch, should it?

The box in question is in production, so I'd rather not update mdadm and/or kernel if possible.

Chris

-
To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: nonzero mismatch_cnt with no earlier error
Justin Piszcz wrote:
> On Sat, 24 Feb 2007, Michael Tokarev wrote:
>> As far as I understand, repair will do the same as check does, but ALSO will try to fix the problems found. So the number in mismatch_cnt after a repair will indicate the amount of mismatches found _and fixed_.
>>
>> /mjt
>
> That is what I thought too (I will have to wait until I get another mismatch to verify), but FYI:
>
> Yesterday I had 512 mismatches for my swap partition (RAID1) after I ran the check. I ran repair. I catted the mismatch_cnt again, still 512. I re-ran the check, back to 0.

AFAIK the repair action will give you a count of the repairs it does, and will fail a drive if a read does not succeed after the sector is rewritten. That's the way I read it, and the way it seems to work.

--
bill davidsen [EMAIL PROTECTED]
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
Re: nonzero mismatch_cnt with no earlier error
On Sun, 25 Feb 2007, Christian Pernegger wrote:
> Sorry to hijack the thread a little, but I just noticed that the mismatch_cnt for my mirror is at 256.
>
> I'd always thought the monthly check done by the mdadm Debian package does repair as well; apparently it doesn't. So I guess I should run repair, but I'm wondering:
>
> - is it safe / bug-free considering my oldish software (mdadm 2.5.2 + linux 2.6.17.4)?
> - is there any way to check which files (if any) have been corrupted?
> - I have grub installed by hand on both mirror components, but that shouldn't show up as a mismatch, should it?
>
> The box in question is in production, so I'd rather not update mdadm and/or kernel if possible.
>
> Chris

That is a very good question. Also, I hope you are not running XFS with 2.6.17.4 (corruption bug).

Besides that, I wonder if it would be possible (with bitmaps perhaps?) to have the kernel increment that and then post via ring buffer/dmesg, something like:

  kernel: md1: mismatch_cnt: 512, file corrupted: /etc/resolv.conf

I would take a performance hit for something like that :)

Justin.
Re: nonzero mismatch_cnt with no earlier error
On Saturday February 24, [EMAIL PROTECTED] wrote:
> But is this not a good opportunity to repair the bad stripe for a very low cost (no complete resync required)?

In this case, 'md' knew nothing about an error. The SCSI layer detected something and thought it had fixed it itself. Nothing for md to do.

> At time of error we actually know which disk failed and can re-write it, something we do not know at resync time, so I assume we always write to the parity disk.

md only knows of a 'problem' if the lower level driver reports one. If it reports a problem for a write request, md will fail the device. If it reports a problem for a read request, md will try to over-write the failed block with correct data. But if the driver doesn't report the failure, there is nothing md can do.

When performing a check/repair, md looks for inconsistencies and fixes them 'arbitrarily'. For raid5/6, it just 'corrects' the parity. For raid1/10, it chooses one block and over-writes the other(s) with it.

Mapping these corrections back to blocks in files in the filesystem is extremely non-trivial.

NeilBrown
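To make the raid5 case described above concrete, here is a toy model in shell. It is only a sketch under loud assumptions: one small integer stands in for each block, the function names (xor_parity, stripe_mismatch, repair_parity) are made up for this illustration, and none of it is md's actual code. It shows why 'correcting' the parity makes a stripe consistent again without saying anything about which block was wrong.

```sh
#!/bin/sh
# Toy model only: one integer per "block". Not md's real implementation.

# XOR all arguments together. Fed the data blocks of a stripe, the result
# is what the parity block should contain.
xor_parity() {
    p=0
    for b in "$@"; do
        p=$(( p ^ b ))
    done
    echo "$p"
}

# "check": print 1 if the stored parity ($1) disagrees with the parity
# computed from the data blocks (remaining arguments), else print 0.
stripe_mismatch() {
    stored=$1
    shift
    if [ "$stored" -ne "$(xor_parity "$@")" ]; then
        echo 1
    else
        echo 0
    fi
}

# "repair" in this toy model: recompute parity from whatever the data
# blocks contain now. The data blocks themselves are never questioned.
repair_parity() {
    xor_parity "$@"
}
```

For example, with data blocks 5, 9 and 3 the parity is 15. If the middle block silently becomes 8, `stripe_mismatch 15 5 8 3` prints 1, and `repair_parity 5 8 3` prints 14: the stripe is consistent again, but the bad data block is still there.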
Re: nonzero mismatch_cnt with no earlier error
Ok, so hearing all the excitement I ran a check on a multi-disk RAID-1. One of the RAID-1 disks failed out, maybe by coincidence but presumably due to the check. (I also have another disk in the array deliberately removed as a backup mechanism.) And of course there is a big mismatch count.

Questions: will repair do the right thing for multidisk RAID-1, e.g. vote or something? Do I need a special version of mdadm to do this safely? What am I forgetting to ask?

Jeff

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdf1[0] sdb1[4] sdd1[6](F) sdc1[2] sde1[1]
      488383936 blocks [6/4] [UUU_U_]

# cat /sys/block/md1/md/mismatch_cnt
128

# cat /proc/version
Linux version 2.6.17-2-amd64 (Debian 2.6.17-7) ([EMAIL PROTECTED]) (gcc version 4.1.2 20060814 (prerelease) (Debian 4.1.1-11)) #1 SMP Thu Aug 24 16:13:57 UTC 2006

# dpkg -l | grep mdadm
ii  mdadm  1.9.0-4sarge1  Manage MD devices aka Linux Software Raid
Re: nonzero mismatch_cnt with no earlier error
Of course you could just run repair but then you would never know that mismatch_cnt was > 0.

Justin.

On Sat, 24 Feb 2007, Justin Piszcz wrote:
> Perhaps. The way it works (I believe) is as follows:
>
> 1. echo check > sync_action
> 2. If mismatch_cnt > 0 then run:
> 3. echo repair > sync_action
> 4. Re-run #1
> 5. Check to make sure it is back to 0.
>
> Justin.
>
> On Sat, 24 Feb 2007, Eyal Lebedinsky wrote:
>> I did a resync since, which ended up with the same mismatch_cnt of 184. I noticed that the count *was* reset to zero when the resync started, but ended up with 184 (same as after the check).
>>
>> I thought that the resync just calculates fresh parity and does not bother checking if it is different. So what does this final count mean?
>>
>> This leads me to ask: why bother doing a check if I will always run a resync after an error - better run a resync in the first place?
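For illustration, the numbered check / repair / re-check steps quoted in this message could be scripted along the following lines. This is a minimal sketch and was not posted to the list: the function names, the POLL_SECONDS variable and the default directory /sys/block/md0/md (borrowed from a path mentioned elsewhere in the thread) are assumptions. It would need to run as root against a real array, and it does not handle the case where the array is already busy with another sync action.

```sh
#!/bin/sh
# Sketch of the check -> repair -> re-check sequence from the thread.
# All names here are illustrative assumptions, not an official tool.

# Print the mismatch count for the array whose md sysfs directory is $1.
read_mismatch_cnt() {
    cat "$1/mismatch_cnt"
}

# Succeed (exit status 0) if the last check or repair counted mismatches.
needs_repair() {
    [ "$(read_mismatch_cnt "$1")" -gt 0 ]
}

# Write action $2 ("check" or "repair") to sync_action in directory $1,
# then poll until sync_action reads "idle" again.
run_action() {
    echo "$2" > "$1/sync_action"
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep "${POLL_SECONDS:-60}"
    done
}

# Steps 1-5: check, repair only if mismatches were counted, then check again.
check_and_repair() {
    md_dir=${1:-/sys/block/md0/md}   # assumed default, adjust as needed
    run_action "$md_dir" check
    if needs_repair "$md_dir"; then
        echo "mismatch_cnt after check: $(read_mismatch_cnt "$md_dir")"
        run_action "$md_dir" repair
        run_action "$md_dir" check
    fi
    echo "final mismatch_cnt: $(read_mismatch_cnt "$md_dir")"
}
```

Usage would be something like `check_and_repair /sys/block/md1/md`. The second check is there because, as reported elsewhere in the thread, mismatch_cnt read right after a repair still shows what the repair found, so only a fresh check can confirm the count is back to 0.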
Re: nonzero mismatch_cnt with no earlier error
I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+). I then ordered a resync. The mismatch_cnt returned to 0 at the start of the resync, but around the same time that it went up to 8 with the check, it went up to 8 in the resync. After the resync, it still is 8. I haven't ordered a check since the resync completed.

On Sat, 2007-02-24 at 04:37 -0500, Justin Piszcz wrote:
> Of course you could just run repair but then you would never know that mismatch_cnt was > 0.
>
> Justin.
>
> On Sat, 24 Feb 2007, Justin Piszcz wrote:
>> Perhaps. The way it works (I believe) is as follows:
>>
>> 1. echo check > sync_action
>> 2. If mismatch_cnt > 0 then run:
>> 3. echo repair > sync_action
>> 4. Re-run #1
>> 5. Check to make sure it is back to 0.
>>
>> Justin.
Re: nonzero mismatch_cnt with no earlier error
A resync? You're supposed to run a 'repair', are you not?

Justin.

On Sat, 24 Feb 2007, Jason Rainforest wrote:
> I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+). I then ordered a resync. The mismatch_cnt returned to 0 at the start of the resync, but around the same time that it went up to 8 with the check, it went up to 8 in the resync. After the resync, it still is 8. I haven't ordered a check since the resync completed.
>
> On Sat, 2007-02-24 at 04:37 -0500, Justin Piszcz wrote:
>> Of course you could just run repair but then you would never know that mismatch_cnt was > 0.
>>
>> Justin.
Re: nonzero mismatch_cnt with no earlier error
Yes, I meant repair, sorry. I checked my bash history and I did indeed order a repair (echo repair > /sys/block/md0/md/sync_action). I think I called it a resync because that's what /proc/mdstat told me it was doing.

On Sat, 2007-02-24 at 04:50 -0500, Justin Piszcz wrote:
> A resync? You're supposed to run a 'repair', are you not?
>
> Justin.
>
> On Sat, 24 Feb 2007, Jason Rainforest wrote:
>> I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+). I then ordered a resync. The mismatch_cnt returned to 0 at the start of the resync, but around the same time that it went up to 8 with the check, it went up to 8 in the resync. After the resync, it still is 8. I haven't ordered a check since the resync completed.
Re: nonzero mismatch_cnt with no earlier error
Ahh, perhaps Neil can fix that? ;)

Running cat /sys/block/md0/md/sync_action will tell you what it is really doing.

On Sat, 24 Feb 2007, Jason Rainforest wrote:
> Yes, I meant repair, sorry. I checked my bash history and I did indeed order a repair (echo repair > /sys/block/md0/md/sync_action). I think I called it a resync because that's what /proc/mdstat told me it was doing.
>
> On Sat, 2007-02-24 at 04:50 -0500, Justin Piszcz wrote:
>> A resync? You're supposed to run a 'repair', are you not?
>>
>> Justin.
Re: nonzero mismatch_cnt with no earlier error
Jason Rainforest wrote:
> I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+). I then ordered a resync. The mismatch_cnt returned to 0 at the start of

As pointed out later it was repair, not resync.

> the resync, but around the same time that it went up to 8 with the check, it went up to 8 in the resync. After the resync, it still is 8. I haven't ordered a check since the resync completed.

As far as I understand, repair will do the same as check does, but ALSO will try to fix the problems found. So the number in mismatch_cnt after a repair will indicate the amount of mismatches found _and fixed_.

/mjt
Re: nonzero mismatch_cnt with no earlier error
On Sat, 24 Feb 2007, Michael Tokarev wrote:
> Jason Rainforest wrote:
>> I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+). I then ordered a resync. The mismatch_cnt returned to 0 at the start of
>
> As pointed out later it was repair, not resync.
>
>> the resync, but around the same time that it went up to 8 with the check, it went up to 8 in the resync. After the resync, it still is 8. I haven't ordered a check since the resync completed.
>
> As far as I understand, repair will do the same as check does, but ALSO will try to fix the problems found. So the number in mismatch_cnt after a repair will indicate the amount of mismatches found _and fixed_.
>
> /mjt

That is what I thought too (I will have to wait until I get another mismatch to verify), but FYI:

Yesterday I had 512 mismatches for my swap partition (RAID1) after I ran the check. I ran repair. I catted the mismatch_cnt again, still 512. I re-ran the check, back to 0.

Justin.
Re: nonzero mismatch_cnt with no earlier error
I did a resync since, which ended up with the same mismatch_cnt of 184. I noticed that the count *was* reset to zero when the resync started, but ended up with 184 (same as after the check).

I thought that the resync just calculates fresh parity and does not bother checking if it is different. So what does this final count mean?

This leads me to ask: why bother doing a check if I will always run a resync after an error - better run a resync in the first place?

--
Eyal Lebedinsky ([EMAIL PROTECTED]) http://samba.org/eyal/
	attach .zip as .dat