Re: nonzero mismatch_cnt with no earlier error
Of course you could just run repair, but then you would never know that mismatch_cnt was > 0.

Justin.

On Sat, 24 Feb 2007, Justin Piszcz wrote:
> Perhaps. The way it works (I believe) is as follows:
>
> 1. echo check > sync_action
> 2. If mismatch_cnt > 0 then run:
> 3. echo repair > sync_action
> 4. Re-run #1
> 5. Check to make sure it is back to 0.
>
> Justin.
>
> On Sat, 24 Feb 2007, Eyal Lebedinsky wrote:
>> I did a resync since, which ended up with the same mismatch_cnt of 184. I noticed that the count *was* reset to zero when the resync started, but ended up with 184 (same as after the check).
>>
>> I thought that the resync just calculates fresh parity and does not bother checking if it is different. So what does this final count mean?
>>
>> This leads me to ask: why bother doing a check if I will always run a resync after an error - better run a resync in the first place?
>>
>> --
>> Eyal Lebedinsky ([EMAIL PROTECTED]) http://samba.org/eyal/
>> attach .zip as .dat

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
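The five steps quoted above can be sketched as a small shell helper. This is only an illustration and was not posted in the thread: the function names (md_mismatch, md_wait_idle, md_scrub) and the MD_POLL_SECS variable are hypothetical, and the array's md sysfs directory (for example /sys/block/md0/md) is passed in as an argument. It assumes sync_action stops reading "idle" as soon as a pass is requested and returns to "idle" once the pass has finished.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of mdadm or the kernel.
# $1 is the array's md sysfs directory, e.g. /sys/block/md0/md

# Print the current mismatch count for the array.
md_mismatch() {
    cat "$1/mismatch_cnt"
}

# Block until the array reports that no check/repair/resync is running.
# Assumption: sync_action reads "idle" only when no pass is in progress.
md_wait_idle() {
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep "${MD_POLL_SECS:-10}"
    done
}

# Steps 1-5: check, repair only if mismatches were found, then check again.
# Prints the final mismatch count.
md_scrub() {
    md=$1
    echo check > "$md/sync_action"          # step 1
    md_wait_idle "$md"
    if [ "$(md_mismatch "$md")" -gt 0 ]; then   # step 2
        echo repair > "$md/sync_action"     # step 3
        md_wait_idle "$md"
        echo check > "$md/sync_action"      # step 4
        md_wait_idle "$md"
    fi
    md_mismatch "$md"                       # step 5: should be back to 0
}
```

Run as root, something like `md_scrub /sys/block/md0/md` would print the count left after the final check. Doing the check first, rather than going straight to repair, is what lets you see whether there were any mismatches at all.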
Re: nonzero mismatch_cnt with no earlier error
I tried doing a check and found a mismatch_cnt of 8 (7*250GB SW RAID5, multiple controllers, on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+). I then ordered a resync. The mismatch_cnt returned to 0 at the start of the resync, but at around the same point where it went up to 8 during the check, it went up to 8 in the resync. After the resync, it is still 8. I haven't ordered a check since the resync completed.

On Sat, 2007-02-24 at 04:37 -0500, Justin Piszcz wrote:
> Of course you could just run repair, but then you would never know that mismatch_cnt was > 0.
>
> Justin.
Re: nonzero mismatch_cnt with no earlier error
A resync? You're supposed to run a 'repair', are you not?

Justin.

On Sat, 24 Feb 2007, Jason Rainforest wrote:
> I tried doing a check, found a mismatch_cnt of 8 (7*250GB SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+). I then ordered a resync. The mismatch_cnt returned to 0 at the start of the resync, but around the same time that it went up to 8 with the check, it went up to 8 in the resync. After the resync, it still is 8. I haven't ordered a check since the resync completed.
Re: nonzero mismatch_cnt with no earlier error
Yes, I meant repair, sorry. I checked my bash history and I did indeed order a repair (echo repair > /sys/block/md0/md/sync_action). I think I called it a resync because that's what /proc/mdstat told me it was doing.

On Sat, 2007-02-24 at 04:50 -0500, Justin Piszcz wrote:
> A resync? You're supposed to run a 'repair', are you not?
>
> Justin.
Re: nonzero mismatch_cnt with no earlier error
Ahh, perhaps Neil can fix that? ;)

cat /sys/block/md0/md/sync_action will tell you what it is really doing.

On Sat, 24 Feb 2007, Jason Rainforest wrote:
> Yes, I meant repair, sorry. I checked my bash history and I did indeed order a repair (echo repair > /sys/block/md0/md/sync_action). I think I called it a resync because that's what /proc/mdstat told me it was doing.
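Since /proc/mdstat can label a user-requested repair as a resync, reading sync_action is the less ambiguous check. A minimal sketch of that lookup follows; the function name md_describe_action is hypothetical, the md sysfs directory is passed as an argument, and the descriptions are paraphrased from the kernel's md documentation rather than from this thread.

```shell
#!/bin/sh
# Sketch only: md_describe_action is a hypothetical helper name.
# $1 is the array's md sysfs directory, e.g. /sys/block/md0/md
md_describe_action() {
    action=$(cat "$1/sync_action")
    case "$action" in
        idle)   echo "idle: no pass running" ;;
        check)  echo "check: user-requested scrub, mismatches are only counted" ;;
        repair) echo "repair: user-requested scrub that also rewrites mismatches" ;;
        *)      echo "$action" ;;   # resync, recover, etc. are passed through as-is
    esac
}
```

For example, `md_describe_action /sys/block/md0/md` during the pass Jason describes would be expected to report repair, whatever /proc/mdstat calls it.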
Re: nonzero mismatch_cnt with no earlier error
Jason Rainforest wrote:
> I tried doing a check, found a mismatch_cnt of 8 (7*250GB SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+). I then ordered a resync. The mismatch_cnt returned to 0 at the start of

As pointed out later, it was repair, not resync.

> the resync, but around the same time that it went up to 8 with the check, it went up to 8 in the resync. After the resync, it still is 8. I haven't ordered a check since the resync completed.

As far as I understand, repair will do the same as check does, but ALSO will try to fix the problems found. So the number in mismatch_cnt after a repair will indicate the number of mismatches found _and fixed_.

/mjt
Re: nonzero mismatch_cnt with no earlier error
On Sat, 24 Feb 2007, Michael Tokarev wrote:
> As far as I understand, repair will do the same as check does, but ALSO will try to fix the problems found. So the number in mismatch_cnt after a repair will indicate the number of mismatches found _and fixed_.
>
> /mjt

That is what I thought too (I will have to wait until I get another mismatch to verify), but FYI: yesterday I had 512 mismatches for my swap partition (RAID1) after I ran the check. I ran repair and catted mismatch_cnt again: still 512. I re-ran the check, and it was back to 0.

Justin.
Re: end to end error recovery musings
On Fri, Feb 23, 2007 at 09:32:29PM -0500, Theodore Tso wrote:
> And having a way of making this list available to both the filesystem and to a userspace utility, so they can more easily deal with doing a forced rewrite of the bad sector, after determining which file is involved and perhaps doing something intelligent (up to and including automatically requesting a backup system to fetch a backup version of the file, and if it can be determined that the file shouldn't have been changed since the last backup, automatically fixing up the corrupted data block :-).

i had a small c program + perl script that would take a badblocks list and figure out which files on an xfs filesystem were trashed, though in the case of xfs it's somewhat easier because you can dump the extents for a file.

something more generic wouldn't be hard to make work. it also wouldn't be hard to extend this to inodes in some cases, though i'm not sure that there is much you can do there beyond fsck.
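The extent-dump approach mentioned above could look roughly like the following. This is not the poster's C/perl tool, just a sketch under stated assumptions: it reads `xfs_bmap -v FILE` output on stdin, assumes the third column of each extent line is a `start..end` range in 512-byte units, and assumes the bad sector number is expressed relative to the same block device that holds the filesystem. The function name sector_in_extents is hypothetical.

```shell
#!/bin/sh
# Sketch only: sector_in_extents is a hypothetical helper name.
# Reads `xfs_bmap -v FILE` output on stdin and succeeds (exit 0) if the
# sector number given as $1 falls inside any of the file's extents.
sector_in_extents() {
    awk -v bad="$1" '
        # Extent lines look like "  0: [0..7]:  96..103  0 (96..103)  8".
        # The header line and holes do not match the start..end pattern.
        $3 ~ /^[0-9]+\.\.[0-9]+$/ {
            split($3, r, /\.\./)
            if (bad + 0 >= r[1] + 0 && bad + 0 <= r[2] + 0) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

A loop over candidate files, such as `xfs_bmap -v "$f" | sector_in_extents "$sector" && echo "$f"`, would then list the files touching a given bad sector. On a partition or md device the sector from a badblocks list would first have to be translated to a filesystem-relative offset, which this sketch does not attempt.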
Re: PATA/SATA Disk Reliability paper
> In contrast, ever since these holes appeared, drive failures became the norm.

wow, great conspiracy theory! maybe the hole is plugged at the factory with a substance which evaporates at 1/warranty-period ;)

seriously, isn't it easy to imagine a bladder-like arrangement that permits equilibration without net flow? disk spec-sheets do limit this - I checked the seagate 7200.10: 10k feet operating, 40k max. amusingly, -200 feet is the min either way...

> Does anyone remember that you had to let your drives acclimate to your machine room for a day or so before you used them?

> The problem is, that's not enough; the room temperature/humidity has to be controlled too. In a desktop environment, that's not really feasible.

5-90% humidity operating, 95% non-op, and 30%/hour. seems pretty easy to me. in fact, I frequently ask people to justify the assumption that a good machineroom needs tight control over humidity. (assuming, like most machinerooms, you aren't frequently handling the innards.)