Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz
Of course you could just run repair but then you would never know that
mismatch_cnt was > 0.


Justin.

On Sat, 24 Feb 2007, Justin Piszcz wrote:


Perhaps,

The way it works (I believe) is as follows:

1. echo check > sync_action
2. If mismatch_cnt > 0 then run:
3. echo repair > sync_action
4. Re-run #1
5. Check to make sure it is back to 0.
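
The five steps above can be sketched as a small set of shell functions. This is only a sketch under assumptions: the function names are hypothetical, the argument is taken to be the array's sysfs directory (for example /sys/block/md0/md), and the array is assumed to report "idle" in sync_action once a check or repair has finished.

```shell
# Hypothetical helpers; $1 is the md sysfs directory of the array,
# for example /sys/block/md0/md (an assumption, adjust for your array).

# Succeeds (exit status 0) when mismatch_cnt is greater than zero.
has_mismatches() {
    [ "$(cat "$1/mismatch_cnt")" -gt 0 ]
}

# Blocks until sync_action reads "idle", i.e. no check/repair is running.
wait_until_idle() {
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep 10
    done
}

# Steps 1-5: check, repair only if mismatches were counted, then re-check.
# Prints the final mismatch_cnt so the caller can confirm it is back to 0.
check_then_repair() {
    md_dir=$1
    echo check > "$md_dir/sync_action"
    wait_until_idle "$md_dir"
    if has_mismatches "$md_dir"; then
        echo repair > "$md_dir/sync_action"
        wait_until_idle "$md_dir"
        echo check > "$md_dir/sync_action"
        wait_until_idle "$md_dir"
    fi
    cat "$md_dir/mismatch_cnt"
}
```

It would be run as root, e.g. check_then_repair /sys/block/md0/md. Running the check first is what keeps the information that mismatch_cnt was nonzero before anything is rewritten.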

Justin.

On Sat, 24 Feb 2007, Eyal Lebedinsky wrote:


I did a resync since, which ended up with the same mismatch_cnt of 184.
I noticed that the count *was* reset to zero when the resync started,
but ended up with 184 (same as after the check).

I thought that the resync just calculates fresh parity and does not
bother checking if it is different. So what does this final count mean?

This leads me to ask: why bother doing a check if I will always run
a resync after an error - better run a resync in the first place?

--
Eyal Lebedinsky ([EMAIL PROTECTED]) http://samba.org/eyal/
attach .zip as .dat
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html






Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Jason Rainforest
I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2
4200+).

I then ordered a resync. The mismatch_cnt returned to 0 at the start of
the resync, but around the same time that it went up to 8 with the
check, it went up to 8 in the resync. After the resync, it still is 8. I
haven't ordered a check since the resync completed.




Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz

A resync?  You're supposed to run a 'repair', are you not?

Justin.



Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Jason Rainforest
Yes, I meant repair, sorry. I checked my bash history and I did indeed
order a repair (echo repair > /sys/block/md0/md/sync_action). I think I
called it a resync because that's what /proc/mdstat told me it was
doing.



Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz

Ahh, perhaps Neil can fix that? ;)

cat /sys/block/md0/md/sync_action will tell you what it is really doing.
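
To see the real action and the running count side by side while a check or repair is in progress, something like the following could be used. It is a sketch only: md_status is a hypothetical helper name, and the argument is assumed to be the array's sysfs directory.

```shell
# Prints the current sync action and mismatch count for one md array.
# $1 is the array's sysfs directory, e.g. /sys/block/md0/md (assumed path).
md_status() {
    printf 'sync_action=%s mismatch_cnt=%s\n' \
        "$(cat "$1/sync_action")" "$(cat "$1/mismatch_cnt")"
}
```

For example: md_status /sys/block/md0/md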




Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Michael Tokarev
Jason Rainforest wrote:
> I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
> multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2
> 4200+).
>
> I then ordered a resync. The mismatch_cnt returned to 0 at the start of

As pointed out later it was repair, not resync.

> the resync, but around the same time that it went up to 8 with the
> check, it went up to 8 in the resync. After the resync, it still is 8. I
> haven't ordered a check since the resync completed.

As far as I understand, repair will do the same as check does, but ALSO
will try to fix the problems found.  So the number in mismatch_cnt after
a repair will indicate the number of mismatches found _and fixed_.

/mjt


Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz



On Sat, 24 Feb 2007, Michael Tokarev wrote:

> As far as I understand, repair will do the same as check does, but ALSO
> will try to fix the problems found.  So the number in mismatch_cnt after
> a repair will indicate the amount of mismatches found _and fixed_
>
> /mjt



That is what I thought too (I will have to wait until I get another
mismatch to verify), but FYI:

Yesterday I had 512 mismatches for my swap partition (RAID1) after I ran
the check.

I ran repair.

I catted the mismatch_cnt again: still 512.

I re-ran the check: back to 0.

Justin.


Re: end to end error recovery musings

2007-02-24 Thread Chris Wedgwood
On Fri, Feb 23, 2007 at 09:32:29PM -0500, Theodore Tso wrote:

> And having a way of making this list available to both the
> filesystem and to a userspace utility, so they can more easily deal
> with doing a forced rewrite of the bad sector, after determining
> which file is involved and perhaps doing something intelligent (up
> to and including automatically requesting a backup system to fetch a
> backup version of the file, and if it can be determined that the
> file shouldn't have been changed since the last backup,
> automatically fixing up the corrupted data block :-).

I had a small C program + Perl script that would take a badblocks list
and figure out which files on an XFS filesystem were trashed. In the
case of XFS it's somewhat easier because you can dump the extents for a
file, but something more generic wouldn't be hard to make work. It also
wouldn't be hard to extend this to inodes in some cases, though I'm not
sure that there is much you can do there beyond fsck.
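
The lookup itself is simple once the extents are dumped. The sketch below is not the original program; it assumes a hypothetical extents file with one extent per line in the form "<path> <first_block> <last_block>", paths without spaces, and block numbers already converted to the same unit as the badblocks list.

```shell
# Prints the path of every file whose extent contains the given bad block.
# $1: bad block number
# $2: extents file, hypothetical format "<path> <first_block> <last_block>"
#     (paths must not contain spaces; units must match the badblocks list)
files_for_bad_block() {
    awk -v bad="$1" '($2 + 0) <= (bad + 0) && (bad + 0) <= ($3 + 0) { print $1 }' "$2"
}
```

Feeding each line of a badblocks list through files_for_bad_block gives the set of damaged files; blocks that match no extent belong to free space or to metadata, which is where fsck comes in.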



Re: PATA/SATA Disk Reliability paper

2007-02-24 Thread Mark Hahn

> In contrast, ever since these holes appeared, drive failures became the norm.

wow, great conspiracy theory!  maybe the hole is plugged at
the factory with a substance which evaporates at 1/warranty-period ;)

seriously, isn't it easy to imagine a bladder-like arrangement that
permits equilibration without net flow?  disk spec-sheets do limit
this - I checked the seagate 7200.10: 10k feet operating, 40k max.
amusingly -200 feet is the min either way...

>> Does anyone remember that you had to let your drives acclimate to your
>> machine room for a day or so before you used them?
>
> The problem is, that's not enough; the room temperature/humidity has to be
> controlled too.  In a desktop environment, that's not really feasible.

5-90% humidity, operating, 95% non-op, and 30%/hour.  seems pretty easy
to me.  in fact, I frequently ask people to justify the assumption that
a good machineroom needs tight control over humidity.  (assuming, like
most machinerooms, you aren't frequently handling the innards.)
