Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Christian Pernegger

Sorry to hijack the thread a little but I just noticed that the
mismatch_cnt for my mirror is at 256.

I'd always thought the monthly check done by the mdadm Debian package
does repair as well - apparently it doesn't.
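
One quick way to see where every array stands after a scheduled check is to read mismatch_cnt for each md device. A minimal sketch, assuming the arrays expose the usual /sys/block/mdN/md/mismatch_cnt attribute; the function name report_mismatches and its optional sysfs-root argument are made up for illustration:

```shell
#!/bin/sh
# Sketch: print "mdN <mismatch_cnt>" for each md array under a sysfs root.
# The root defaults to /sys; the argument exists only so the function can be
# pointed at a scratch directory for testing (an assumption of this sketch).
report_mismatches() {
    root="${1:-/sys}"
    for f in "$root"/block/md*/md/mismatch_cnt; do
        [ -r "$f" ] || continue          # glob matched nothing, or unreadable
        dev=${f#"$root"/block/}          # drop the leading path
        dev=${dev%%/*}                   # keep only the mdN component
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}

# Usage: report_mismatches
```

This only reads the counters; it does not start a check or a repair.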

So I guess I should run repair but I'm wondering ...
- is it safe / bugfree considering my oldish software? (mdadm 2.5.2 +
linux 2.6.17.4)
- is there any way to check which files (if any) have been corrupted?
- I have grub installed by hand on both mirror components, but that
shouldn't show up as mismatch, should it?

The box in question is in production so I'd rather not update mdadm
and/or kernel if possible.

Chris
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Bill Davidsen

Justin Piszcz wrote:
> On Sat, 24 Feb 2007, Michael Tokarev wrote:
>> Jason Rainforest wrote:
>>> I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
>>> multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200
>>> +).
>>>
>>> I then ordered a resync. The mismatch_cnt returned to 0 at the start of
>>
>> As pointed out later it was repair, not resync.
>>
>>> the resync, but around the same time that it went up to 8 with the
>>> check, it went up to 8 in the resync. After the resync, it still is 8. I
>>> haven't ordered a check since the resync completed.
>>
>> As far as I understand, repair will do the same as check does, but ALSO
>> will try to fix the problems found.  So the number in mismatch_cnt after
>> a repair will indicate the amount of mismatches found _and fixed_
>>
>> /mjt
>
> That is what I thought too (I will have to wait until I get another
> mismatch to verify), but FYI--
>
> Yesterday I had 512 mismatches for my swap partition (RAID1) after I
> ran the check.
>
> I ran repair.
>
> I catted the mismatch_cnt again, still 512.
>
> I re-ran the check, back to 0.


AFAIK the repair action will give you a count of the repairs it does, 
and will fail a drive if a read does not succeed after the sector is 
rewritten. That's the way I read it, and the way it seems to work.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979




Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Justin Piszcz



On Sun, 25 Feb 2007, Christian Pernegger wrote:
> Sorry to hijack the thread a little but I just noticed that the
> mismatch_cnt for my mirror is at 256.
>
> I'd always thought the monthly check done by the mdadm Debian package
> does repair as well - apparently it doesn't.
>
> So I guess I should run repair but I'm wondering ...
> - is it safe / bugfree considering my oldish software? (mdadm 2.5.2 +
> linux 2.6.17.4)
> - is there any way to check which files (if any) have been corrupted?
> - I have grub installed by hand on both mirror components, but that
> shouldn't show up as mismatch, should it?
>
> The box in question is in production so I'd rather not update mdadm
> and/or kernel if possible.
>
> Chris



That is a very good question. Also, I hope you are not running XFS with
2.6.17.4 (corruption bug).


Besides that, I wonder if it would be possible (with bitmaps perhaps(?)) 
to have the kernel increment that and then post via ring buffer/dmesg, 
something like:


kernel: md1: mismatch_cnt: 512, file corrupted: /etc/resolv.conf

I would take a performance hit for something like that :)

Justin.



Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Neil Brown
On Saturday February 24, [EMAIL PROTECTED] wrote:
> But is this not a good opportunity to repair the bad stripe for a very
> low cost (no complete resync required)?

In this case, 'md' knew nothing about an error.  The SCSI layer
detected something and thought it had fixed it itself.  Nothing for md
to do.

 
> At time of error we actually know which disk failed and can re-write
> it, something we do not know at resync time, so I assume we always
> write to the parity disk.

md only knows of a 'problem' if the lower level driver reports one.
If it reports a problem for a write request, md will fail the device.
If it reports a problem for a read request, md will try to over-write
correct data on the failed block. 
But if the driver doesn't report the failure, there is nothing md can
do.

When performing a check/repair md looks for inconsistencies and fixes
them 'arbitrarily'.  For raid5/6, it just 'corrects' the parity.  For
raid1/10, it chooses one block and over-writes the other(s) with it.

Mapping these corrections back to blocks in files in the filesystem is
extremely non-trivial.
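
For a simple RAID1, a rough first step is to compare the member devices directly and list the sectors at which they differ. A minimal sketch, assuming the array is stopped or otherwise quiescent, that array data starts at offset 0 on each member (as with 0.90 metadata, where the superblock sits near the end of the device and will typically show up as a difference itself), and that POSIX cmp and awk are available. The function name diff_sectors is made up for illustration. It does not attempt the sector-to-file mapping, which remains the hard part:

```shell
#!/bin/sh
# Sketch: list the 512-byte sector numbers at which two files/devices differ.
# "cmp -l" prints one line per differing byte:
#   <1-based byte offset> <octal value in file 1> <octal value in file 2>
diff_sectors() {
    cmp -l "$1" "$2" | awk '
        BEGIN { last = -1 }
        {
            s = int(($1 - 1) / 512)      # 0-based sector holding this byte
            if (s != last) {             # offsets arrive in ascending order
                printf "%.0f\n", s
                last = s
            }
        }'
}

# Usage (read-only; member names are placeholders):
#   diff_sectors /dev/sdX1 /dev/sdY1
```

The sector numbers are relative to the start of the member device, so they would still need translating through the partition and filesystem layout before they say anything about files.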

NeilBrown


Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Jeff Breidenbach

Ok, so hearing all the excitement I ran a check on a multi-disk
RAID-1. One of the RAID-1 disks failed out, maybe by coincidence
but presumably due to the check. (I also have another disk in
the array deliberately removed as a backup mechanism.) And
of course there is a big mismatch count.

Questions: will repair do the right thing for multidisk RAID-1, e.g.
vote or something? Do I need a special version of mdadm to
do this safely? What am I forgetting to ask?

Jeff


# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdf1[0] sdb1[4] sdd1[6](F) sdc1[2] sde1[1]
 488383936 blocks [6/4] [UUU_U_]

# cat /sys/block/md1/md/mismatch_cnt
128

# cat /proc/version
Linux version 2.6.17-2-amd64 (Debian 2.6.17-7) ([EMAIL PROTECTED]) (gcc
version 4.1.2 20060814 (prerelease) (Debian 4.1.1-11)) #1 SMP Thu Aug
24 16:13:57 UTC 2006

# dpkg -l | grep  mdadm
ii  mdadm  1.9.0-4sarge1  Manage MD devices aka Linux Software Raid
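
The "[6/4] [UUU_U_]" status line above says the array is defined with six members and only four are currently active. A small sketch for pulling that out of /proc/mdstat automatically, assuming the two-line layout shown above and a POSIX awk; the function name degraded_arrays is made up for illustration:

```shell
#!/bin/sh
# Sketch: report md arrays whose active member count is below the configured
# count, based on the "[total/active]" field in an mdstat-format file.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                  # remember the array name
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            # Strip the brackets and split "total/active".
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0)
                printf "%s: %d of %d members active\n", name, n[2], n[1]
        }
    ' "$1"
}

# Usage: degraded_arrays /proc/mdstat
```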


Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz
Of course you could just run repair but then you would never know that
mismatch_cnt was > 0.


Justin.

On Sat, 24 Feb 2007, Justin Piszcz wrote:
> Perhaps,
>
> The way it works (I believe) is as follows:
>
> 1. echo check > sync_action
> 2. If mismatch_cnt > 0 then run:
> 3. echo repair > sync_action
> 4. Re-run #1
> 5. Check to make sure it is back to 0.
>
> Justin.
>
> On Sat, 24 Feb 2007, Eyal Lebedinsky wrote:
>> I did a resync since, which ended up with the same mismatch_cnt of 184.
>> I noticed that the count *was* reset to zero when the resync started,
>> but ended up with 184 (same as after the check).
>>
>> I thought that the resync just calculates fresh parity and does not
>> bother checking if it is different. So what does this final count mean?
>>
>> This leads me to ask: why bother doing a check if I will always run
>> a resync after an error - better run a resync in the first place?
>>
>> --
>> Eyal Lebedinsky ([EMAIL PROTECTED]) http://samba.org/eyal/
>> attach .zip as .dat
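
The five-step check/repair cycle quoted above can be sketched as a small script. This is only an illustration under stated assumptions: the argument is an md sysfs directory such as /sys/block/md0/md, sync_action reads back "idle" once an action has finished, and the function names and the POLL_SECS variable are made up here. The polling is naive and may see "idle" if it reads before the kernel has actually started the requested action:

```shell
#!/bin/sh
# Sketch of the check -> repair -> re-check cycle. Nothing runs until
# scrub_cycle is called explicitly.

# Decide the follow-up step from a mismatch count (step 2).
next_action() {
    if [ "$1" -gt 0 ]; then echo repair; else echo none; fi
}

# Request an action, then poll until the array reports "idle" again.
run_and_wait() {
    echo "$2" > "$1/sync_action"
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep "${POLL_SECS:-60}"
    done
}

scrub_cycle() {
    md=$1
    run_and_wait "$md" check                        # step 1
    cnt=$(cat "$md/mismatch_cnt")
    echo "after check: mismatch_cnt=$cnt"
    if [ "$(next_action "$cnt")" = repair ]; then   # step 2
        run_and_wait "$md" repair                   # step 3
        run_and_wait "$md" check                    # step 4
        echo "after re-check: mismatch_cnt=$(cat "$md/mismatch_cnt")"  # step 5
    fi
}

# Usage (as root): scrub_cycle /sys/block/md0/md
```

Running the check first, rather than going straight to repair, keeps a record of whether mismatch_cnt was ever nonzero.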


Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Jason Rainforest
I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200
+).

I then ordered a resync. The mismatch_cnt returned to 0 at the start of
the resync, but around the same time that it went up to 8 with the
check, it went up to 8 in the resync. After the resync, it still is 8. I
haven't ordered a check since the resync completed.


On Sat, 2007-02-24 at 04:37 -0500, Justin Piszcz wrote:
> Of course you could just run repair but then you would never know that
> mismatch_cnt was > 0.
>
> Justin.
 


Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz

A resync?  You're supposed to run a 'repair' are you not?

Justin.

On Sat, 24 Feb 2007, Jason Rainforest wrote:


> I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
> multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200
> +).
>
> I then ordered a resync. The mismatch_cnt returned to 0 at the start of
> the resync, but around the same time that it went up to 8 with the
> check, it went up to 8 in the resync. After the resync, it still is 8. I
> haven't ordered a check since the resync completed.




Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Jason Rainforest
Yes, I meant repair, sorry. I checked my bash history and I did indeed
order a repair (echo repair > /sys/block/md0/md/sync_action). I think I
called it a resync because that's what /proc/mdstat told me it was
doing.

On Sat, 2007-02-24 at 04:50 -0500, Justin Piszcz wrote:
> A resync?  You're supposed to run a 'repair' are you not?
>
> Justin.
 


Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz

Ahh, perhaps Neil can fix that? ;)

cat /sys/block/md0/md/sync_action will tell you what it is really doing.
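
A tiny wrapper around that read, as a sketch. The function name current_action is made up for illustration, and the argument is assumed to be an md sysfs directory. As far as I know the attribute reports values such as idle, check, repair, resync or recover:

```shell
#!/bin/sh
# Sketch: print "<dir>: <action>" for an md sysfs directory, or complain on
# stderr and return non-zero if the attribute is not there.
current_action() {
    if [ -r "$1/sync_action" ]; then
        printf '%s: %s\n' "$1" "$(cat "$1/sync_action")"
    else
        printf '%s: no sync_action attribute\n' "$1" >&2
        return 1
    fi
}

# Usage: current_action /sys/block/md0/md
```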


On Sat, 24 Feb 2007, Jason Rainforest wrote:


> Yes, I meant repair, sorry. I checked my bash history and I did indeed
> order a repair (echo repair > /sys/block/md0/md/sync_action). I think I
> called it a resync because that's what /proc/mdstat told me it was
> doing.



Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Michael Tokarev
Jason Rainforest wrote:
> I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
> multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200
> +).
>
> I then ordered a resync. The mismatch_cnt returned to 0 at the start of

As pointed out later it was repair, not resync.

> the resync, but around the same time that it went up to 8 with the
> check, it went up to 8 in the resync. After the resync, it still is 8. I
> haven't ordered a check since the resync completed.

As far as I understand, repair will do the same as check does, but ALSO
will try to fix the problems found.  So the number in mismatch_cnt after
a repair will indicate the amount of mismatches found _and fixed_

/mjt


Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz



On Sat, 24 Feb 2007, Michael Tokarev wrote:
> Jason Rainforest wrote:
>> I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
>> multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200
>> +).
>>
>> I then ordered a resync. The mismatch_cnt returned to 0 at the start of
>
> As pointed out later it was repair, not resync.
>
>> the resync, but around the same time that it went up to 8 with the
>> check, it went up to 8 in the resync. After the resync, it still is 8. I
>> haven't ordered a check since the resync completed.
>
> As far as I understand, repair will do the same as check does, but ALSO
> will try to fix the problems found.  So the number in mismatch_cnt after
> a repair will indicate the amount of mismatches found _and fixed_
>
> /mjt



That is what I thought too (I will have to wait until I get another 
mismatch to verify), but FYI--


Yesterday I had 512 mismatches for my swap partition (RAID1) after I ran 
the check.


I ran repair.

I catted the mismatch_cnt again, still 512.

I re-ran the check, back to 0.

Justin.


Re: nonzero mismatch_cnt with no earlier error

2007-02-23 Thread Eyal Lebedinsky
I did a resync since, which ended up with the same mismatch_cnt of 184.
I noticed that the count *was* reset to zero when the resync started,
but ended up with 184 (same as after the check).

I thought that the resync just calculates fresh parity and does not
bother checking if it is different. So what does this final count mean?

This leads me to ask: why bother doing a check if I will always run
a resync after an error - better run a resync in the first place?

-- 
Eyal Lebedinsky ([EMAIL PROTECTED]) http://samba.org/eyal/
attach .zip as .dat