Re: [CentOS] mismatch_cnt after 5.3 - 5.4 upgrade

2009-10-26 Thread Mogens Kjaer
On 10/25/2009 07:33 PM, Devin Reade wrote:
...
 WARNING: mismatch_cnt is not 0 on /dev/md0

I have two machines with software RAID 1 running CentOS,
they both gave this message this weekend.

Mogens
-- 
Mogens Kjaer, Carlsberg A/S, Computer Department
Gamle Carlsberg Vej 10, DK-2500 Valby, Denmark
Phone: +45 33 27 53 25, Mobile: +45 22 12 53 25
Email: m...@crc.dk Homepage: http://www.crc.dk


Re: [CentOS] mismatch_cnt after 5.3 - 5.4 upgrade

2009-10-26 Thread Ryan Wagoner
The /etc/cron.weekly/99-raid-check script is new in 5.4. Read through
the mdadm list archives and you will find that small mismatch counts on
RAID 1 are normal. I don't remember the exact reason, but it has to do
with aborted writes where the queue has already committed the write to
one drive but not the other. Since the mismatch is in an unused area of
the file system and mdadm can't tell when the aborted write happened,
it is just left alone. This is why it is common on swap partitions.
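
If you want to see the current value yourself, the counter the cron
script reads lives in sysfs. Something like this should show it (md0 is
just an example array name):

# cat /sys/block/md0/md/mismatch_cnt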

Ryan

On Mon, Oct 26, 2009 at 2:48 AM, Mogens Kjaer m...@crc.dk wrote:
 On 10/25/2009 07:33 PM, Devin Reade wrote:
 ...
         WARNING: mismatch_cnt is not 0 on /dev/md0

 I have two machines with software RAID 1 running CentOS,
 they both gave this message this weekend.

 Mogens
 --
 Mogens Kjaer, Carlsberg A/S, Computer Department
 Gamle Carlsberg Vej 10, DK-2500 Valby, Denmark
 Phone: +45 33 27 53 25, Mobile: +45 22 12 53 25
 Email: m...@crc.dk Homepage: http://www.crc.dk


Re: [CentOS] mismatch_cnt after 5.3 - 5.4 upgrade

2009-10-25 Thread RedShift
Devin Reade wrote:
 Saturday I did an upgrade from 5.3 (original install) to 5.4.  Saturday
 night, /etc/cron.weekly reported the following:
 
/etc/cron.weekly/99-raid-check:
 
WARNING: mismatch_cnt is not 0 on /dev/md0
 
 md0 holds /boot and resides, mirrored, on sda1 and sdb1. md1 holds
 an LVM volume containing the remaining filesystems, including swap.
 
 The underlying hardware is just a few months old, has passed the
 usual memtest stuff, and has been running 5.3 well for a few months.
 
 I'm *guessing* that due to the timing, this is related to the upgrade.
 I have to admit that I forgot myself and instead of doing the glibc
 updates as recommended, I only did:
 
   yum clean all
   yum update yum
   rpm -e --nodeps perl-5.8.8-18.el5_3.1.i386
   (see today's perl thread)
   yum update perl.x86_64
   yum update
   shutdown -r now
 
 I've taken a dump backup of /boot after the upgrade, but have not yet
 reenabled normal backups.
 
 My hunch is that something in the upgrade process touched sda1 but not
 sdb1, and that removing sdb1 from the mirror and reattaching it for 
 resync would be sufficient; however, I was looking for comments on this
 from anyone with experience or opinion on the matter.  Googling the
 issue doesn't seem to turn up any recent related results.
 
 Also, could the upgrade have touched the bootblock on sda1 but not 
 sdb1 and thus trigger this problem?
 
 Devin
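
For reference, the remove-and-reattach approach described in the quoted
message would normally be done with mdadm along these lines (md0/sdb1 as
named above; a sketch, not a recommendation):

# mdadm /dev/md0 --fail /dev/sdb1
# mdadm /dev/md0 --remove /dev/sdb1
# mdadm /dev/md0 --add /dev/sdb1

Re-adding the partition triggers a full resync from the remaining half
of the mirror; progress shows up in /proc/mdstat.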

What exactly is the mismatch_cnt value? If it's not too high, it is most
likely coming from your swap partition.

Run a check; if that doesn't fail, I wouldn't worry about it.
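
If you haven't run one by hand before, a manual check looks roughly like
this (md0 as the example array):

# echo check > /sys/block/md0/md/sync_action
# cat /proc/mdstat
# cat /sys/block/md0/md/mismatch_cnt

/proc/mdstat shows the progress of the check; once it has finished,
mismatch_cnt holds the result.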


Glenn



Re: [CentOS] mismatch_cnt after 5.3 - 5.4 upgrade

2009-10-25 Thread Ron Loftin

On Sun, 2009-10-25 at 12:33 -0600, Devin Reade wrote:
 Saturday I did an upgrade from 5.3 (original install) to 5.4.  Saturday
 night, /etc/cron.weekly reported the following:
 
/etc/cron.weekly/99-raid-check:
 
WARNING: mismatch_cnt is not 0 on /dev/md0
 
I had this happen on a box that I upgraded Friday.  I went ahead and
tested each partition in the affected mirror with badblocks ( found no
errors ) and after multiple resyncs, there was no change.  After similar
experiences with Google, I did run across a note saying that this went
away after a reboot.  I broke down and applied the Micro$lop solution
( reboot ) and the error has gone away.

Like you, I'm interested in a better understanding of this issue, so if
anyone else has more info, I'm all ears. ;)
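
A plain read-only badblocks pass on each member partition would look
something like this (sda1/sdb1 as in the original post):

# badblocks -sv /dev/sda1
# badblocks -sv /dev/sdb1

(-s shows progress, -v is verbose; the default mode is read-only, so
nothing is written to the disks.)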

 md0 holds /boot and resides, mirrored, on sda1 and sdb1. md1 holds
 an LVM volume containing the remaining filesystems, including swap.
 
 The underlying hardware is just a few months old, has passed the
 usual memtest stuff, and has been running 5.3 well for a few months.
 
 I'm *guessing* that due to the timing, this is related to the upgrade.
 I have to admit that I forgot myself and instead of doing the glibc
 updates as recommended, I only did:
 
   yum clean all
   yum update yum
   rpm -e --nodeps perl-5.8.8-18.el5_3.1.i386
   (see today's perl thread)
   yum update perl.x86_64
   yum update
   shutdown -r now
 
 I've taken a dump backup of /boot after the upgrade, but have not yet
 reenabled normal backups.
 
 My hunch is that something in the upgrade process touched sda1 but not
 sdb1, and that removing sdb1 from the mirror and reattaching it for 
 resync would be sufficient; however, I was looking for comments on this
 from anyone with experience or opinion on the matter.  Googling the
 issue doesn't seem to turn up any recent related results.
 
 Also, could the upgrade have touched the bootblock on sda1 but not 
 sdb1 and thus trigger this problem?
 
 Devin
-- 
Ron Loftin  relof...@twcny.rr.com

God, root, what is difference ?   Piter from UserFriendly



Re: [CentOS] mismatch_cnt after 5.3 - 5.4 upgrade

2009-10-25 Thread S.Tindall
On Sun, 2009-10-25 at 14:52 -0400, Ron Loftin wrote:
 On Sun, 2009-10-25 at 12:33 -0600, Devin Reade wrote:
  Saturday I did an upgrade from 5.3 (original install) to 5.4.  Saturday
  night, /etc/cron.weekly reported the following:
  
 /etc/cron.weekly/99-raid-check:
  
 WARNING: mismatch_cnt is not 0 on /dev/md0
  
 I had this happen on a box that I upgraded Friday.  I went ahead and
 tested each partition in the affected mirror with badblocks ( found no
 errors ) and after multiple resyncs, there was no change.  After similar
 experiences with Google, I did run across a note saying that this went
 away after a reboot.  I broke down and applied the Micro$lop solution
 ( reboot ) and the error has gone away.
 
 Like you, I'm interested in a better understanding of this issue, so if
 anyone else has more info, I'm all ears. ;)
 

mismatch_cnt (/sys/block/md*/md/mismatch_cnt) is the number of
unsynchronized blocks in the raid.

The repair is to rebuild the raid:

# echo repair > /sys/block/md#/md/sync_action

...which does not reset the count, but if you force a check after the
rebuild is complete:

# echo check > /sys/block/md#/md/sync_action

...then the count should return to zero.

Or at least that worked for me on three systems today.
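
Putting the two steps together on a single array (md0 used as the
example):

# echo repair > /sys/block/md0/md/sync_action
# cat /proc/mdstat
(wait until the resync finishes)
# echo check > /sys/block/md0/md/sync_action
# cat /proc/mdstat
(wait until the check finishes)
# cat /sys/block/md0/md/mismatch_cnt

The last command should print 0 afterwards.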

Steve





Re: [CentOS] mismatch_cnt after 5.3 - 5.4 upgrade

2009-10-25 Thread Devin Reade
RedShift redsh...@pandora.be wrote:

 What exactly is the mismatch_cnt value? If it's not too much, it is most 
 likely coming from your swap partition.

128.  md0 is /boot only; swap is on md1, which didn't have a problem.

Devin
-- 
A zygote is a gamete's way of producing more gametes.  This may be the
purpose of the universe.  - Robert Heinlein



Re: [CentOS] mismatch_cnt after 5.3 - 5.4 upgrade

2009-10-25 Thread Devin Reade
S.Tindall tindall.sat...@brandxmail.com wrote:

 mismatch_cnt (/sys/block/md*/md/mismatch_cnt) is the number of
 unsynchronized blocks in the raid.

Understood.  

I did the repair/check on sync_action and it got rid of the problem. (Thanks)

What I _don't_ understand is why they were unsynchronized to begin with
(`cat /proc/mdstat` showed the array to be clean). Nor do I understand
how the 'repair' operation works, and why I should believe that it's
using the correct data in its sync.  Although I've looked around, I've
not seen anything that describes how repair works and (specifically for
raid1) how it can tell which slice has the good data and which has the
bad data.

Fixing things without understanding what is going on under the covers
(at least conceptually) does not give me a warm fuzzy feeling :/

Devin
-- 
A zygote is a gamete's way of producing more gametes.  This may be the
purpose of the universe.  - Robert Heinlein
