Re: SMART diags (was: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-25 Thread Tom Buskey
On Wed, Feb 24, 2010 at 10:03 PM, Benjamin Scott dragonh...@gmail.com wrote:

 On Wed, Feb 24, 2010 at 8:08 AM, Tom Buskey t...@buskey.name wrote:
  They found little difference between enterprise and consumer grade
 lifetimes.

   That doesn't surprise me.  They're often the exact same hard disk
 assembly, just with different firmware, or maybe a different PCB.
 Despite marketing claims to the contrary, I wouldn't expect firmware
 tweaks to significantly improve reliability in most cases.  (*Most*.
 Reportedly, bad design in the firmware of IBM's DeskStar drives may
 have caused some of their problems back during the DeathStar
 plague.)


I hit a firmware problem migrating disks from an older NetApp to a newer one.
The old system bolted the drives in the case (hey, it was a 486, not hot
swap).  The new system used DEC StorageWorks and we had to put the drives in
plastic carriers.

Apparently, the drives could get into sync & harmonic motion with the new
setup.  We had 3-4 drive failures in one week.  The new drives had different
firmware that prevented this.  And NetApp no longer allowed you to put your
own drives in the carriers.
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: SMART diags (was: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-25 Thread Benjamin Scott
On Thu, Feb 25, 2010 at 8:48 AM, Tom Buskey t...@buskey.name wrote:
 Apparently, the drives could get into sync & harmonic motion with the new
 setup.  We had 3-4 drive failures in one week.  The new drives had different
 firmware that prevented this.

  Wow.  That's a great anecdote.  This is why engineers dread when
someone requests just a tiny little change.  Then something like
*that* happens!  :)

-- Ben

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: SMART diags (was: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-24 Thread Tom Buskey
On Tue, Feb 23, 2010 at 7:43 PM, Benjamin Scott dragonh...@gmail.com wrote:

 On Tue, Feb 23, 2010 at 6:05 PM, Ken D'Ambrosio k...@jots.org wrote:
  Huh -- I actually *have* had SMART tell me things were awry, several
  times.

   Well, that's good to know.  :)

  Just curious, did you get a chance to see if any of them actually
 started failing soon after?

  Like I said, I did have one case where SMART said something was
 wrong, but nobody could figure out why it was saying that, and they
 only did an exchange because I insisted.  And, of course, since it was
 a service contract, I couldn't keep the old part to see if/when it
 would actually start showing other symptoms.


Google released a study of hard drive failures (last year?).  Another
organization (CERN?) released one at about the same time.

It said SMART is about 50/50 as a defect predictor, so not all that reliable.
They found little difference between enterprise and consumer grade
lifetimes.  Once drives have errors, they multiply quickly.

It is definitely worth digging up.
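
For anyone who wants to eyeball their own drives against the attributes
those studies tracked (reallocations, pending sectors, scan errors),
smartmontools will dump them.  A minimal sketch, assuming the disk is
/dev/sda:

   smartctl -H /dev/sda   # the drive's overall self-assessment
   smartctl -A /dev/sda | egrep \
       'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

Rising raw values on those attributes are the "errors multiply quickly"
pattern the papers describe.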
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: SMART diags (was: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-24 Thread Benjamin Scott
On Wed, Feb 24, 2010 at 8:08 AM, Tom Buskey t...@buskey.name wrote:
 They found little difference between enterprise and consumer grade lifetimes.

  That doesn't surprise me.  They're often the exact same hard disk
assembly, just with different firmware, or maybe a different PCB.
Despite marketing claims to the contrary, I wouldn't expect firmware
tweaks to significantly improve reliability in most cases.  (*Most*.
Reportedly, bad design in the firmware of IBM's DeskStar drives may
have caused some of their problems back during the DeathStar
plague.)

 Once drives have errors, they multiply quickly.

  That much is explained by received wisdom: Modern hard drives are
designed with a certain amount of redundancy.  They use ECC on a
block-by-block basis (helping recover from single-bit errors
on-the-fly), and they have a certain number of spare blocks.  For I/O
errors to start being returned to the OS, the problems have to have overwhelmed
the drive's internal mechanisms.  (For example, maybe the spare blocks
are all used up.)  Glitches that previously could be compensated for
instead now yield I/O errors.

-- Ben
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: SMART diags (was: Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-24 Thread Benjamin Scott
P.S.:

On Tue, Feb 23, 2010 at 9:32 PM, Michael Bilow
mik...@colossus.bilow.com wrote:
 At this point, an unreadable block encountered on a block device is
 handled at a very high level, usually the file system, well above
 where things like AWRE on the hardware can occur.

  Heck, it's not even handled by the filesystem.  It usually goes
something like this: HDD returns error to the controller, controller
driver returns error to the block device layer, block layer returns
error to filesystem, filesystem returns error to C library, C library
returns error to application, application pukes on its shoes, sysadmin
gets a call at 3 AM saying the server is down.  ;-)
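
You can watch the tail end of that chain from userland; a sketch, with
a made-up LBA standing in for a real bad sector:

   # read one 512-byte sector straight off the device (LBA is hypothetical)
   dd if=/dev/sda of=/dev/null bs=512 skip=123456789 count=1 iflag=direct
   # on a bad sector the kernel logs the low-level error, the block layer
   # passes EIO up, and dd exits non-zero with "Input/output error"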

-- Ben
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-23 Thread Tom Buskey
On Mon, Feb 22, 2010 at 3:14 PM, Benjamin Scott dragonh...@gmail.com wrote:

 On Mon, Feb 22, 2010 at 1:39 PM, Michael ODonnell
 michael.odonn...@comcast.net wrote:
  So far, then, it's looking like every Sunday at 4:22 all the RAIDs
  (all types or just RAID1?) in standard x86_64 CentOS5.4 (and RHAT?)
  boxes are broken and then resync'd.


FWIW, I don't see this in my logs (back to 1/24/2010, not far) on Fedora 12.


   All types (as I interpret the script source).

  If the documentation is to be believed, they are not being broken;
 they are being checked for consistency.  Not the same thing.  Breaking
 and rebuilding leaves the array vulnerable during the rebuild, as you
 note.  A consistency check just compares the supposedly identical
 members to confirm they really *are* identical, and warns you if
 they are not.





  With a good RAID implementation, I/O for patrol reads is done when
 the array is idle.  (Kind of like nice 19 for I/O.)  I don't know if
 Linux does this or not.


The correct terminology is a scrub.  I think most RAID systems can do it.
It's like an fsck: something that checks the RAID structures and data to
find inconsistencies so they can be dealt with.

Scrubs can be done live and are a good thing to do.  They take I/O and time.
ZFS doesn't have fsck, but it does do scrubs.  Hardware RAID can do scrubs
as well.

From man zpool on Solaris:

 Scrubbing and resilvering are very similar operations.
 The difference is that resilvering only examines data
 that ZFS knows to be out of date (for example, when
 attaching a new device to a mirror or replacing an
 existing device), whereas scrubbing examines all data to
 discover silent errors due to hardware faults or disk
 failure.  Because scrubbing and resilvering are I/O-
 intensive operations, ZFS only allows one at a time.  If
 a scrub is already in progress, the "zpool scrub" command
 terminates it and starts a new scrub.  If a resilver is
 in progress, ZFS does not allow a scrub to be started
 until the resilver completes.


How often to do them gets debated in the forums.  Times vary with activity,
hardware, and size.  My consumer-grade RAIDZ of four 500 GB SATA II drives
takes an hour.
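
Kicking one off and watching it is a two-liner; a sketch, with a
hypothetical pool name:

   zpool scrub tank    # start (or restart) a scrub on pool "tank"
   zpool status tank   # reports scrub progress, percent done, errors found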

Your RAID system will be different.
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-23 Thread Benjamin Scott
On Tue, Feb 23, 2010 at 9:40 AM, Tom Buskey t...@buskey.name wrote:
  ... patrol reads ...

 The correct terminology is a scrub.

  Dell and LSI Logic call it patrol read.

  I believe I've seen Adaptec call it consistency check, although
that was a long time ago.

  What makes your terminology more correct than theirs?  :-)

-- Ben

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-23 Thread Michael ODonnell


I executed commands as they would have been during the cron.weekly run and
I can now see why our simple monitor script would conclude the RAID had
a problem based on the resultant contents of /proc/mdstat.  During the
check operation the RAID state is described as "clean, resyncing"
by mdadm and I don't know whether the RAID should be regarded as being
fault-tolerant in that state, though Mr.  Bilow indicated that it should
and I see no screaming evidence to the contrary.

Before the check operation:

   ### ROOT ### cbc1:~ 545--- cat /sys/block/md0/md/array_state
   clean

   ### ROOT ### cbc1:~ 546--- cat /sys/block/md0/md/sync_action
   idle

   ### ROOT ### cbc1:~ 547--- cat /proc/mdstat
   Personalities : [raid1]
   md0 : active raid1 sdb5[1] sda5[0]
 951409344 blocks [2/2] [UU]

   unused devices: <none>

Trigger the check:

   ### ROOT ### cbc1:~ 548--- echo check > /sys/block/md0/md/sync_action

After the check:

   ### ROOT ### cbc1:~ 549--- cat /sys/block/md0/md/array_state
   clean

   ### ROOT ### cbc1:~ 550--- cat /sys/block/md0/md/sync_action
   check

   ### ROOT ### cbc1:~ 551--- cat /proc/mdstat
   Personalities : [raid1]
   md0 : active raid1 sdb5[1] sda5[0]
 951409344 blocks [2/2] [UU]
  [>....................]  resync =  0.1% (958592/951409344) 
finish=132.1min speed=119824K/sec

   unused devices: <none>

   ### ROOT ### cbc1:~ 552--- mdadm --query --detail /dev/md0
   /dev/md0:
           Version : 0.90
     Creation Time : Fri Jan 22 11:08:38 2010
        Raid Level : raid1
        Array Size : 951409344 (907.33 GiB 974.24 GB)
     Used Dev Size : 951409344 (907.33 GiB 974.24 GB)
      Raid Devices : 2
     Total Devices : 2
   Preferred Minor : 0
       Persistence : Superblock is persistent

       Update Time : Tue Feb 23 09:42:14 2010
             State : clean, resyncing
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

    Rebuild Status : 0% complete

              UUID : daf8dd0b:00087a40:d5caa7ee:ae05b3aa
            Events : 0.56

       Number   Major   Minor   RaidDevice State
          0       8        5        0      active sync   /dev/sda5
          1       8       21        1      active sync   /dev/sdb5

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-23 Thread Tom Buskey
On Tue, Feb 23, 2010 at 9:53 AM, Benjamin Scott dragonh...@gmail.com wrote:

 On Tue, Feb 23, 2010 at 9:40 AM, Tom Buskey t...@buskey.name wrote:
   ... patrol reads ...
 
  The correct terminology is a scrub.

   Dell and LSI Logic call it patrol read.

  I believe I've seen Adaptec call it consistency check, although
 that was a long time ago.

  What makes your terminology more correct than theirs?  :-)


You've got me there.

I remember talking about  scheduling RAID scrubs in 1996 for a Micropolis
RAIDON SCSI raid and on a NetApp.  I think DEC Storageworks talked about
RAID scrubs for their systems in 1995.

Dell has some documentation (c) 1999 referring to scrubs on RAID
controllers.

This article on choosing a storage system:
http://www.information-management.com/issues/19971101/928-1.html from 1997
says to look for a storage unit featuring disk scrubbing.

Scrubbing is also mentioned for ECC memory functions and I'd assume that's
where the RAID scrub comes from.

I think this is more of a jargon issue, with scrub as the generic term for
the brand-specific marketing names.  Like some network admins saying rowter
and others saying rooter for router.
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-23 Thread Michael ODonnell


In finest NIH form we could deal with the scrubber/patrol terminology
question by inventing a new acronym.  How about GRIDLEBYRF for
Gratuitous Reads Intended to Detect Latent Errors Before You're Royally
Fscked ?  FWIW, back around 2003 I wrote such logic for an early release
of MD on Red Hat 9 and we called it a scrubber, though I'm not sure who
came up with that term or why...
 
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-23 Thread Tom Buskey
On Tue, Feb 23, 2010 at 11:51 AM, Michael ODonnell 
michael.odonn...@comcast.net wrote:


  FWIW, back around 2003 I wrote such logic for an early release
 of MD on Red Hat 9 and we called it a scrubber, though I'm not sure who
 came up with that term or why...


Probably because it was common knowledge in the peer group.  Why do we
call a quick fix a kludge, a hack, or duct-taped?

Gone haywire can be traced to lumbermen, believe it or not.

They had axes, saws and horses to harvest lumber in the woods.  Hay was
delivered to feed the horses.  It was baled with wire, so there was always
lots of scrap wire around.  Inevitably, when something broke, haywire was
used to fix it.  If you saw wire-wrapped axe handles, hinges bound with it,
etc., it was a sign the camp was going downhill and about to pull up
stakes or go bankrupt.  I suppose now we might say gone duct tape.  Or a
rusty car gone bondo.
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-23 Thread Bill McGonigle
On 02/22/2010 06:28 PM, Benjamin Scott wrote:
However, looking at the difference above, I think they're different
 *in the wrong way*.  It looks like one of the disks specifies
 (hd0,0)/grub/grub.conf while the other just specifies
 /grub/grub.conf.  That doesn't seem right.

Looking at one of my CentOS 5.4 boxes, /boot on raid-1, viewed through the md 
driver, stage2 is using:

  (hd0,0)/grub/grub.conf

0001020  \0  \0   0   .   9   7  \0   (   h   d   0   ,   0   )   /   g
0001040   r   u   b   /   g   r   u   b   .   c   o   n   f  \0  \0  \0

Then taking stage2 from both mirror halves, cmp returns no differences.

Although I *did* test booting from either disk before deploying this
 system.  Although again... there was a lot going on at the time.
 Maybe I screwed up my tests, too?  Hmmm.

The theory fits the data, though my understanding of grub innards is incomplete.

I'm assuming sda maps to the first BIOS drive and so has been the working one 
on boot?  I'm not sure what happens if you issue a 'repair' - if the RAID 
headers are both showing the same mtime, which would it pick?  Probably better 
to --fail out sdb1, touch something on the array, and add it back in.
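
Something like this, I'd guess (a sketch using the device names from this
thread; double-check before running it):

   mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
   # touch something on the mounted array so sda1 is unambiguously current
   mdadm /dev/md0 --add /dev/sdb1   # re-adding triggers a full resync from sda1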

If GRUB was installed from Linux, against the RAID device (/dev/md0
 or whatever), then I would think the two members should be identical,
 because the RAID driver in the kernel would have written the same
 blocks to both members.

I'm not sure, I think I've seen the grub-install script detect the BIOS drives 
of an md array.  Again, I don't really know how it works.

-Bill

-- 
Bill McGonigle, Owner   
BFC Computing, LLC   
http://bfccomputing.com/ 
Telephone: +1.603.448.4440
Email, IM, VOIP: b...@bfccomputing.com   
VCard: http://bfccomputing.com/vcard/bill.vcf
Social networks: bill_mcgonigle/bill.mcgonigle
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-23 Thread Michael Bilow
Commanding "check" to the md device is ordinarily a read-only 
operation, despite the terminology in the log that says "resyncing".

During the md check operation, the array is clean (not degraded) 
and you can see that explicitly with the [UU] status report; if 
the array were degraded the failed device would be marked with an 
underscore (that is, array status would be [U_] or [_U]).
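
You can read the same state straight out of sysfs rather than eyeballing
mdstat; a small sketch (the degraded attribute appeared in later kernels,
so it may be absent on CentOS 5's 2.6.18):

   cat /sys/block/md0/md/degraded      # 0 when no members are missing
   cat /sys/block/md0/md/sync_action   # reads "check" during a consistency pass
   grep -o '\[[U_]*\]' /proc/mdstat    # the [UU] / [U_] member map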

It is not a scrub because it does not attempt to repair anything. 
In ancient days, it was necessary to refresh data periodically by 
reading it and rewriting it to make sure it was not decaying due to 
changes in temperature, head position, dimensional stability, and so 
forth. The term comes from Middle English, where scrub means to 
remove impurities and is etymologically related to scrape; the 
original use of the term in computing is for core memory from which 
it was later applied to dynamic RAM and eventually to disks.

If a hardware read error is encountered during the check, the md 
driver handles this in the same way as a hardware read error that is 
encountered at any other time. Depending upon the RAID mode, it may 
attempt to reconstruct the failed sector and write it, possibly 
triggering the physical drive to reallocate a spare sector. More 
commonly, the md device will mark the physical drive as failed and 
degrade the array. Detecting and reporting soft failure incidents 
such as reallocations of spare sectors is the job of something like 
smartmontools, which can and should be configured to look past the 
md device and monitor the physical drives that are its components.
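
In /etc/smartd.conf terms, that means listing the raw members rather than
the md device; a sketch (the mail address is a placeholder):

   /dev/sda -a -m root@localhost   # monitor all attributes on sda, mail on trouble
   /dev/sdb -a -m root@localhost   # same for sdb; don't list /dev/md0 here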

This consistency check is not strictly guaranteed to be read-only 
because it can trigger the array to drop to degraded mode depending 
upon what is encountered, but this (as far as I know) only occurs 
when there is some underlying hardware problem beyond merely 
different data. If the array is so configured, such as having a 
hot-spare device on-line, then the degradation incident can trigger 
writing operations.

-- Mike


On 2010-02-23 at 10:12 -0500, Michael ODonnell wrote:



 I executed commands as they would have been during the cron.weekly run and
 I can now see why our simple monitor script would conclude the RAID had
 a problem based on the resultant contents of /proc/mdstat.  During the
 check operation the RAID state is described as "clean, resyncing"
 by mdadm and I don't know whether the RAID should be regarded as being
 fault-tolerant in that state, though Mr.  Bilow indicated that it should
 and I see no screaming evidence to the contrary.

 Before the check operation:

   ### ROOT ### cbc1:~ 545--- cat /sys/block/md0/md/array_state
   clean

   ### ROOT ### cbc1:~ 546--- cat /sys/block/md0/md/sync_action
   idle

   ### ROOT ### cbc1:~ 547--- cat /proc/mdstat
   Personalities : [raid1]
   md0 : active raid1 sdb5[1] sda5[0]
 951409344 blocks [2/2] [UU]

   unused devices: <none>

 Trigger the check:

   ### ROOT ### cbc1:~ 548--- echo check > /sys/block/md0/md/sync_action

 After the check:

   ### ROOT ### cbc1:~ 549--- cat /sys/block/md0/md/array_state
   clean

   ### ROOT ### cbc1:~ 550--- cat /sys/block/md0/md/sync_action
   check

   ### ROOT ### cbc1:~ 551--- cat /proc/mdstat
   Personalities : [raid1]
   md0 : active raid1 sdb5[1] sda5[0]
 951409344 blocks [2/2] [UU]
  [>....................]  resync =  0.1% (958592/951409344) 
 finish=132.1min speed=119824K/sec

   unused devices: <none>

   ### ROOT ### cbc1:~ 552--- mdadm --query --detail /dev/md0
   /dev/md0:
           Version : 0.90
     Creation Time : Fri Jan 22 11:08:38 2010
        Raid Level : raid1
        Array Size : 951409344 (907.33 GiB 974.24 GB)
     Used Dev Size : 951409344 (907.33 GiB 974.24 GB)
      Raid Devices : 2
     Total Devices : 2
   Preferred Minor : 0
       Persistence : Superblock is persistent

       Update Time : Tue Feb 23 09:42:14 2010
             State : clean, resyncing
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

    Rebuild Status : 0% complete

              UUID : daf8dd0b:00087a40:d5caa7ee:ae05b3aa
            Events : 0.56

       Number   Major   Minor   RaidDevice State
          0       8        5        0      active sync   /dev/sda5
          1       8       21        1      active sync   /dev/sdb5

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-23 Thread Benjamin Scott
On Tue, Feb 23, 2010 at 2:01 PM, Michael Bilow
mik...@colossus.bilow.com wrote:
 During the md check operation, the array is clean (not degraded)
 and you can see that explicitly with the [UU] status report ...

  Of course, mdstat still calls the array clean even after
mismatches are detected, which isn't what I'd usually call clean...
:-)

 It is not a scrub because it does not attempt to repair anything.

  Comments in the previously mentioned config file don't make it sound
like that.  A check operation will scan the drives looking for bad
sectors and automatically repairing only bad sectors.  It doesn't
explain how it would repair bad sectors.  Perhaps it means the bad
sectors will be repaired by failing the entire member and having the
sysadmin insert a new disk.  Perhaps the comments are just wrong.

  Not arguing with you, just reporting what the file told me.  Would
the file lie?  ;-)

 Detecting and reporting soft failure incidents
 such as reallocations of spare sectors ...

  The relocation algorithm in modern disks generally works like this
(or so I'm told):

R1. OS requests read logical block from HDD.  HDD tries to read from
block on disk, and can't, even with retries and ECC.  HDD returns
failure to the OS, and marks that physical block as bad and as a
candidate for relocation.

R2. Repeated attempts by OS to read from the same block cause the HDD
to retry.  It won't throw away your data on its own.

R3. OS requests write to same logical block.  HDD relocates to a
different physical block, and throws away the bad block.  It can do
that now, since you've told it you don't want the data that was there,
by writing new data over it.

  It would be nice if hard disks were smart enough to detect a block
that was getting marginal and preemptively relocate it.  Last I looked
into this (admittedly, several years ago), they didn't do that.  Maybe
they've gotten smarter about that.  If they haven't gotten smarter, if
the check operation reads all the blocks on the disk but never
writes, that alone won't trigger relocation of a bad block.  The
check operation would have to read the good block from the other
disk, and attempt to rewrite it to the bad disk.  *That* might trigger
a useful relocation by the HDD with the bad block.
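
If you ever need to force that rewrite by hand, dd can do it, at the cost
of whatever was in the sector; a sketch, with a made-up LBA (this writes
to the raw disk, so triple-check the numbers):

   # DESTRUCTIVE: overwrites one 512-byte sector (LBA here is hypothetical)
   dd if=/dev/zero of=/dev/sdb bs=512 seek=123456789 count=1 oflag=direct
   # the write gives the drive license to remap; then see if it did:
   smartctl -A /dev/sdb | grep -i reallocated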

 smartmontools, which can and should be configured to look past the
 md device and monitor the physical drives that are its components.

  While I run smartd in monitor mode, I've never had it give me a
useful pre-failure alert.  Likewise, I've never had the SMART health
check in PC BIOSes give me a useful pre-failure alert.  More than once
I've seen SMART report the overall health check as PASS when the
whole damn disk is unreadable.  It makes me wonder just what the
overall SMART health is supposed to indicate -- Yes, the HDD is
physically present?  :)

  I did once have the BIOS check start reporting a SMART health
warning, but all the OEM diagnostics, smartctl, badblocks -w, etc.,
didn't actually report anything wrong.  The reseller replaced the
drive at my insistence.  Maybe the SMART health check knew something
that none of the other SMART parameters were reporting.

-- Ben
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


SMART diags (was: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-23 Thread Ken D'Ambrosio
On Tue, February 23, 2010 5:43 pm, Benjamin Scott wrote:

 While I run smartd in monitor mode, I've never had it give me a
 useful pre-failure alert.  Likewise, I've never had the SMART health check
 in PC BIOSes give me a useful pre-failure alert.  More than once I've seen
 SMART report the overall health check as PASS when the
 whole damn disk is unreadable.  It makes me wonder just what the overall
 SMART health is supposed to indicate -- Yes, the HDD is
 physically present?  :)

Huh -- I actually *have* had SMART tell me things were awry, several
times.  Mind you -- not nearly as often as I've had disks die w/o being
proactively informed, but I'd probably put the ratio as high as 25%+.

I guess YMMV is more than just a catchphrase, after all...

-Ken



___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: SMART diags (was: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-23 Thread Benjamin Scott
On Tue, Feb 23, 2010 at 6:05 PM, Ken D'Ambrosio k...@jots.org wrote:
 Huh -- I actually *have* had SMART tell me things were awry, several
 times.

  Well, that's good to know.  :)

  Just curious, did you get a chance to see if any of them actually
started failing soon after?

  Like I said, I did have one case where SMART said something was
wrong, but nobody could figure out why it was saying that, and they
only did an exchange because I insisted.  And, of course, since it was
a service contract, I couldn't keep the old part to see if/when it
would actually start showing other symptoms.

-- Ben
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/



SMART diags (was: Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-23 Thread Michael Bilow
On 2010-02-23 at 17:43 -0500, Benjamin Scott wrote:

 On Tue, Feb 23, 2010 at 2:01 PM, Michael Bilow
 mik...@colossus.bilow.com wrote:
 During the md check operation, the array is clean (not degraded)
 and you can see that explicitly with the [UU] status report ...

  Of course, mdstat still calls the array clean even after
 mismatches are detected, which isn't what I'd usually call clean...
 :-)

The term "clean" in this context just means that all of the RAID 
components (physical drives) are still present.

 It is not a scrub because it does not attempt to repair anything.

  Comments in previously mentioned config file don't make it sound
 like that.  A check operation will scan the drives looking for bad
 sectors and automatically repairing only bad sectors.  It doesn't
 explain how it would repair bad sectors.  Perhaps it means the bad
 sectors will be repaired by failing the entire member and having the
 sysadmin insert a new disk.  Perhaps the comments are just wrong.

  Not arguing with you, just reporting what the file told me.  Would
 the file lie?  ;-)

That's sort of true and sort of not true, but generally outdated. It 
is important to appreciate that the md device operates at a level 
of abstraction above block devices that isolates it from low-level 
details that are handled by whatever driver manages the block 
devices. For something like a parallel IDE drive -- or, heaven 
forbid, an ST-506 drive -- there is not a lot of intelligence on 
board the drive that will mask error conditions: a read error is a 
read error.

When SCSI (meaning SCSI-2) was developed, it provided for a ton of 
settable parameters, some vendor-independent and some proprietary. 
Among these were mode page bits that controlled what the device 
would do by default on encountering errors during read or write, 
notably the ARRE (automatic read reallocation) and AWRE 
(automatic write reallocation) bits. Exactly what a device does when 
these bits are asserted is not too well specified, especially 
considering that a disk and a tape may have radically different 
ranges of options but use the same basic SCSI command set. In 
practice, I can't think of any reasonable way to implement ARRE: 
it's almost always worse to return bad data from a read operation 
with a success code than to just have the read operation report a 
failure code outright.

(ATAPI is essentially a protocol for wrapping SCSI commands and 
responses into packets for ATA devices, so the same logic applies.)

 Detecting and reporting soft failure incidents
 such as reallocations of spare sectors ...

  The relocation algorithm in modern disks generally works like this
 (or so I'm told):

 R1. OS requests read logical block from HDD.  HDD tries to read from
 block on disk, and can't, even with retries and ECC.  HDD returns
 failure to the OS, and marks that physical block as bad and as a
 candidate for relocation.

At this point, an unreadable block encountered on a block device is 
handled at a very high level, usually the file system, well above 
where things like AWRE on the hardware can occur. This is where the 
md driver will intervene, attempting to reconstruct the unreadable 
block from its reservoir of redundancy (the other copy if RAID-1, 
the other stripes if RAID-5). If the md driver can reconstruct the 
unreadable data, it will attempt to write the correct data back to 
the block device: at this point, the hardware may reallocate a spare 
sector for the new data. Unless a write occurs somehow, though, even 
with AWRE enabled the hardware should not reallocate a sector.

When a write succeeds and forces an AWRE event, the hardware 
test-reads the newly written data and returns an error if the data 
could not be verified. By this stage, the md device may have had 
cause to mark the whole block device as bad and degrade the array.

 R2. Repeated attempts by OS to read from the same block cause the HDD
 to retry.  It won't throw away your data on its own.

Correct, in all practical cases the hardware will never reallocate a 
bad block on read operations. The SCSI protocol provides for ARRE, 
but as I noted this is never really implemented.

 R3. OS requests write to same logical block.  HDD relocate to
 different physical block, and throws away the bad block.  It can do
 that now, since you've told it you don't want the data that was there,
 by writing new data over it.

Again, exactly what happens is going to vary a lot with the 
particular hardware. Older drives, even parallel ATA drives, 
generally cannot reallocate a spare sector on the fly during normal 
operation, but can only do it during a low-level format operation of 
the whole drive. This is because the reserve of spare sectors on 
such drives is associated with physical zones, so that reallocation 
can only occur during a track-granular write operation.

In my experience, nearly all SCSI drives have AWRE disabled from the 
factory, and it is up to the operating system to enable it. Linux 

Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-22 Thread Michael ODonnell


Anybody else running CentOS5.x (or RHAT equiv?) care to share the
results from this command:

grep -i sync /var/log/* | fgrep -i raid

It looks like the RAIDs on at least seven of our (mostly stock) CentOS5.4
systems are routinely getting broken and going through a resync operation
on a weekly basis at 4:22am which is when that /etc/cron.weekly script
runs that's generating the mismatch_cnt warnings in question.  I mightn't
have noticed this except that we have a script that periodically checks
/proc/mdstat and pops a dialog box on the X display if it appears we
don't have a healthy RAID, and I was astounded to see one on each of
those machines when I arrived this morning.  Heads up...
 
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-22 Thread Ben Eisenbraun
On Mon, Feb 22, 2010 at 10:06:57AM -0500, Michael ODonnell wrote:
 Anybody else running CentOS5.x (or RHAT equiv?) care to share the
 results from this command:
 
 grep -i sync /var/log/* | fgrep -i raid

Dude.  Sweet.

 1$ sudo grep -i sync /var/log/* | fgrep -i raid
Password: 
/var/log/messages:Feb 21 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID 
array md0
/var/log/messages:Feb 21 04:22:02 sbgrid-dev-architect kernel: md: syncing RAID 
array md3
/var/log/messages.1:Feb 14 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md2
/var/log/messages.1:Feb 14 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md0
/var/log/messages.1:Feb 14 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md3
/var/log/messages.2:Feb  7 04:22:01 sbgrid-dev-architect kernel: md: syncing 
RAID array md0
/var/log/messages.2:Feb  7 04:22:01 sbgrid-dev-architect kernel: md: syncing 
RAID array md3
/var/log/messages.3:Jan 31 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md2
/var/log/messages.3:Jan 31 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md0
/var/log/messages.3:Jan 31 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md3
/var/log/messages.4:Jan 24 04:22:06 sbgrid-dev-architect kernel: md: syncing 
RAID array md0
/var/log/messages.4:Jan 24 04:22:06 sbgrid-dev-architect kernel: md: syncing 
RAID array md3

That's a CentOS 5.4 x86_64 box.

md0 is a mirror of /boot.
md1 (not listed up there) is another mirror on the same disks (/, /home,
/var, swap).
md2 is a RAID 10 data volume.
md3 is another mirror.

-ben

--
if you plan for a year, sow a seed.  if you plan for a decade, plant a 
tree.  if you plan for a century, educate the people.  --chuang tzu
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-22 Thread Michael ODonnell


Ruh-rohhh

/var/log/messages:   Feb 21 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md0
/var/log/messages:   Feb 21 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md3
/var/log/messages.1: Feb 14 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md2
/var/log/messages.1: Feb 14 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md0
/var/log/messages.1: Feb 14 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md3
/var/log/messages.2: Feb 7  04:22:01 sbgrid-dev-architect kernel: md: syncing 
RAID array md0
/var/log/messages.2: Feb 7  04:22:01 sbgrid-dev-architect kernel: md: syncing 
RAID array md3
/var/log/messages.3: Jan 31 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md2
/var/log/messages.3: Jan 31 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md0
/var/log/messages.3: Jan 31 04:22:02 sbgrid-dev-architect kernel: md: syncing 
RAID array md3
/var/log/messages.4: Jan 24 04:22:06 sbgrid-dev-architect kernel: md: syncing 
RAID array md0
/var/log/messages.4: Jan 24 04:22:06 sbgrid-dev-architect kernel: md: syncing 
RAID array md3

That's a CentOS 5.4 x86_64 box.

Ours are, too.

So far, then, it's looking like every Sunday at 4:22 all the RAIDs
(all types or just RAID1?) in standard x86_64 CentOS5.4 (and RHAT?)
boxes are broken and then resync'd.  This is presumably unnecessary
and unintentional.  The harm is that until the resync operations
complete (large devices can take hours) the filesystems on those
RAIDs are essentially as vulnerable to HW faults as they'd be on any
single disk.  (Interactive responsiveness is usually significantly
reduced, as well - important in cases such as ours with customers
active at all hours, but maybe less so in a 9-to-5 environment).

We'll probably disable that helpful weekly script on our machines
until we have a better handle on this (or a fix).
 
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good (fwd)

2010-02-22 Thread Michael Bilow

On 2010-02-22 at 13:39 -0500, Michael ODonnell wrote:




Ruh-rohhh

/var/log/messages:   Feb 21 04:22:02 sbgrid-dev-architect kernel: md: 
syncing RAID array md0
/var/log/messages:   Feb 21 04:22:02 sbgrid-dev-architect kernel: md: 
syncing RAID array md3
/var/log/messages.1: Feb 14 04:22:02 sbgrid-dev-architect kernel: md: 
syncing RAID array md2
/var/log/messages.1: Feb 14 04:22:02 sbgrid-dev-architect kernel: md: 
syncing RAID array md0
/var/log/messages.1: Feb 14 04:22:02 sbgrid-dev-architect kernel: md: 
syncing RAID array md3
/var/log/messages.2: Feb 7  04:22:01 sbgrid-dev-architect kernel: md: 
syncing RAID array md0
/var/log/messages.2: Feb 7  04:22:01 sbgrid-dev-architect kernel: md: 
syncing RAID array md3
/var/log/messages.3: Jan 31 04:22:02 sbgrid-dev-architect kernel: md: 
syncing RAID array md2
/var/log/messages.3: Jan 31 04:22:02 sbgrid-dev-architect kernel: md: 
syncing RAID array md0
/var/log/messages.3: Jan 31 04:22:02 sbgrid-dev-architect kernel: md: 
syncing RAID array md3
/var/log/messages.4: Jan 24 04:22:06 sbgrid-dev-architect kernel: md: 
syncing RAID array md0
/var/log/messages.4: Jan 24 04:22:06 sbgrid-dev-architect kernel: md: 
syncing RAID array md3


That's a CentOS 5.4 x86_64 box.


Ours are, too.

So far, then, it's looking like every Sunday at 4:22 all the RAIDs
(all types or just RAID1?) in standard x86_64 CentOS5.4 (and RHAT?)
boxes are broken and then resync'd.  This is presumably unnecessary
and unintentional.  The harm is that until the resync operations
complete (large devices can take hours) the filesystems on those
RAIDs are essentially as vulnerable to HW faults as they'd be on any
single disk.  (Interactive responsiveness is usually significantly
reduced, as well - important in cases such as ours with customers
active at all hours, but maybe less so in a 9-to-5 environment).

We'll probably disable that helpful weekly script on our machines
until we have a better handle on this (or a fix).


Note that Debian has something similar, although monthly:

# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft madd...@madduck.net
# distributed under the terms of the Artistic Licence 2.0
#
# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ] && 
/usr/share/mdadm/checkarray --cron --all --quiet


Note, however, that checkarray is not a real resynchronization of the same 
kind that occurs when bringing an array out of degraded mode, and data are not 
at risk in the same way. On the other hand, if something interrupts 
checkarray then it is possible for the array to be left in degraded mode, and 
this was the subject of a bug I filed against Debian's mdadm package a while 
ago:


http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=563602

I'm not entirely happy with the maintainer's call that my problem was local, 
but I can't prove otherwise. In the meantime, I've changed the script so that 
checkarray is run on Monday morning at 0757 instead of Sunday at 0057 in 
order to avoid conflict with Debian's weekly log rotation.


It's a matter of opinion whether it is better to risk running checkarray once 
a month for a few hours or to risk never running checkarray and having data 
errors creep into an array. My view is that, while the md code in Linux is 
quite solid, intermittent hardware problems, especially with failing RAM, will 
often be exposed by invocations of checkarray that might otherwise be missed 
until they grow into catastrophic failures, and therefore it is better to do it 
than not do it.


-- Mike
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-22 Thread Benjamin Scott
On Mon, Feb 22, 2010 at 10:06 AM, Michael ODonnell
michael.odonn...@comcast.net wrote:
 Anybody else running CentOS5.x ...

  liberty.gnhlug.org is running CentOS 5.4 with kernel
2.6.18-92.1.22.el5 (still haven't rebooted it).

liberty$ fgrep -i sync /var/log/kernel* | fgrep -i raid
/var/log/kernel:Feb 21 04:22:01 liberty kernel: md: syncing RAID array md0
/var/log/kernel.1:Feb 14 04:22:01 liberty kernel: md: syncing RAID array md0
/var/log/kernel.2:Feb  7 04:22:01 liberty kernel: md: syncing RAID array md0
/var/log/kernel.3:Jan 31 04:22:01 liberty kernel: md: syncing RAID array md0
/var/log/kernel.4:Jan 24 04:22:01 liberty kernel: md: syncing RAID array md0
liberty$

  It appears to be doing that at 4:22 AM (US Eastern) every Sunday,
for device md0.  4:22 AM is also the timestamp on those email messages
about mismatch_cnt I keep getting.

  Interestingly, the md1 name does not appear in the log files.  (To
review: md0 is the /boot/ partition, which is small and quiet.  md1 is
the everything-else LVM PE partition, which hosts swap space and all
other filesystems.)

 It looks like the RAIDs on at least seven of our (mostly stock) CentOS5.4
 systems are routinely getting broken and going through a resync operation
 on a weekly basis at 4:22am which is when that /etc/cron.weekly script
 runs that's generating the mismatch_cnt warnings in question.

  I believe the script is /etc/cron.weekly/99-raid-check.  It
apparently uses a config file /etc/sysconfig/raid-check, which is
well-commented.

  The control flow of the script seems to be: The operation is only
run if the array is in a clean and idle state.  If the array is
degraded or rebuilding, the operation is skipped for that array.  The
default operation is check, not repair.

  The default operation can be either a check or repair, as
specified by CHECK=; on liberty, it is check.  Both operation
types presumably scan all blocks on all members.  In the world of
RAID, this is often called patrol read.  It's a good thing.  Disks
tend to be full of files which are rarely read.  If one of those files
develops a bad sector, the disk won't notice until you try to read it.
 Then when a different disk dies and the RAID subsystem tries to
rebuild from the remaining member(s), you discover your redundant
disks weren't as redundant as you would have liked.

  Assuming the comments in the config file are accurate, the check
operation will only attempt to repair bad sectors.  Exactly how bad
sectors are detected isn't explained, but I presume it means could
not read the block from one member device.  Exactly what repair
means isn't explained, but I presume it means write the block from
the good device to the other device.  (This is a good thing -- any
hard disk manufactured within the past 20 years or so will remap a bad
block once it is written to.  And modern hard disks are virtually
guaranteed to have bad blocks.)

  Again assuming the comments are accurate, check will *not* attempt
to repair mismatches.  Mismatches are when all member devices could be
read, but the data is not consistent across devices.  This is what
mismatch_cnt reportedly reflects.  repair, on the other hand, will
attempt to make the array consistent.  How the repair will choose
which data to keep is not explained, but the phrase luck of the draw
is used.
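
For reference, both operations are driven through the same sysfs knob the
script uses; a sketch:

   echo check  > /sys/block/md0/md/sync_action   # report mismatches only
   echo repair > /sys/block/md0/md/sync_action   # rewrite to force consistency
   cat /sys/block/md0/md/mismatch_cnt            # count from the last pass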

  You can set ENABLED=no in the config file to disable the whole
thing, but before you do that, see above about patrol read.  If you
think patrol read is a bad idea, you're probably wrong.

  So!  Having done what I should have done in the first place (RTFM
and RTFS), I know now why the problem was detected, and what my
options for remediation are.  That leaves How did the mismatch occur
in the first place? as my remaining question.

  Based on what I'm seeing (in particular, the mismatch *only* being
in the GRUB stage2 file), I'm going to conclude liberty's mismatch is
due to GRUB being installed on both physical hard disks independently
(booting from floppy).  Whether or not that's the right way to
install GRUB is an open question.

  Assuming that is a valid way to install GRUB, the system should be
fine, including for kernel updates, until and unless one mirror member
fails.  Any updates to kernel files will write to blocks other than
the GRUB stage2 file, and be properly mirrored.  But if a mirror
member dies, then once that bad disk is replaced, the system will copy
the good mirror to the new disk, including the wrong copy of GRUB.

-- Ben
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-22 Thread Benjamin Scott
On Mon, Feb 22, 2010 at 1:39 PM, Michael ODonnell
michael.odonn...@comcast.net wrote:
 So far, then, it's looking like every Sunday at 4:22 all the RAIDs
 (all types or just RAID1?) in standard x86_64 CentOS5.4 (and RHAT?)
 boxes are broken and then resync'd.

  All types (as I interpret the script source).

  If the documentation is to be believed, they are not being broken;
they are being checked for consistency.  Not the same thing.  Breaking
and rebuilding leaves the array vulnerable during the rebuild, as you
note.  A consistency check just compares the supposedly identical
members to confirm they really *are* identical, and warns you if
they are not.

  What I find interesting is that I'm not getting log messages from
the kernel about liberty's md1 device -- only md0.  I can think of
two possible reasons for that: (R1) The kernel only logs the message
if a mismatch is discovered.  (R2) The check is not being run on md1
on liberty for some reason.

  If R1 is the case, that implies your system has mismatches across
several arrays, which I would think is a bad sign.

  If R2 is the case, I'd like to know why, and fix it so it works.

 Interactive responsiveness is usually significantly reduced, as well ...

  With a good RAID implementation, I/O for patrol reads is done when
the array is idle.  (Kind of like nice 19 for I/O.)  I don't know if
Linux does this or not.
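
Linux md does at least let you throttle resync/check bandwidth, even if it
isn't strictly idle-time I/O; the knobs are sysctls, in KB/sec per device:

   cat /proc/sys/dev/raid/speed_limit_min    # floor (default 1000)
   cat /proc/sys/dev/raid/speed_limit_max    # ceiling (default 200000)
   sysctl -w dev.raid.speed_limit_max=50000  # e.g. cap a check's impact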

 We'll probably disable that helpful weekly script on our machines
 until we have a better handle on this (or a fix).

  You may want to determine if you've got mismatches or not before
disabling the script.  It could be it just alerted you to trouble
before it became a disaster.

-- Ben
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-22 Thread Benjamin Scott
On Mon, Feb 22, 2010 at 3:06 PM, Benjamin Scott dragonh...@gmail.com wrote:
  I believe the script is /etc/cron.weekly/99-raid-check.
...
  The control flow of the script seems to be: The operation is only
 run if the array is in a clean and idle state.  If the array is
 degraded or rebuilding, the operation is skipped for that array.  The
 default operation is check, not repair.

  P.S.: I should also mention that if an operation is run, the script
goes into a poll/sleep loop, where it checks the status of each array
that was told to repair/check.  If an array is not idle, it sleeps for
3 seconds and repeats the poll.  Once all arrays are idle, it checks
the mismatch_cnt on each array, and reports any which are non-zero.
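
The same control flow fits in a few lines of shell; a minimal sketch of
the idea (not the actual Red Hat script):

   for md in /sys/block/md*/md; do echo check > "$md/sync_action"; done
   while grep -qv idle /sys/block/md*/md/sync_action; do sleep 3; done
   for md in /sys/block/md*/md; do
       cnt=$(cat "$md/mismatch_cnt")
       [ "$cnt" -ne 0 ] && echo "WARNING: mismatch_cnt is $cnt on ${md%/md}"
   done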

-- Ben

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-22 Thread Bill McGonigle
On 02/22/2010 03:06 PM, Benjamin Scott wrote:

Based on what I'm seeing (in particular, the mismatch *only* being
 in the GRUB stage2 file), I'm going to conclude liberty's mismatch is
 due to GRUB being installed on both physical hard disks independently
 (booting from floppy).  Whether or not that's the right way to
 install GRUB is an open question.

Did you ever run a 'repair' on the /boot mirror?  I'm wondering:

1) what's different about the two stage2's?
2) should they be different?
2a) if not, does it matter if they are?
2b) if so, do /boot's that are in sync have a problem?

-Bill

-- 
Bill McGonigle, Owner
BFC Computing, LLC
http://bfccomputing.com/
Telephone: +1.603.448.4440
Email, IM, VOIP: b...@bfccomputing.com
VCard: http://bfccomputing.com/vcard/bill.vcf
Social networks: bill_mcgonigle/bill.mcgonigle
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-22 Thread Benjamin Scott
On Mon, Feb 22, 2010 at 5:51 PM, Bill McGonigle b...@bfccomputing.com wrote:
 Did you ever run a 'repair' on the /boot mirror?

  No.  I've been avoiding that until I have a better understanding of
what's going on.  Heck, I've been avoiding *rebooting* until I have a
better understanding of what's going on.  :)  (Thank goodness this
isn't a Windows box!  ;-)  )

liberty$ uptime
 18:08:11 up 364 days, 18:13,  1 user,  load average: 0.22, 0.08, 0.06
liberty$

  Hmmm.  And now that I see that, I can't reboot before tomorrow.  ;-)

 1) what's different about the two stage2's?

liberty$ sudo umount /boot
liberty$ sudo mdadm --stop /dev/md0
liberty$ sudo mount -t ext2 -o ro /dev/sda1 /mnt/sda1
liberty$ sudo mount -t ext2 -o ro /dev/sdb1 /mnt/sdb1
liberty$ diff -u <( od -c /mnt/sda1/grub/stage2 ) <( od -c
/mnt/sdb1/grub/stage2 )
--- /dev/fd/63  2010-02-22 18:05:49.290275334 -0500
+++ /dev/fd/62  2010-02-22 18:05:49.292275237 -0500
@@ -22,8 +22,8 @@
 *
 760   [ 360  \r  \0 265  \0  \0  \v   B 360  \r  \0 027  \0  \b
 0001000 352   p 202  \0  \0  \0 003 002 377 377  \0  \0  \0  \0  \0  \0
-0001020  \0  \0   0   .   9   7  \0   (   h   d   0   ,   0   )   /   g
-0001040   r   u   b   /   g   r   u   b   .   c   o   n   f  \0  \0  \0
+0001020  \0  \0   0   .   9   7  \0   /   g   r   u   b   /   g   r   u
+0001040   b   .   c   o   n   f  \0   b   .   c   o   n   f  \0  \0  \0
 0001060  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
 *
 0001160 372   1 300 216 330 216 320 216 300   g   f 211   -   $ 220  \0
liberty$ sudo umount /mnt/sda1 /mnt/sdb1
liberty$ sudo mdadm --assemble /dev/md0
mdadm: /dev/md0 has been started with 2 drives.
liberty$ sudo mount /boot
liberty$

 2) should they be different?

  Well, I think they should be different, based on how GRUB was
installed (individually, on each hard disk, booted from floppy).  One
should try to boot from the first BIOS fixed disk, the other should
try to boot from the second BIOS fixed disk.

  However, looking at the difference above, I think they're different
*in the wrong way*.  It looks like one of the disks specifies
(hd0,0)/grub/grub.conf while the other just specifies
/grub/grub.conf.  That doesn't seem right.

  Although I *did* test booting from either disk before deploying this
system.  Although again... there was a lot going on at the time.
time.  Maybe I screwed up my tests, too?  Hmmm.

 2a) if not, does it matter if they are?
 2b) if so, do /boot's that are in sync have a problem?

  If GRUB was installed from Linux, against the RAID device (/dev/md0
or whatever), then I would think the two members should be identical,
because the RAID driver in the kernel would have written the same
blocks to both members.

-- Ben
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-21 Thread Shawn O'Shea
I know I'm resurrecting an old thread here, but I just saw a post in Planet
CentOS that seems to have some info on fixing the mismatch_cnt is not 0
error.  Take a look at this blog post, where the author suggests some md
actions that can be taken to clear these errors:
http://www.arrfab.net/blog/?p=199

-Shawn

On Sun, Nov 1, 2009 at 10:00 PM, Ben Scott dragonh...@gmail.com wrote:

  CentOS 5.4.  Running kernel is 2.6.18-92.1.22.el5.  The system has
 two disks, each with two partitions, making up two md mirror devices.
 md0 is ~ 509 MB and holds /boot; md1 is ~ 69 GB (the rest of the disk)
 and holds an LVM PE.  The following arrived in my mailbox today:

 On Sun, Nov 1, 2009 at 4:22 AM, Cron Daemon r...@liberty.gnhlug.org
 wrote:
  /etc/cron.weekly/99-raid-check:
 
  WARNING: mismatch_cnt is not 0 on /dev/md0

  Investigation finds:

 /proc/mdstat reports everything is peachy for both mirrors.  [2/2] [UU]

 Under /sys/block/md0/md/ I find the following:

array_state: clean
mismatch_cnt: 256
rd{0,1}/errors: 0
rd{0,1}/state: in_sync

  Google finds lots of people reporting similar, but nothing
 conclusive or particularly pertinent to this situation.  Lots of
 people saying that swap can cause this (because swap can commit a
 block to one member, then learn it won't ever re-read that block, and
 so won't bother committing the other member), but this is the /boot
 filesystem, not swap.  (swap is in an LV; the md device backing that
 LVM's sole PE reports a mismatch_cnt of zero.)

  I did find some people saying this started happening after CentOS
 5.3 -> 5.4.  I did do that recently.  One person said the raid-check
 was added in 5.4.  So I presume this mismatch_cnt might have been
 non-zero for ages, and I just never knew to look before now.
 mdmonitor has been running, but it mainly reports if a RAID member
 goes offline, and as noted, md is reporting all's quiet on the western
 front.

  I tried dismounting the /boot filesystem and running some tests.
 (Since it's a separate partition and md device, and outside of LVM, I
 can poke at it without taking the system down.)

  e2fsck -f -n says /dev/md0 is okay.

  I tried stopping the RAID device with mdadm --stop /dev/md0, then
 sync'ing disks.  Then I ran cmp /dev/sda1 /dev/sdb1.  The result:

/dev/sda1 /dev/sdb1 differ: byte 331875867, line 215880

  So the two mirror members are **NOT** identical.  That's usually bad.

  Running e2fsck -f -n on each member says no trouble found.  That
 implies whatever the mismatch is, it is not in filesystem metadata.

  Running a badblocks read-only test on each member says no read errors.

  mdadm says the MD superblocks are okay, and comparing the two finds
 most things are the same -- only the checksum and device relationships
 differ (expected).

  One nice thing about simple mirrors is that you can mount the
 members read-only and examine the contents without breaking the mirror
 set.  So:

liberty$ sudo mount -o ro -t ext2 /dev/sda1 /mnt/sda1
liberty$ sudo mount -o ro -t ext2 /dev/sdb1 /mnt/sdb1
liberty$ sudo diff -r sda1 sdb1
Binary files sda1/grub/stage2 and sdb1/grub/stage2 differ
liberty$

  (You have to mount as ext2 because ext3 will replay a journal even
 if you said read-only.)

  It may be normal for the GRUB stage2 to differ in this
 configuration.  There may be device numbers encoded into them.  GRUB
 was installed on each disk separately, by booting from floppy, so that
 would do it.  Or it could be one disk has an undetected bad block and
 the boot loader on that disk is shot.

  No other differences detected in file data, though.  So between fsck
 and diff, it looks like most of the contents are intact.  Maybe all of
 them.

  I'm unsure as to how to proceed.

  The general procedure for repairing a broken mirror is to resync
 from the good member, assuming you can determine which is good.  My
 problem is, I'm not sure which is the good member, or even if there
 *is* a good member: If GRUB writes different device numbers into the
 boot stage files, the two disks necessarily won't match.  Which, come
 to think of it, is probably something to worry about, since a legit
 mirror resync will scrogg that.

  smartctl -a reveals something that may be relevant.  sda reports
 several non-zero values in the Error counter log section.  No
 uncorrectable errors, but ECC has been used.  At the same time, sdb
 reports all zeros for those same values.  Further, the counts for sda
 have increased since the disks were installed.  (I saved the output of
 smartctl -a back then.  Now you see why.)  Now, ECC usage is not an
 automatic cause for alarm on a modern hard disk, but the fact that sda
 is non-zero and increasing while sdb is zero and flat suggests sdb is
 in better overall health.  However, this probably has nothing to do
 with the mirror mismatch, since both disks report zero *uncorrectable*
 errors.  Uncorrectable media defects 

Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-21 Thread Benjamin Scott
On Sun, Feb 21, 2010 at 4:26 PM, Shawn O'Shea sh...@eth0.net wrote:
 I know I'm resurrecting an old thread here, but I just saw a post in Planet
 CentOS that seems to have some info on fixing the mismatch_cnt is not 0
 error.  Take a look at this blog post, where the author suggests some md
 actions that can be taken to clear these errors:
 http://www.arrfab.net/blog/?p=199

  Right.  In short, run a repair on the mirrors.  I saw someone give
that advice before.  My issue is, the only way I know of for a simple
mirror to be repaired is to arbitrarily declare one of the members the
good copy, and copy it all to the other member.  Is that what the
kernel RAID driver does?  If so, how does it decide which member is
the good copy?  Is there any way to influence that decision?
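
  For anyone who doesn't want to click through, the post boils down
to something like this (a sketch of the stock md sysfs interface; md0
is just this box's device name):

    # as root: rewrite any inconsistent blocks across the members
    echo repair > /sys/block/md0/md/sync_action
    # watch /proc/mdstat until the pass finishes, then verify
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt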

  And none of this answers the questions of how it happened in the
first place, nor how it was detected.

  If this is due to GRUB being installed individually on to each
mirror member (by booting from floppy), then I suspect copying one
mirror member to the other is actually the *wrong* thing to do, since
you'll clobber the second disk's unique GRUB installation with the one
from the first disk (presumably, the first disk's GRUB won't work on
the second disk).

-- Ben


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-21 Thread Ben Eisenbraun
On Sun, Feb 21, 2010 at 05:01:21PM -0500, Benjamin Scott wrote:
   And none of this answers the questions of how it happened in the
 first place, nor how it was detected.

From the horse's mouth:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919#41

-ben

--
you'll find that the only thing you can do easily is be wrong, and
that's hardly worth the effort.  -- norton juster


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-21 Thread Michael ODonnell


 From the horse's mouth:

 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919#41

I think I posted that same link (the incorrect Beatles quote looks
familiar) when this thread was last active.  IIRC, we agreed that,
although it was somewhat comforting that the author believed there was
no (additional) cause for immediate concern, it didn't really explain
the situation in our cases, as our RAIDs contain no swap storage.
 


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-21 Thread Benjamin Scott
On Sun, Feb 21, 2010 at 5:56 PM, Ben Eisenbraun b...@klatsch.org wrote:
   And none of this answers the questions of how it happened in the
 first place, nor how it was detected.

 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919#41

  The md device in question does not contain a swap device.  Link also
does not explain how it was detected.  Nor why /proc/mdstat says
things are good when something else says not.

  Since the original thread was four months ago, it's not surprising
you're unaware of this.  For those just joining us, you may want to
read the archives:

http://thread.gmane.org/gmane.org.user-groups.linux.gnhlug/18434

  :-)

-- Ben



Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2010-02-21 Thread Ben Eisenbraun
   Since the original thread was four months ago, it's not surprising
 you're unaware of this.

Nods, I tripped myself up.

I just had this same conversation ("WTF is a mismatch_cnt and does it
matter?") a few weeks ago, and that link was in my browser history, so I
easily recalled it.

When Michael pointed out that Neil's commentary had already been brought
up, I had to grep through the memory banks (i.e. email, IRC and AIM logs,
browser history, etc) to figure out that...  I had the conversation with a
totally different group of people.

That's what I get for posting after a long day of work.  :P

-ben

--
when the going gets weird, the weird turn pro.  -- hunter thompson


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2009-11-19 Thread Michael ODonnell


Saw the warning in question today on a CentOS 5.4 box, so I STFW and found these:

   http://forum.nginx.org/read.php?24,16699

   http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919

...with a plausible (and somewhat comforting) nuts & bolts level
explanation toward the bottom of the latter.  Summary: annoying but
apparently not harmful.
 


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2009-11-19 Thread Ben Scott
On Thu, Nov 19, 2009 at 5:48 PM, Michael ODonnell
michael.odonn...@comcast.net wrote:
 ...with a plausible (and somewhat comforting) nuts & bolts level
 explanation toward the bottom of the latter.  Summary: annoying but
 apparently not harmful.

  As I understand it, the explanation in question applies to swap
partitions only.  (Swap is somewhat unique in that the kernel can
know a given block on disk will never be read again.)

  For liberty.gnhlug.org, the partition in question is an EXT3
filesystem.  Interestingly enough, the LVM PV which contains an LV
with swap on it is not reporting a non-zero mismatch_cnt (yet).
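
  (Easy enough to keep an eye on every array at once.  A trivial
sketch:)

    # print the current mismatch count for each md device on the system
    for f in /sys/block/md*/md/mismatch_cnt; do
        echo "$f: $(cat "$f")"
    done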

-- Ben



Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2009-11-19 Thread Michael ODonnell


 As I understand it, the explanation in question applies to swap
 partitions only.  (Swap is somewhat unique in that the kernel can
 know a given block on disk will never be read again.)

Yah, you're referring to the case where a page that's just been written
out is immediately dirtied again, so those freshly written disk data are
abandoned, since they're no longer the most recent bits representing
the page in question.  He indicates he thinks this might also be
possible for memory-mapped files, but now that I've read it again I
confess I don't understand his follow-up:

 This can conceivably happen without swap being part of the picture,
 if memory mapped files are used.  However, in this case it is less
 likely to remain out-of-sync, and dirty file data will be written
 out soon, whereas there is no guarantee that dirty anonymous
 pages will be written to swap in any particular hurry, or at any
 particular location.

...so I guess the only thing we know for sure is that these guys know the
problem exists and don't believe it to be dire.  FWIW, as in your case
the MD device this was reported against is not engaged in any swapping.



Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2009-11-19 Thread Ben Scott
On Thu, Nov 19, 2009 at 7:13 PM, Michael ODonnell
michael.odonn...@comcast.net wrote:
 Yah, you're referring to the case where a page that's just been written
 out is immediately dirtied again so those freshly written disk data are
 abandoned since they're no longer the most recent bits representing
 the page in question.

  Right:

A1. Kernel decides a memory page is LRU and begins to swap it
A2. Swap file block on one mirror member is written
A3. In-memory page is written to (dirtied)

  Once A3 happens, the page is no longer LRU, and thus no longer a
candidate for swapping.  The kernel doesn't have anything to write to
the second mirror member, since the page changed.  And there's no
point in writing that page to both disks just for the sake of keeping
the mirror consistent.  That block's contents will never be read; it's
invalidated swap.  If the system does need that block for swapping
again, it will have to write the new page contents to both mirror
members anyway.

  He indicates that he thinks this might also
 be possible for memory mapped files but now that I've read it again I
 confess I don't understand his follow-up ...

  Yah.  Swap is special because the kernel can know a block will never
be read again.  For a regular file, it's a lot harder to make that
determination.

  The only things I can think of would be ftruncate(2) and unlink(2).
Maybe there's a lot of dirty blocks (queued writes) in the process of
being written, when the file is truncated or unlinked.  The kernel now
knows those blocks will never be read, so it doesn't worry that it
started writing them but didn't finish.

  But that's just a guess.  It'd be nice to hear from someone who
actually knows what they're talking about.

 ...so I guess the only thing we know for sure is that these guys know the
 problem exists and don't believe it to be dire.

  Morton Thiokol didn't think cold O-rings would be a dire problem,
either... :-/

 FWIW, as in your case the MD device this was
 reported against is not engaged in any swapping.

  Not only that, but for liberty.gnhlug.org, at least, the partition
in question is the boot partition (/boot/).  There's very little on
there -- a couple kernel images, basically.  I wouldn't think the
truncate/unlink scenarios would ever happen to those files.

-- Ben



Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2009-11-03 Thread Bill McGonigle
GRUB shouldn't make /dev/md0 inconsistent (though it ought to make the
boot sectors of the two drives inconsistent).

  grub-install /dev/md0

should re-generate the stage2 from /usr/share/grub, so if you drop
either mirror half, re-add it, and re-install GRUB, then things
_should_ be normal.
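
Roughly this, assuming sdb1 is the half you choose to drop (device
names are whatever yours really are):

  # as root: eject one half, then let md rebuild it from the survivor
  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  mdadm /dev/md0 --add /dev/sdb1
  # once /proc/mdstat shows the resync is done, re-install GRUB
  grub-install /dev/md0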

Now, do we know what kind of ECC the drive does?  It sounds like a
multi-bit error wasn't handled (or the ECC electronics are failing in an
infuriating fashion).

-Bill

-- 
Bill McGonigle, Owner
BFC Computing, LLC
http://bfccomputing.com/
Telephone: +1.603.448.4440
Email, IM, VOIP: b...@bfccomputing.com
VCard: http://bfccomputing.com/vcard/bill.vcf
Social networks: bill_mcgonigle/bill.mcgonigle


Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2009-11-03 Thread Ben Scott
On Tue, Nov 3, 2009 at 6:06 PM, Bill McGonigle b...@bfccomputing.com wrote:
 Grub shouldn't make /dev/md0 inconsistent ...

  It might, if you install it by booting from floppy and running setup
on the two disks independently, which is what I did.

  GRUB doesn't understand Linux RAID.  It just happens to be able to
boot a Linux simple mirror because a simple mirror is the same as a
single disk if you're only reading.  Which is what GRUB normally does
-- unless you're installing it.  :)

  I installed it this way for a few reasons.  One is that the GRUB docs
say it's the preferred method, as there's no really reliable way to
map the kernel's idea of your disks to BIOS drive numbers.  Another is
that in the past, I've found the grub-install method didn't actually
work during a failure of the first disk in the mirror set, while
installing from floppy did work (you could boot from either drive).
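
  (For the record, the floppy procedure is roughly this, assuming the
BIOS sees the two disks as hd0 and hd1:)

    grub> root (hd0,0)    # /boot partition on the first disk
    grub> setup (hd0)     # install GRUB to the first disk
    grub> root (hd1,0)    # now the same for the second disk
    grub> setup (hd1)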

  Maybe I relied upon bad/old information?

 Now, do we know what kind of ECC the drive does?  It sounds like a
 multi-bit error wasn't handled (or the ECC electronics are failing in an
 infuriating fashion).

  That's certainly possible, but lacking better data, I'm more
inclined to suspect the GRUB installer.

  The thing that really gets me: How can mismatch_cnt be non-zero,
while everything else is saying the mirror is good?  I would think
mismatched blocks pretty much define the out-of-sync mirror failure
condition.

  Plus, how did md detect the mismatches in the first place?  Does the
md driver periodically run a compare on its own?  Does mdmonitor
periodically trigger one?
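
  My best guess is that the new cron job does something morally
equivalent to this (a sketch, not the actual 5.4 script):

    # as root: start a background read-and-compare pass over the array
    echo check > /sys/block/md0/md/sync_action
    # when /proc/mdstat shows the pass is done, read the result
    cat /sys/block/md0/md/mismatch_cnt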

-- Ben



mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2009-11-01 Thread Ben Scott
  CentOS 5.4.  Running kernel is 2.6.18-92.1.22.el5.  The system has
two disks, each with two partitions, making up two md mirror devices.
md0 is ~ 509 MB and holds /boot; md1 is ~ 69 GB (the rest of the disk)
and holds an LVM PE.  The following arrived in my mailbox today:

On Sun, Nov 1, 2009 at 4:22 AM, Cron Daemon r...@liberty.gnhlug.org wrote:
 /etc/cron.weekly/99-raid-check:

 WARNING: mismatch_cnt is not 0 on /dev/md0

  Investigation finds:

/proc/mdstat reports everything is peachy for both mirrors.  [2/2] [UU]

Under /sys/block/md0/md/ I find the following:

array_state: clean
mismatch_cnt: 256
rd{0,1}/errors: 0
rd{0,1}/state: in_sync

  Google finds lots of people reporting similar, but nothing
conclusive or particularly pertinent to this situation.  Lots of
people saying that swap can cause this (because swap can commit a
block to one member, then learn it won't ever re-read that block, and
so won't bother committing the other member), but this is the /boot
filesystem, not swap.  (swap is in an LV; the md device backing that
LVM's sole PE reports a mismatch_cnt of zero.)

  I did find some people saying this started happening after upgrading
CentOS 5.3 -> 5.4.  I did that upgrade recently.  One person said the
raid-check script was added in 5.4.  So I presume this mismatch_cnt
might have been non-zero for ages, and I just never knew to look
before now.
mdmonitor has been running, but it mainly reports if a RAID member
goes offline, and as noted, md is reporting all's quiet on the western
front.

  I tried dismounting the /boot filesystem and running some tests.
(Since it's a separate partition and md device, and outside of LVM, I
can poke at it without taking the system down.)

  e2fsck -f -n says /dev/md0 is okay.

  I tried stopping the RAID device with mdadm --stop /dev/md0, then
sync'ing disks.  Then I ran cmp /dev/sda1 /dev/sdb1.  The result:

/dev/sda1 /dev/sdb1 differ: byte 331875867, line 215880

  So the two mirror members are **NOT** identical.  That's usually bad.
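
  (The full sequence, in case anyone wants to reproduce it -- device
names are this box's:)

    umount /boot                  # quiesce the filesystem first
    mdadm --stop /dev/md0         # release both members from md
    sync                          # flush anything still buffered
    cmp /dev/sda1 /dev/sdb1       # raw byte-for-byte compare
    mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1   # re-assemble the mirror
    mount /boot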

  Running e2fsck -f -n on each member says no trouble found.  That
implies whatever the mismatch is, it is not in filesystem metadata.

  Running a badblocks read-only test on each member says no read errors.

  mdadm says the MD superblocks are okay, and comparing the two finds
most things are the same -- only the checksum and device relationships
differ (expected).

  One nice thing about simple mirrors is that you can mount the
members read-only and examine the contents without breaking the mirror
set.  So:

liberty$ sudo mount -o ro -t ext2 /dev/sda1 /mnt/sda1
liberty$ sudo mount -o ro -t ext2 /dev/sdb1 /mnt/sdb1
liberty$ sudo diff -r sda1 sdb1
Binary files sda1/grub/stage2 and sdb1/grub/stage2 differ
liberty$

  (You have to mount as ext2 because ext3 will replay a journal even
if you said read-only.)

  It may be normal for the GRUB stage2 to differ in this
configuration.  There may be device numbers encoded into them.  GRUB
was installed on each disk separately, by booting from floppy, so that
would do it.  Or it could be one disk has an undetected bad block and
the boot loader on that disk is shot.

  No other differences detected in file data, though.  So between fsck
and diff, it looks like most of the contents are intact.  Maybe all of
them.

  I'm unsure as to how to proceed.

  The general procedure for repairing a broken mirror is to resync
from the good member, assuming you can determine which is good.  My
problem is, I'm not sure which is the good member, or even if there
*is* a good member: If GRUB writes different device numbers into the
boot stage files, the two disks necessarily won't match.  Which, come
to think of it, is probably something to worry about, since a legit
mirror resync will scrogg that.

  smartctl -a reveals something that may be relevant.  sda reports
several non-zero values in the Error counter log section.  No
uncorrectable errors, but ECC has been used.  At the same time, sdb
reports all zeros for those same values.  Further, the counts for sda
have increased since the disks were installed.  (I saved the output of
smartctl -a back then.  Now you see why.)  Now, ECC usage is not an
automatic cause for alarm on a modern hard disk, but the fact that sda
is non-zero and increasing while sdb is zero and flat suggests sdb is
in better overall health.  However, this probably has nothing to do
with the mirror mismatch, since both disks report zero *uncorrectable*
errors.  Uncorrectable media defects would certainly cause a mirror
mismatch, but the drives think they've been able to handle everything
so far.

  There are newer kernels available; the system hasn't been rebooted
in 251 days.  But I'm somewhat loath to try rebooting with /boot in a
suspect state.

  The thing I find really confusing is why mismatch_cnt can be
non-zero while the rest of the in-kernel md monitoring stuff reports
everything is good.

  Anyone here have any ideas?

Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good

2009-11-01 Thread Bruce Dawson
Ben Scott wrote:
   CentOS 5.4.  Running kernel is 2.6.18-92.1.22.el5.  The system has
 two disks, each with two partitions, making up two md mirror devices.
 md0 is ~ 509 MB and holds /boot; md1 is ~ 69 GB (the rest of the disk)
 and holds an LVM PE.  The following arrived in my mailbox today:

 On Sun, Nov 1, 2009 at 4:22 AM, Cron Daemon r...@liberty.gnhlug.org wrote:
   
 /etc/cron.weekly/99-raid-check:

 WARNING: mismatch_cnt is not 0 on /dev/md0
 


Odd, we had a similar report on one of our Ubuntu systems last night. I
suspect there's a date-sensitive bug in there somewhere - possibly
tickled by a counter overflow (that system has been up forever too).

But consider this a wild guess.

I suspect a reboot will produce acceptable results, given the results of
your comparisons.

--Bruce