On 3 Mar 2010, at 14:00, Mark Knecht wrote:
On Wed, Mar 3, 2010 at 4:24 AM, Stroller <strol...@stellar.eclipse.co.uk > wrote:
There seem to have been a few people posting with filesystem corruption in the last week or two. It seems to be my turn, so I hope it isn't contagious. The cause here is quite clear - whilst rummaging in the server cupboard
yesterday, power to the machine was accidentally disconnected.
...
 Sorry for your problems. I've had a rash of machine problems over
the last 6 weeks. No fun. I feel for you.

  In my most recent case what looked like a simple disk corruption
problem was really a prelude to the drive just plain going bad. Have
you tried smartctl to see what it says about the drive at this point?

  It would be even more frustrating to chroot in, do all the work,
think you had it fixed and then the underlying foundation of your
house crumbles beneath you 3 weeks from now.

I don't think this is a problem. I would love to know what others think of the `smartctl` output:


r...@sysresccd /root % smartctl -H /dev/sda
smartctl version 5.38 [i486-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Seconds        0x0012   001   001   020    Old_age   Always   FAILING_NOW 44803h+12m+16s

r...@sysresccd /root % smartctl -i /dev/sda
smartctl version 5.38 [i486-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Fujitsu MPA..MPG series
Device Model:     FUJITSU MPF3204AT
Serial Number:    05030567
Firmware Version: 0028
User Capacity:    20,496,236,544 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  ATA/ATAPI-5 T13 1321D revision 1
Local Time is:    Wed Mar  3 14:14:31 2010 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

r...@sysresccd /root %


This looks to me like smartctl is going "OMG! What an ancient drive!" - it's a 20gig EIDE drive and if my pocket calculator is correct (44803/24/365), it's seen 5 years of active use - and that's the "marginal attribute" referred to.
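
For anyone who wants to check the pocket-calculator sum, here is a minimal sketch that turns the RAW_VALUE string from the output above into years. The function name `power_on_years` is made up for illustration, and it assumes the raw value always has the `NNNh+NNm+NNs` shape shown here, which other drives may not follow.

```python
import re

def power_on_years(raw: str) -> float:
    """Convert a raw value such as '44803h+12m+16s' to years of power-on time."""
    # Assumption: the raw value is always hours, minutes and seconds joined by '+'.
    match = re.fullmatch(r"(\d+)h\+(\d+)m\+(\d+)s", raw)
    if match is None:
        raise ValueError(f"unexpected raw value: {raw!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    total_hours = hours + minutes / 60 + seconds / 3600
    # Same arithmetic as in the text: hours / 24 / 365.
    return total_hours / 24 / 365

years = power_on_years("44803h+12m+16s")
print(f"{years:.2f} years")  # prints "5.11 years"
```

This ignores leap years, as the original division does, so it is only good to a day or so.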

Like I said, the power plug was accidentally pulled on this drive, so I'm inclined to attribute the corruption only to that, not to the drive actually failing.

The drive is in a computer that has rarely been turned off in the last couple of years, and is also in a warm environment, conditions which are ideal. I appreciate the latter seems unintuitive, but in fact studies have shown that drives in somewhat warm environments last longer than those that are cooled.

That it passes the "SMART overall-health self-assessment test" suggests to me that it is chugging away quite happily.

I would have dismissed your concerns were it not for the capitalised "FAILING_NOW" in the output. Like I say, I think this is just smartctl declaring "OMG! this drive is old!", but I open this matter to the list for discussion (should you wish).
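
As far as I understand the smartctl man page, the WHEN_FAILED column is derived purely from comparing the normalised VALUE and WORST columns against THRESH, so a rough sketch of that rule may help the discussion. The function name `when_failed` is hypothetical and this is my reading of the documentation, not code taken from smartmontools.

```python
def when_failed(value: int, worst: int, thresh: int) -> str:
    """Approximate the WHEN_FAILED column from the normalised attribute values."""
    # Assumption: an attribute counts as failed once its normalised
    # value is less than or equal to the threshold.
    if value <= thresh:
        return "FAILING_NOW"
    # The current value is fine, but the worst recorded value was not.
    if worst <= thresh:
        return "In_the_past"
    return "-"

# Figures from the Power_On_Seconds row above: VALUE 001, WORST 001, THRESH 020.
status = when_failed(value=1, worst=1, thresh=20)
print(status)  # prints "FAILING_NOW"
```

On that reading, "FAILING_NOW" here only says that the age counter has dropped below the manufacturer's threshold. The attribute's TYPE is Old_age rather than Pre-fail, which would be consistent with the overall health test still reporting PASSED.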

I think I'm actually nearly ready to migrate off this system. The power was actually pulled as I installed 3 new (to me) rackmount machines in the server cupboard - the plan is to have identical machines running RAID, so that in the case of ANY problems I have spares available. I take nightly backups of the important data on this machine; however, I'd prefer it to run just a couple or a few weeks longer to allow me to migrate at my own leisure.

Stroller.

