Re: [opensuse] Re: Hard Disk Failing

G T Smith Sat, 10 Nov 2007 02:15:09 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Eberhard Roloff wrote:
> Randall R Schulz wrote:
>> On Friday 09 November 2007 09:43, Robert Smits wrote:
>>> On Friday 09 November 2007 01:12:31 G T Smith wrote:
>>>> Robert Smits information rather confirms what I have suspected for
>>>> some time about how one should assess a S.M.A.R.T report,
>>>> unfortunately Robert did not give a link for the paper he referred
>>>> to. I would be interested to have a look at it ...
>>> Happy to oblige.....
>>>
>>> http://209.85.163.132/papers/disk_failures.pdf
>> Thanks.
>>
>> Oddly enough, when I went to download that into my publications 
>> directory, I discovered I already had a copy that I downloaded back in 
>> February and which is byte-for-byte identical.
>>
>>
> 
> While this study is great, one should not forget that the google usage
> environment of hundreds of thousands disks is not directly comparable to
> what most people do at work or at home.
> 
> I.e. most people do not work in air-conditioned data centers and most
> desktops do not run 24x7.
> 
> So while the google paper is certainly informative and a rare beast in
> regard to the observation of a very large population of commodity
> harddisks, I would not dare to use any of its conclusions lightly for
> my home usage pattern.
> 
> regards
> Eberhard
>


Thanks Rob for the link...

The paper is extremely useful but possibly flawed.

Eberhard may have missed a couple of points that are probably relevant
to home usage. The most important is that there seems to be a slight
increase in failure rate if the drive has light usage. The other is the
failure pattern, with the quaintly labelled 'infant mortality', in which
drives are more likely to fail early in use or when the drive is getting
on a bit (but the latter is more of a confirmation of what I expect most
of us know already).

The most difficult problem with the paper is the definition of failure.
S.M.A.R.T. mainly reports on the media access status, not so much on the
reliability of the electronics controlling that media. One of the most
significant events in recent times was a recall of a large number of a
particular manufacturer's drives due to poor quality of the latter, so
the failure to distinguish between media failure and electronic failure
is problematic. As this is a difficult problem to handle one cannot
fault the authors for this, but one does need to take it into
consideration when considering their results.

Four parameters are identified as being critical, but the concentration
on annualized failure rate without analysis of mean time to failure
weakens the analysis somewhat. There is also an issue in that they
report on survival rates after the first event but do not report on
secondary events. Survival rates for drives that had no subsequent
failure reports would have been useful. I would make a similar
observation on the various sector error counts that they examine. The
mean time to failure statistic is also possibly more useful to those
dealing with a small quantity of drives.
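
To make the relation between the two statistics concrete, here is a
minimal sketch in Python. It assumes a constant failure rate (the
exponential model) and independent drives. Neither assumption comes from
the paper, and the first sits awkwardly with the 'infant mortality'
pattern, so treat the output as a rough conversion only. The function
names and the 2% figure are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365, i.e. a drive powered on continuously


def afr_to_mttf_hours(afr):
    """Convert an annualized failure rate (0 < afr < 1) to MTTF in hours.

    Assumes a constant failure rate, so AFR = 1 - exp(-8760 / MTTF).
    """
    if not 0.0 < afr < 1.0:
        raise ValueError("afr must be strictly between 0 and 1")
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


def prob_any_failure(afr, drives, years=1.0):
    """Chance that at least one of `drives` fails within `years`.

    Assumes drives fail independently of each other.
    """
    per_drive_survival = (1.0 - afr) ** years
    return 1.0 - per_drive_survival ** drives


# Hypothetical example: a 2% AFR and a home setup with two drives.
example_afr = 0.02
print(f"MTTF: {afr_to_mttf_hours(example_afr):.0f} hours")
print(f"P(any failure in 3 years): {prob_any_failure(example_afr, 2, 3):.3f}")
```

For someone with two or three drives the second function is arguably
the more useful view, since it answers "how likely am I to lose a drive
in the next few years" rather than describing a large population.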

I think the most interesting part is the conclusion that the S.M.A.R.T.
indicators on their own are probably nearly useless in predicting the
failure or survival of an individual drive, and that there are really
only four values one should take notice of. (This does not mean do not
use S.M.A.R.T., it means take S.M.A.R.T. for what it is, a useful tool
for flagging a potential problem). If you are seeing a S.M.A.R.T. error
but the file systems on the drive pass all integrity tests, there is a
fairly good chance this is a false (or non-critical) positive, but one
should monitor the situation and, if the values change adversely, take
appropriate action. (In other words, DON'T PANIC!)
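
The "monitor the situation" step can be sketched as a comparison of two
snapshots of raw attribute counts. This is only an illustration: the
attribute names are the ones smartmontools prints, but the choice of
which ones to watch is an assumption (a rough guess at the reallocation
and pending-sector style counters), and the values are hypothetical ones
typed in by hand rather than read from a drive.

```python
# Attribute names as printed by smartmontools. Which ones to watch is an
# assumption made for this sketch, not a list taken from the paper.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def worsened(previous, current, watched=WATCHED):
    """Return {name: (old, new)} for watched raw counts that increased.

    Attributes missing from a snapshot are treated as a count of 0.
    """
    changes = {}
    for name in watched:
        old = previous.get(name, 0)
        new = current.get(name, 0)
        if new > old:
            changes[name] = (old, new)
    return changes


# Hypothetical raw values copied from two reports a week apart.
last_week = {"Reallocated_Sector_Ct": 3, "Current_Pending_Sector": 0}
today = {"Reallocated_Sector_Ct": 3, "Current_Pending_Sector": 2}

for name, (old, new) in worsened(last_week, today).items():
    print(f"{name}: {old} -> {new}")
```

A non-zero but stable count produces no output here, which matches the
idea that a value that does not change is less worrying than one that
keeps growing.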

My conclusion is that this only emphasises the need for a good backup
strategy, preferably with two independent approaches if one feels that
paranoid. Also, for a good guarantee of the integrity of the data you
wish to back up, invest in at least dual-drive RAID 1 to ensure that what
you back up is not affected by hardware issues. (No guarantee against
software SNAFUs of course).

Of the S.M.A.R.T. reports, a scan error is probably the one of most
concern. Drives with sector allocation related errors could probably
still be used, if the values do not change, for non-critical testing or
in configurations such as RAID where there is some redundancy.


- --
==============================================================================
I have always wished that my computer would be as easy to use as my
telephone.
My wish has come true. I no longer know how to use my telephone.

Bjarne Stroustrup
==============================================================================

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFHNYSUasN0sSnLmgIRAoDEAKDZZoyrog1irAGP7NB/ZUB/zDp6wgCfeDlL
DNhJ2hGbqSNBbZGosXekqU8=
=Jgu7
-----END PGP SIGNATURE-----
-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
