RE: [gentoo-user] Hard drive error from SMART

Laurence Perkins Tue, 12 Apr 2022 10:21:36 -0700

> -----Original Message-----
> From: Dale <rdalek1...@gmail.com> 
> Sent: Tuesday, April 12, 2022 10:08 AM
> To: gentoo-user@lists.gentoo.org
> Subject: Re: [gentoo-user] Hard drive error from SMART
> 
> Rich Freeman wrote:
> > On Mon, Apr 11, 2022 at 9:27 PM Dale <rdalek1...@gmail.com> wrote:
> >> Thoughts.  Replace as soon as drive arrives or wait and see?
> >>
> > So, first of all just about all my hard drives are in a RAID at this 
> > point, so I have a higher tolerance for issues.
> >
> > If a drive is under warranty I'll usually try to see if they will RMA 
> > it.  More often than not they will, and in that case there is really 
> > no reason not to.  I'll do advance shipping and replace the drive 
> > before sending the old one back so that I mostly have redundancy the 
> > whole time.
> >
> > If it isn't under warranty then I'll scrub it and see what happens.
> > I'll of course do SMART self-tests, but usually an error like this 
> > won't actually clear until you overwrite the offline sector so that 
> > the drive can reallocate it.  A RAID scrub/resilver/etc will overwrite 
> > the sector with the correct contents which will allow this to happen.
> > (Otherwise there is no way for the drive to recover - if it knew what 
> > was stored there it wouldn't have an error in the first place.)
> >
> > If an error comes back then I'll replace the drive.  My drives are 
> > pretty large at this point so I don't like keeping unreliable drives 
> > around.  It just increases the risk of double failures, given that a 
> > large hard drive can take more than a day to replace.  Write speeds 
> > just don't keep pace with capacities.  I do have offline backups but I 
> > shudder at the thought of how long one of those would take to restore.
> >
> 
> 
> Sadly, I don't have RAID here but to be honest, I really need to have it 
> given the data and my recent luck with hard drives.  Drives used to get 
> dumped because they were just too small to use anymore.  Nowadays, they seem 
> to break in some fashion long before their usefulness ends their lives. 
> 
> I remounted the drives and did a backup.  For anyone running up on this, just 
> in case one of the files got corrupted, I used a little trick to see if I can 
> figure out which one may be bad if any.  I took my rsync commands from my 
> little script and ran them one at a time with --dry-run added.  If a file was 
> to be updated on the backup that I hadn't changed or added, I was going to 
> check into it before updating my backups.  It could be that the backup file 
> was still good and the file on my drive reporting problems was bad.  In that 
> case, I would determine which was good and either restore it from backups or 
> allow it to be updated if needed.  Either way, I should have a good file 
> since the drive claims to have fixed the problem.  Now let us pray.  :-D 
> 
> Drive isn't under warranty.  I may have to start buying new drives from 
> dealers.  Sometimes I find drives that are pulled from systems and have very 
> few hours on them.  Still, warranty may not last long.  Saves a lot of money 
> tho. 
> 
> USPS claims drive is on the way.  Left a distribution point and should update 
> again when it gets close.  First said Saturday, then said Friday.  I think 
> Friday is about right but if the wind blows right, maybe Thursday. 
> 
> I hope I have another port and power cable plug for the swap out.  At least 
> now, I can unmount it and swap without a lot of rebooting.  Since it's on 
> LVM, that part is easy.  Regretfully I have experience on that process.  :/
> 
> Thanks to all. 
> 
> Dale
> 
> :-)  :-) 
> 
> 
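On the --dry-run trick: the same idea can be approximated without rsync.  This
is only a rough Python sketch, not what Dale's script does.  It flags files
whose size and timestamp match the backup but whose contents differ, which is
roughly what adding --checksum to the rsync --dry-run run would surface.  The
names file_digest and suspicious_files are made up for illustration, and it
assumes the backup preserved modification times (as rsync -a does).

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def suspicious_files(source_root, backup_root):
    """List relative paths whose content differs from the backup even
    though size and modification time suggest nothing changed."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(backup_root, rel)
            if not os.path.isfile(dst):
                continue  # new file, nothing to compare against
            s, d = os.stat(src), os.stat(dst)
            unchanged = (s.st_size == d.st_size
                         and int(s.st_mtime) == int(d.st_mtime))
            # Same metadata but different bytes: one of the two copies is bad.
            if unchanged and file_digest(src) != file_digest(dst):
                suspects.append(rel)
    return sorted(suspects)
```

Anything it returns still needs a human to decide which copy is the good one
before the backup gets overwritten.
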
You can get SATA PCIe cards with up to 16 ports these days for pretty cheap.  So
as long as you have the power to run another drive or two there's not much
reason not to do RAID on the important stuff.  Also, the SATA protocol allows
for port multipliers, which are also pretty cheap.


One of my favorite things about BTRFS is the data checksums.  If the drive 
returns garbage, it turns into a read error instead of silently bad data.  
Also, if you can't do real RAID but have excess space, you can tell it to keep 
two copies of everything on the same drive (the DUP profile).  That doesn't 
help with total drive failure, but it does protect against the occasional 
failed sector, if you don't mind writes taking roughly twice as long.
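
To illustrate the idea (a toy model only, not how BTRFS is implemented): every
block is stored with a checksum, a read verifies it, and a second copy gives
the read something to fall back on.  The names write_block, read_block and
ChecksumError are hypothetical, and zlib.crc32 stands in for the crc32c that
BTRFS uses by default.

```python
import zlib


class ChecksumError(IOError):
    """Raised when no stored copy of a block matches its checksum."""


def write_block(data, copies=2):
    """Store `copies` copies of a block alongside its checksum."""
    return {
        "csum": zlib.crc32(data),
        "copies": [bytearray(data) for _ in range(copies)],
    }


def read_block(block):
    """Return the first copy whose checksum matches, else raise."""
    for copy in block["copies"]:
        if zlib.crc32(bytes(copy)) == block["csum"]:
            return bytes(copy)
    # Garbage from the drive becomes an explicit error, not silent corruption.
    raise ChecksumError("all copies failed checksum verification")


blk = write_block(b"important data")
blk["copies"][0][0] ^= 0xFF            # simulate one failed sector
recovered = read_block(blk)            # served from the second copy
```

With copies=1 the same corruption raises ChecksumError instead, which is the
"turns into a read error" case.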

LMP
