On Tue, Jan 28, 2020 at 08:02:15PM +1100, russ...@coker.com.au wrote:
> Having a storage device fail entirely seems like a rare occurrence.  The only
> time it happened to me in the last 5 years is a SSD that stopped accepting
> writes (reads still mostly worked OK).

it's not rare at all, but a drive doesn't have to be completely non-responsive
to be considered "dead".  It just has to consistently cause enough errors that
it results in the pool being degraded.

I recently had a Seagate IronWolf 4TB drive that would consistently
cause problems in my "backup" pool (8TB in two mirrored pairs of 4TB
drives, i.e. RAID-10, containing 'zfs send' backups of all my other
machines). Whenever it was under moderately heavy load, it would cause enough
errors to be kicked out, degrading the pool.  I didn't have a spare drive to
replace it with immediately, so I just "zpool clear"-ed it several times.
Running a scrub on that pool with that drive in it was guaranteed to degrade
the pool within minutes.
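
For anyone who hasn't been through that cycle, it looked roughly like this
(pool and device names below are placeholders, not my actual setup):

    # see which device is racking up read/write/checksum errors
    zpool status -v backup

    # reset the error counters and put the drive back into service
    zpool clear backup sdX

    # start a scrub -- with that drive in the pool, this reliably
    # degraded it again within minutes
    zpool scrub backup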

and, yeah, i moved it around to different SATA & SAS ports just in case it was
the port and not the drive. nope. it was the drive.

To me, that's a dead drive because it's not safe to use. it cannot be trusted
to reliably store data. it is junk. the only good use for it is to scrap it
for the magnets.


(and, btw, that's why I use ZFS and used to use RAID. Without redundancy from
RAID-[156Z] or similar, such a drive would result in data loss. Even worse,
without the error detection and correction from ZFS, such a drive would result
in data corruption).

> I've had a couple of SSDs have checksum errors recently and a lot of hard
> drives have checksum errors.  Checksum errors (where the drive returns what
> it considers good data but BTRFS or ZFS regard as bad data) are by far the
> most common failures I see of the 40+ storage devices I'm running in recent
> times.

a drive that consistently returns bad data is not fit for purpose. it is junk.
it is a dead drive.

> BTRFS "dup" and ZFS "copies=2" would cover almost all storage hardware
> issues that I've seen in the last 5+ years.

IMO, two copies of data on a drive you can't trust isn't significantly better
or more useful than one copy. It's roughly equivalent to making a photocopy
of your important documents and then putting both copies in the same soggy
cardboard box in a damp cellar.

If you want redundancy, use two or more drives. Store your important documents
in two or more different locations.

and backup regularly.
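
To make that concrete, the difference looks roughly like this (pool, dataset
and host names are made up for illustration):

    # copies=2: two copies of every block, but both still on the same
    # untrustworthy drive(s) -- it only helps with isolated bad sectors
    zfs set copies=2 somepool/important

    # real redundancy: two copies on two separate drives
    zpool create newpool mirror /dev/disk/by-id/drive-A /dev/disk/by-id/drive-B

    # and a backup is another copy on another machine entirely
    zfs send somepool/important@snap | ssh backuphost zfs recv backup/important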


> > If a drive is failing, all the read or write re-tries kill performance on
> > a zpool, and that drive will eventually be evicted from the pool. Lose
> > enough drives, and your pool goes from "DEGRADED" to "FAILED", and your
> > data goes with it.
>
> So far I haven't seen that happen on my ZFS servers.  I have replaced at
> least 20 disks in zpools due to excessive checksum errors.

I've never had a pool go to FAILED state, either.  I've had pools go to
DEGRADED *lots* of times.  And almost every time it comes after massive
performance drops due to retries - which can be seen in the kernel
logs. Depending on the brand, you can also clearly hear the head re-seeking as
it tries again and again to read from the bad sector.
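
If you want to watch it happening, something like this is usually enough
(pool name is a placeholder):

    # ZFS's view: per-device READ/WRITE/CKSUM error counters
    zpool status -v backup

    # the kernel's view: ATA/SCSI resets, retries and I/O errors
    dmesg | grep -iE 'ata[0-9]|I/O error|reset'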

More importantly, it's not difficult or unlikely for a pool to go from being
merely DEGRADED to FAILED.

A drive doesn't have to fail entirely for it to be kicked out of the pool, and
if you have enough drives kicked out of the same vdev (2 drives for a mirror or
raidz-1, 3 for raidz-2, 4 for raidz-3), then that entire vdev is FAILED, not
just DEGRADED, and the entire pool will likely be FAILED(*) as a result.

That's what happens when there are not enough working drives in a vdev to
store the data that's supposed to be stored on it.
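
A made-up example of the arithmetic, for a pool of two 4-drive raidz-1 vdevs:

    vdev-a: d1 d2 d3 d4   (raidz-1: survives losing any 1 drive)
    vdev-b: d5 d6 d7 d8

    lose d1         -> vdev-a DEGRADED, pool DEGRADED, no data lost
    lose d1 and d2  -> vdev-a FAILED, and because data is striped across
                       the top-level vdevs, the whole pool is gone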

And the longer you wait to replace a dead/faulty drive, the more likely it
is that another drive will die while the pool is degraded.  That's why best
practice is to replace the drive ASAP...and also why zfs and some other
raid/raid-like HW & SW support "spare" devices to replace failed drives
automatically.
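
The mechanics of both, with placeholder names (on Linux it's the ZFS event
daemon, zed, that brings a spare in automatically when a drive faults):

    # replace the faulted drive; resilvering starts straight away
    zpool replace backup old-disk new-disk

    # or keep a hot spare attached to the pool ahead of time
    zpool add backup spare /dev/disk/by-id/spare-disk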


(*) there are some pool layouts that are resistant (but not immune) to failing
- e.g. a mirror of any vdev with redundancy, such as a mirrored pair of raidz
vdevs.  "Resistant" still isn't "immune", which is why RAID of any kind is not
a substitute for backups.

craig

--
craig sanders <c...@taz.net.au>
