On Fri, Feb 13, 2009 at 7:41 PM, Bob Friesenhahn
<bfrie...@simple.dallas.tx.us> wrote:
> On Fri, 13 Feb 2009, Ross wrote:
>>
>> Something like that will have people praising ZFS' ability to safeguard
>> their data, and the way it recovers even after system crashes or when
>> hardware has gone wrong.  You could even have a "common causes of this
>> are..." message, or a link to an online help article if you wanted people to
>> be really impressed.
>
> I see a career in politics for you.  Barring an operating system
> implementation bug, the type of problem you are talking about is due to
> improperly working hardware.  Irreversibly reverting to a previous
> checkpoint may or may not obtain the correct data.  Perhaps it will produce
> a bunch of checksum errors.

Yes, the root cause is improperly working hardware (or an OS bug like
6424510), but ZFS is a copy-on-write system, so when errors occur in a
recent write, the vast majority of pools out there still hold huge
amounts of data that are perfectly valid and should be accessible.
Unless I'm misunderstanding something, reverting to a previous
checkpoint gets you back to a state that ZFS knows is good (or at
least one where ZFS can verify whether it's good or not).
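
To make the idea concrete, here is a toy sketch in Python of what I
mean by falling back to an earlier checkpoint.  It is not ZFS code and
does not follow the on-disk format: the Uberblock class, the SHA-256
checksum and the newest_valid() helper are all assumptions made up for
illustration.  A real implementation would also have to verify the
block tree that each uberblock points to, not just the uberblock
itself.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Uberblock:
    """Hypothetical stand-in for a pool checkpoint record."""
    txg: int          # transaction group number; higher means newer
    payload: bytes    # stand-in for the root pointer and metadata
    checksum: bytes   # checksum stored when the record was written


def make_uberblock(txg: int, payload: bytes) -> Uberblock:
    """Build a record whose stored checksum matches its payload."""
    return Uberblock(txg, payload, hashlib.sha256(payload).digest())


def is_valid(ub: Uberblock) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return hashlib.sha256(ub.payload).digest() == ub.checksum


def newest_valid(ring: Sequence[Uberblock]) -> Optional[Uberblock]:
    """Walk from newest to oldest and return the first record that verifies."""
    for ub in sorted(ring, key=lambda u: u.txg, reverse=True):
        if is_valid(ub):
            return ub
    return None  # nothing verifies: the pool really is unrecoverable
```

If the newest record was torn by a bad write, the loop simply skips it
and settles on the one before, which is the behaviour I'm arguing for.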

You have to consider that even with improperly working hardware, ZFS
has been checksumming data, so if that hardware has been working for
any length of time, you *know* that the data on it is good.

Yes, if you have databases or files there that were mid-write, they
will almost certainly be corrupted.  But at least your filesystem is
back, and it's in as good a state as it can be, given that your
hardware had to go wrong mid-write for the pool to end up in this
position.

And as an added bonus, if you're using ZFS snapshots, now that your
pool is accessible you have a bunch of backups available, so you can
probably roll corrupted files back to working versions.

For me, that is about as good as you can get in terms of handling a
sudden hardware failure.  Everything that is known to be saved to disk
is there, you can verify (with absolute certainty) whether data is ok
or not, and you have backup copies of damaged files.  In the old days
you'd need to be reverting to tape backups for both of these, with
potentially hours of downtime before you even know where you are.
Achieving that in a few seconds (or minutes) is a massive step
forwards.
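
As a rough illustration of that triage (again a toy model, not ZFS
code), the sketch below checks each file against the checksum recorded
at write time and falls back to a snapshot copy when the live copy
fails.  The dictionary layout, the recover() helper and the use of
CRC32 are assumptions chosen to keep the example short; ZFS uses its
own checksum algorithms.

```python
import zlib
from typing import Dict, Tuple

# (data, checksum recorded when the data was written)
Block = Tuple[bytes, int]


def block_ok(block: Block) -> bool:
    """True if the data still matches its recorded checksum."""
    data, stored = block
    return zlib.crc32(data) == stored


def recover(live: Dict[str, Block], snapshot: Dict[str, Block]) -> Dict[str, str]:
    """Classify every live file as ok, restorable from a snapshot, or lost."""
    report = {}
    for name, block in live.items():
        if block_ok(block):
            report[name] = "ok"
        elif name in snapshot and block_ok(snapshot[name]):
            report[name] = "restore from snapshot"
        else:
            report[name] = "lost"
    return report
```

Nothing here needs a tape: every decision is made from checksums that
are already on disk.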

> There are already people praising ZFS' ability to safeguard their data, and
> the way it recovers even after system crashes or when hardware has gone
> wrong.

Yes, there are, but the majority of these are praising the ability of
ZFS checksums to detect bad data, and to repair it when you have
redundancy in your pool.  I've not seen that many cases of people
praising ZFS' recovery ability - uberblock problems seem to have a
nasty habit of leaving you with tons of good, checksummed data on a
pool that you can't get to, and while many hardware problems are dealt
with, others can hang your entire pool.


>
> Bob
> ======================================
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
>
>
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss