On 7/7/06, Darren Reed <[EMAIL PROTECTED]> wrote:
> To put the cat amongst the pigeons here, there were those
> within Sun that tried to tell the ZFS team that a backup
> program such as zfsdump was necessary, but we got told
> that amanda and other tools were what people used these
> days (in corporate accounts) and therefore zfsdump and
> zfsrestore weren't necessary...
>
> Why aren't you using amanda or something else that uses
> tar as the means by which you do a backup?

In any environment where you have lots of systems that need to be
backed up, it is reasonable to expect that people will have Amanda,
NetBackup, etc.  However, it would be somewhere between really nice
and essential to have a restore mechanism that can preserve snapshots
(and the space savings of clones).

For example, suppose I had a 1 TB storage pool that started out with
500 GB of data.  Each developer/user/whatever gets a clone of that 500
GB, but each one only changes about 0.5% of it (2.5 GB).  If there are
100 clones, the pool should be about 75% used (500 GB plus 100 x 2.5 GB).
However, a full backup using tar is going to back up 50 TB rather
than 750 GB, and a space-efficient restore is impossible.  Perhaps an
easier to understand scenario would be a 73 GB storage pool that has
/, a "master" full root zone, and 50 zones that resulted from "zfs
clone" of the master zone.

It seems as though there are a few possible ways to work with this:

1) A ZFS-to-NDMP translator.  This would (nearly) automatically get
you space-efficient backups from every backup product that can already
talk NDMP to a NetApp filer.  (Based upon a rough understanding of
NDMP - someone else will likely correct me.)  To me, this sounds like
the most "enterprise" type of solution.

2) Disk to disk to tape.  Use the appropriate "zfs send" commands to
write data streams as files on a different file system, then use your
favorite tar-based backup solution to get those stream files to tape.
This roughly doubles the amount of storage you need (perhaps
compression and larger, slower disks make that palatable), but it may
make scheduling backups easier (more concurrent streams writing to
disk during off-peak time).  Restores are much quicker while the
streams are still on disk, but take longer if they have to come back
from tape first.
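
A minimal sketch of what option 2 could look like (pool, snapshot, and
path names are invented, and /dev/rmt/0 stands in for whatever your
backup product actually writes to):

    # Take a snapshot and stage the full send stream as a plain file.
    zfs snapshot tank/data@2006-07-07
    zfs send tank/data@2006-07-07 > /backup/streams/tank-data-2006-07-07.zfs

    # Later runs only need the incremental stream between snapshots.
    zfs snapshot tank/data@2006-07-08
    zfs send -i tank/data@2006-07-07 tank/data@2006-07-08 \
        > /backup/streams/tank-data-2006-07-08.zfs

    # Let the tar-based product sweep the staging area to tape.
    (cd /backup/streams && tar cf /dev/rmt/0 .)

    # Restoring from the staged (or recovered) streams brings the
    # dataset back with its snapshots intact.
    zfs receive tank/restored < /backup/streams/tank-data-2006-07-07.zfs
    zfs receive tank/restored < /backup/streams/tank-data-2006-07-08.zfs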

3) The ability for ZFS to recognize duplicate blocks and store only
one copy.  I'm not sure of the best way to do this, but my thought was
to have ZFS remember the checksum of every block.  As new blocks are
written, the checksum of each new block is compared to the known
checksums.  If there is a match, a full comparison of the block is
performed, and if it really is a match the data is not stored a second
time.  Even so, a tar-based backup and restore would still read and
write the full 50 TB in the scenario above.
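
Purely to illustrate the checksum-then-verify idea at user level (this
is a toy using GNU split, sha256sum, and uniq, not how ZFS itself
would implement it):

    # Chop a file into 128 KB "blocks", checksum each one, and list the
    # checksums that occur more than once.  A real implementation would
    # still do a byte-for-byte comparison (e.g. cmp) of the matching
    # blocks before sharing them, since checksums alone can collide.
    mkdir /tmp/blocks && cd /tmp/blocks
    split -b 131072 /path/to/some/file
    sha256sum * | sort | uniq -w 64 -c | awk '$1 > 1'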

I think option 3 is a pretty poor way to solve the backup/restore
problem, because there could be way too much I/O when lots of
snapshots or clones are being backed up.  It would quite likely be
worthwhile, though, in the case of cloned zones when it comes time to
patch.  Suppose you have 50 full root zones cloned from one master on
a machine.  For argument's sake, let's say that the master zone uses
2 GB of space and each clone uses an additional 200 MB, so the 50
zones plus the master start out using a total of 12 GB.  Let's assume
that the recommended patch set is 300 MB in size.  This implies that
each zone will use somewhere between 0 (already fully patched) and
450 MB (including compressed backout data) each time it is patched.
Taking the middle of the road (about 225 MB per zone), each time the
recommended cluster is applied another 11 GB or so of disk is used.
At this rate, it doesn't take long to burn through a 73 GB disk.
However, if ZFS could "de-duplicate" the blocks, each patch cycle
would take up only a couple hundred megabytes.  But I guess that is
off-topic. :)

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
