On Mon, Apr 13, 2009 at 02:20:34PM -0700, Rogan Creswick wrote:
> I started using Dirvish to manage backups of a couple important
> directories at work in early March, and I just realized that the
> nightly backups are taking up an enormous amount of space (about 30x
> more than expected ;).

You should understand that dirvish uses hard links to "duplicate"
copies of files that are unchanged from one backup run to the next.
Files that change, even slightly, over time are not identical, and
this includes growing log files, databases, git repositories, etc.
Change one byte in the binary blob that represents a git repository,
and it is a "new" file that must be stored again in full.
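
A rough illustration of why this blows up the space (the file names,
image dates, and paths below are made up, not dirvish's actual
invocation):

    $ echo hello > data.bin
    $ ln data.bin copy.bin        # hard link: same inode, no extra space
    $ ls -li data.bin copy.bin    # both names show the same inode number
    $ du -sh .                    # the contents are counted only once

    # rsync, which dirvish drives, does the equivalent with --link-dest:
    $ rsync -a --link-dest=../2009-04-12/tree/ /home/ 2009-04-13/tree/
    # files identical to yesterday's are hard-linked (nearly free);
    # a file that changed by even one byte is copied again in full.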

Yes, there are tools like rdiff-backup that compute and store the
differences between files.  This uses less room.  But before using
any backup tool, contemplate the end goal.  The goal is /not/ to
make /backups/, but to make occasional /restores/.  And what are
you restoring, under what conditions?  Sometimes it is a damaged
file, sometimes it is hacker damage, sometimes it is a zapped disk.
Often, you want to get at the version of a file made three weeks
ago, not last night's version.  Sometimes, the same problem that
damaged your original can damage your backup copy (if they are
both mounted and online).  

In all these cases, every restore is /unplanned/, and occurs at
times when you are busy and haven't set aside time to recover a
file or rebuild a disk.  The major advantage of a tool like 
dirvish is that the images are complete file systems, ready to
copy onto a hard drive.  The backups are searchable and executable
file systems.  And if you manage your data correctly, you can get
a heck of a lot of images onto a large hard drive (1500 GB hard
drives cost less than $120 on sale).
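
As a sketch of what that buys you (the vault name, image date, and
mount point here are made up), restoring a whole tree from a dirvish
image is just a copy, because the image is already an ordinary
directory tree:

    mount /dev/sdb1 /mnt/newdisk
    rsync -aH /backup/home-vault/20090413/tree/ /mnt/newdisk/
    # -a keeps permissions, owners, and times; -H keeps hard links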

But backups will have special cases, usually involving binary
blobs like vmware images and databases.  Often those already 
have "backup nature", the current copy already contains the old
copies, so you don't need more than one copy (plus a few 
redundant copies).  For that you can use the dirvish "expire"
feature, aggressively, in a different vault.  I have a "binary
blob" directory on my system disk with soft links to these big
files, and exclude them from the main backups.  That blob
directory is stored in a separate vault, and expired after a
few days.
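
Roughly what that looks like in the per-vault default.conf files (the
paths, vault names, and retention period are only examples, and I
believe expire-default can also be set per vault; check
dirvish.conf(5) for your version):

    # /backup/main-vault/dirvish/default.conf  (big blobs excluded)
    client: localhost
    tree: /home
    exclude:
            big-blobs/
            *.vmdk

    # /backup/blob-vault/dirvish/default.conf  (kept only briefly)
    client: localhost
    tree: /home/big-blobs
    expire-default: +4 days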
  
With tools like rdiff-backup, you get the daily copies, but
the old versions are not stored as a ready-to-run file system and
are not directly executable, so they are typically restored
manually, one file at a time.  This is
an option for people with inconsequential lives, who aren't
delaying anything important by spending an occasional day or
two rebuilding hard drives. 
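
For comparison, pulling back one file as it stood three weeks ago
with rdiff-backup looks something like this (the repository path and
file name are made up); multiply by however many files you need:

    rdiff-backup -r 3W /backup/rdiff-repo/home/notes.txt /tmp/notes.txt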

Personally, I go for big drives and appropriate backup strategies,
so that restores can be done with a few lines of commands,
possibly restoring vaults from a mix of times.   After a
belatedly discovered system compromise, or more likely a botched
update, I can restore yesterday's user files and last month's
system binaries.
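
That kind of mixed-time restore is just a few copies from different
image directories, along these lines (vault names and dates made up):

    rsync -aH /backup/home-vault/20090412/tree/ /mnt/newdisk/home/
    rsync -aH /backup/root-vault/20090315/tree/usr/ /mnt/newdisk/usr/
    rsync -aH /backup/root-vault/20090315/tree/etc/ /mnt/newdisk/etc/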

IMHO, no backup system is designed correctly as a pre-restore
system.  Whichever way you go, there will be inconveniences and
compromises, because these systems are designed to packrat data,
not to recover the right data to the right places in a hurry.
Whatever backup system you choose, you should test your restore 
procedure onto a blank hard drive (which you should have handy,
just in case), and make your decisions based on how easily that
goes.  Otherwise, backups are just a religious ritual, not a
prudent preparation for future restores.
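
A bare-bones version of that drill, with example device names that
you should double-check against your own hardware before running
anything:

    fdisk /dev/sdb                   # partition the blank test drive
    mkfs.ext3 /dev/sdb1              # make a file system on it
    mount /dev/sdb1 /mnt/test
    rsync -aH /backup/root-vault/20090413/tree/ /mnt/test/
    # then fix /mnt/test/etc/fstab, reinstall the boot loader, and see
    # whether the spare disk actually boots; time the whole exercise.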

Keith

-- 
Keith Lofstrom          kei...@keithl.com         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs