Hi list,

I could probably quote roughly 1 GB from this discussion, or top-post and append the whole thread for those of you who want to read it again, but I won't.
I just want to share some thoughts that seem to be missing from this discussion so far - for whatever use anyone can make of them.

* The BackupPC pool file system is, generally speaking, made up of file system blocks. No logical entity within the file system will be shifted forward or backward by any amount of space that is *not* an integral multiple of the file system block size - and probably not even that. (There may be exceptions such as reiserfs tails, but I doubt they're worth taking into account. Hmm, there are also directory entries - are they worth thinking about?)

* rsync calculates *rolling* block *checksums* in order to re-match data at an offset any number of bytes away. While "rolling" does not hurt much when applied to a whole file system (a bit of performance at most), it provides no benefit there either. The "checksums" part may actually hurt: there are bound to be collisions, which would, if I'm not mistaken, trigger a second pass over the "file(s)". That involves a *lot* of disk I/O, if not bandwidth. The approach someone mentioned - md5sums over individual large, non-rolling blocks - is bound to make much more sense for a file system than the rsync algorithm.

* *Data within the pool* is *never* modified. Files are created, linked to, and deleted. That's it. [Wait, that's not quite right: rsync checksums may be added later, but they're *appended*, aren't they? And only once, anyway.] A few *small* files are modified (log files, the backups files, perhaps a .bash_history file or other things outside the scope of BackupPC). *Existing directories* are modified as new pool files are added and expired ones removed. The same applies to pc/$host directories regarding new and expired backups. *New pc/$host/$num directory hierarchies* are created. *Inode information* is modified heavily. Forget about the ctime, which may vary between file systems. The *link count* is modified, meaning the inode is modified.
Not for every file in the pool, but for every file that was linked to (or unlinked from, because of an expiring backup). Other metadata, such as *block usage bitmaps*, is modified as well.

To sum it up, modifications since the last "backup" of the pool FS will consist of *new files and directories* and *changed file system metadata*. Presumably, your backups consist mostly of *data* (you might want to check that), and a large part of that will be static. 300 GB of daily changes on a 540 GB pool seems extremely unlikely; 8 GB seems more like it.

* VMDK files, in my experience, do not resemble raw disk images too closely. I only use the non-preallocated variant, and it seems to be well optimized for storing wide-spread writes (think "mkfs") in a small amount of data. A preallocated VMDK may be a completely different matter. But it's a proprietary format, isn't it? Is there a public spec? Do you know what design decisions were made, and why? In any case: how much data do you need to fully represent the changes made to the virtual file system? Does the VMDK change more or less than the file system it represents? By what factor? Is a VMDK also a logical block array, or may information shift by non-blocksize distances?

* You could probably use DRBD or NBD to mirror to a partition inside a VMware guest, presuming you really want to do that.

All of that said, I find the approach of incrementally copying the block device quite appealing, presuming it proves to work well (and I'm not yet convinced that rsync is the optimal tool to copy it with). It simply avoids some of the problems of a file-based approach, but it also has drawbacks of its own, meaning it won't work for everyone (e.g.
you can't change the FS type;
you need storage for the full device size, and bandwidth for the initial transfer;
you may need bandwidth for a full transfer on *restore*;
you'll need enough space for the image on restore, even if only a fraction of the FS is in use;
resizing the source FS may lead to a very long incremental transfer;
you can't back up anything other than the *complete* FS the pool is on;
it won't protect you from slowly accumulating FS corruption, as you're copying that into your backup;
...).

I'm really interested in hearing about your experiences with this, but as <backuppc-users> is currently running in degraded read-mostly mode for me due to sheer volume, don't expect me to join the discussion on a regular basis :).

Regards,
Holger

_______________________________________________
BackupPC-users mailing list
[email protected]
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
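[Editorial addendum] The "md5sums over individual large (non-rolling) blocks" idea discussed above can be made concrete with a short sketch. This is not BackupPC or rsync code; the block size, file paths, and function names are all illustrative assumptions. It updates a destination image in place by copying only the fixed-size blocks whose md5 digests differ - no rolling window, which is fine if (per the first point above) data only moves by whole file-system blocks:

```python
# Hypothetical sketch of fixed-block incremental image sync.
# Assumptions: both files already exist, and the 1 MiB block size is an
# arbitrary illustrative choice, not anything BackupPC prescribes.
import hashlib

BLOCK_SIZE = 1 << 20  # 1 MiB; large non-rolling blocks, unlike rsync's
                      # byte-granular rolling checksum

def block_digests(path, block_size=BLOCK_SIZE):
    """Return one md5 digest per fixed-size block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.md5(block).digest())
    return digests

def sync_blocks(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy only the blocks of src that differ from dst; return how many."""
    old = block_digests(dst_path, block_size)
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        index = 0
        while True:
            block = src.read(block_size)
            if not block:
                break
            if index >= len(old) or hashlib.md5(block).digest() != old[index]:
                dst.seek(index * block_size)
                dst.write(block)
                copied += 1
            index += 1
    # NOTE: does not truncate dst if src shrank - omitted for brevity.
    return copied
```

On a pool FS where only a few GB change per day, a scheme like this transfers just the changed blocks (plus the digest-scan I/O on both sides), and it never needs the collision-resolving second pass that a weak rolling checksum can trigger.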
