Hi,
Joe Casadonte wrote on 22.01.2007 at 10:02:59 [Re: [BackupPC-users] Pooling &
Incrementals -- I'm confused]:
> I was thinking more about this, and I guess the functional difference
> [between full and incremental backups] is in the time it takes to do the
> backup:
The time it takes to do a backup is not the function of the backup; it's an
unavoidable nuisance ;-).
> From the archive perspective, there seems not to be a difference.
That is not correct. There are at least two differences:
1.) Full backups go to more trouble to determine which files have changed and
which haven't.
tar/smb:  all files are transferred and checked by BackupPC (full) vs.
          only files with newer timestamps transferred (incremental)
rsync(d): block checksum comparison (full) vs.
          timestamp/size comparison (incremental)
So you might miss changed files on an incremental backup that a full backup
would catch. In particular, an incremental tar/smb backup won't transfer
moved (renamed) files, because their timestamps suggest they needn't be
transferred (the config sketch after point 2 shows where the transfer method
is chosen).
2.) A full backup constitutes a point of reference for future backups.
Incrementals transfer everything that has changed since the last full
backup, so the amount of data transferred generally grows from day to day.
With rsync, a full backup transfers only insignificantly more than an
incremental would (thanks to the block checksums), but the incrementals that
follow it "restart at zero".
Thus, while a full backup will use (significantly) more resources in terms
of CPU usage and I/O bandwidth, it may save you significant amounts of
network bandwidth on subsequent incremental backups.
For a more detailed comment and a note or two on multi-level incremental
backups, see my note in the thread "Avoiding long backup times" from
20.01.2007.
Marty wrote on 21.01.2007 at 15:33:46 [Re: [BackupPC-users] Pooling &
Incrementals -- I'm confused]:
> The main effect,
> as you describe, is that file corruption *may* be caught and corrected.
> Unfortunately this only addresses corruption of the archive, and not in the
> host. To the contrary, corruption in the host is compounded by full backups,
> which silently supercede the uncorrupted backups and may not be caught in
> before the uncorrupted backups expire.
Well, that issue is the same for any backup scheme. If you expire old
backups, their data is lost. With tape backup schemes, you might keep older
tapes, say one full backup from each month. With BackupPC, you might use
exponential backup expiry to achieve much the same (automatically!). True,
you can't expand disk space as easily as you can buy new tapes, but as long
as files *don't* change, keeping extra copies costs you virtually nothing.
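To illustrate what I mean by exponential expiry, something like the following
in config.pl keeps full backups at roughly doubling spacing. The values are
only an example; check the documentation for the exact semantics of
$Conf{FullKeepCnt}.

    # Keep fulls at exponentially growing spacing (example values only):
    $Conf{FullKeepCnt}    = [4, 0, 4, 0, 0, 2];
    # roughly: the 4 most recent fulls, 4 older fulls spaced about
    # 4 full periods apart, and 2 very old fulls spaced about 32 full
    # periods apart (the spacing doubles with each list position).
    $Conf{FullKeepCntMin} = 1;     # never go below one full backup
    $Conf{FullAgeMax}     = 365;   # expire fulls older than this many days
                                   # (beyond the minimum count)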
I see your point that changed files missed by incremental backups *might*
indicate corruption on the client side, because a legitimate change would
*usually* update the modification time. But that is not always true, and
which of the two it is cannot be decided automatically and generically.
It should be possible to detect and log such a circumstance, I would expect,
but I have no idea how complicated that might be.
> The last time I checked I didn't see any validation of archives in backuppc.
From the comments in the source and a little experimenting (hey, and the
documentation :-) I get the impression that the file name is the md5sum of
the concatenation of the size of the uncompressed file and the contents of
the uncompressed file (for files < 256K that is, see BackupPC::Lib,
sub Buffer2MD5 for details). That's easy enough (algorithmically) to verify.
Expect quite a bit of disk I/O and CPU usage, though. For the fun of it, I'll
write a script that does some checking, though I wouldn't know what to do on
a mismatch except print a message.
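To give an idea of what such a check could look like, here is a rough sketch
of the digest calculation as I understand it. It only covers the simple case
(an uncompressed file below 256K) and is based on my reading of Buffer2MD5,
not on the real code, so take it with a grain of salt.

    #!/usr/bin/perl
    # Sketch only: compute the candidate pool file name for a small,
    # uncompressed file, i.e. MD5 of (uncompressed size . contents).
    use strict;
    use warnings;
    use Digest::MD5;

    my $file = shift @ARGV or die "usage: $0 <uncompressed file>\n";
    my $size = -s $file;
    die "this sketch only covers files < 256K\n" if $size >= 256 * 1024;

    open my $fh, '<', $file or die "open $file: $!\n";
    binmode $fh;
    my $data = do { local $/; <$fh> };
    close $fh;

    my $md5 = Digest::MD5->new;
    $md5->add($size);             # the uncompressed size first ...
    $md5->add($data);             # ... then the uncompressed contents
    print $md5->hexdigest, "\n";  # should match the pool file name

Comparing that output against the name of the pool file the data is stored
under would at least flag mismatches; the chunked digest for larger files and
the cpool compression would still need handling.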
> This is compounded
> by the problem of backing up backuppc archives, caused by the huge number of
> hardlinks, which AFAIK is still unsolved and which not a backuppc problem per
> se, but more likely a linux and/or rsync problem.
Backing up the complete pool system is somewhat of a problem, and it *is* a
BackupPC problem. It's not a bug in rsync, cp or tar; it's simply that a huge
amount of data needs to be traversed (with a huge number of disk seeks) in
order to re-create the hardlink relationships correctly. With knowledge of
how the BackupPC pool system works, it gets somewhat easier.
There's even BackupPC_tarPCCopy, which seems to perform some highly illegal tar
voodoo by creating tar archives with hardlinks referencing files not in the
archive. I obviously wouldn't admit having tried it and thus can't tell you
how much it speeds up the process, but it's supposed to make it feasible ;-).
And then there's the archive method for creating a snapshot of a host at one
point in time (corresponding to the tapes you would keep with a tape backup
solution). As far as I know, that includes generating parity information in
addition to a tar archive, optionally even split into fixed size parts that
would fit on CDs, DVDs, FDDs ... :)
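If you want to try that, the archive host's behaviour is controlled by a
handful of config.pl parameters, roughly like this. The names and values
below are only a sketch from memory; check the documentation for the exact
parameters and their meaning.

    # Example archive-host settings (values are illustrative only):
    $Conf{ArchiveDest}  = '/var/lib/backuppc/archive';  # where the tar goes
    $Conf{ArchiveComp}  = 'gzip';   # compress the archive ('none'/'gzip'/'bzip2')
    $Conf{ArchivePar}   = 5;        # percent of parity data to generate
    $Conf{ArchiveSplit} = 650;      # split into parts of this many MB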
Hope that helps.
Regards,
Holger