Hi,
Jason M. Kusar wrote on 30.04.2007 at 17:13:06 [[BackupPC-users] Filling
Backups]:
> I'm currently using BackupPC to back up our network. However, I have
> one server that has over a terabyte of data. I am currently running a
> full backup that has been running since Friday morning. I have no idea
> how much more it has to go.
>
> My question is if there is a way to avoid ever having to do a full
> backup again. Most of this data will never change. Is there a way to,
> for example, have every 7th incremental filled to become a full backup
> instead of actually performing a full backup?
No, not really. To answer this question, I have to go into a bit of detail.
With conventional backup systems, the most important difference between full
and incremental backups is probably the storage requirement. You don't want
to back up 1 TB of data every day if only a few MB actually change. That
would obviously be a huge waste of tape (or maybe disk) space.
With BackupPC, due to the pooling mechanism, this is not really a concern.
Unchanged files take up virtually no space even in full backups (only the
directory entries).
The next consideration with a network-based backup solution (vs. backup to
local tape media) would be network bandwidth. With the rsync transport, the
cost of a full backup is again virtually the same as that of an incremental
(only a few more block checksums) - at least in terms of network bandwidth
used.
So, what further benefit can be gained, or, put differently, why have
incremental backups at all?
It turns out you can save a significant amount of time, CPU power and disk
I/O bandwidth if you are less strict about determining what has changed
since the last backup.
- For tar-type transports (tar and smb), you can base your decision on the
file timestamps only. Instead of reading (and, in this case, transferring
over the network) all of the data, you only need to scan file metadata
(directories, inodes) to determine what has changed (a config sketch of
this follows below).
Problem: you miss new files with old timestamps (e.g. files moved from one
directory to another), and there is no way to figure out which files have
been deleted, because on the host where this decision would have to be
made, there is no trace left of them.
- For rsync-type transports (rsync and rsyncd), you can base your decision
on all of the file metadata and the file lists on the server and client
side. This way, you will even catch files that have changed size (but not
timestamp), or were moved into their location. You will also notice
deleted files (though I still need to understand how this information is
stored in the backup - probably in the attrib files?).
Still, there is an odd chance of missing changed files with identical
timestamp and size (anyone who doesn't believe this should consider two
files created in the same second with the same size/ownership/permissions
and then swapped with 'mv' ...).
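To make the tar case concrete, here is roughly what the relevant settings
look like in a stock config.pl (quoted from memory - the exact defaults vary
between BackupPC versions); the only difference between a full and an
incremental is the timestamp test:

    # full backups simply transfer everything in the file list:
    $Conf{TarFullArgs} = '$fileList+';
    # incrementals only transfer files whose mtime is newer than the
    # reference date; $incrDate is substituted by BackupPC:
    $Conf{TarIncrArgs} = '--newer=$incrDate+ $fileList+';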
What all of this means is: incremental backups are an optimization with the
- slight - possibility of getting things wrong. Note that the same is true
of "conventional" backup systems, as they also rely on timestamps (or have
you ever heard of a backup system that reads in the last full backup from
tape in order to figure out which files need to be included in the new
incremental? ;-). tar-type transports are more prone to mistakes, but even
rsync-type transports are not immune. Usually nothing bad will happen, and
you get considerable savings. But it is considered a good idea to make sure
once in a while that everything really is as it should be, i.e. to do a
full backup.
In fact, the implementation of BackupPC relies on this by making a full
backup the only absolute reference point, one that does not depend on
another backup:
1. Each incremental is relative to the previous incremental of the next
lower level (with level 0 being a full backup). If you only ever do
level 1 incremental backups, you will be re-transferring everything that
has changed since the last full backup - more and more data from backup
to backup.
2. If you keep doing incrementals of increasing level (a chain as in:
day 0 - full backup, day 1 - level 1 incremental, ... day N - level N
incremental), it becomes more expensive from day to day to merge them
into a view of the latest backup (a toy illustration of this merging
follows below). This merge needs to be done (at least) for viewing a
backup in the web interface, for doing the next level incremental based
on it, and for restoring the backup.
Also, all backups in the chain will need to be kept, because other
backups are based on them.
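To illustrate point 2: a toy sketch in plain Perl (not BackupPC code - the
file lists and versions are made up) of why a long chain gets more expensive
to merge. Building the view of the latest backup means layering every backup
in the chain, so each additional level adds another pass:

    use strict;
    use warnings;

    # Each element of @chain is "what this backup recorded", keyed by path.
    my @chain = (
        { 'a.txt' => 'v1', 'b.txt' => 'v1', 'c.txt' => 'v1' },  # day 0: full (level 0)
        { 'b.txt' => 'v2' },                                     # day 1: level 1 incremental
        { 'c.txt' => 'v2', 'd.txt' => 'v1' },                    # day 2: level 2 incremental
    );

    # Merge the whole chain, lowest level first; the longer the chain, the
    # more merging work per view, restore or next-level incremental.
    my %view;
    for my $backup (@chain) {
        %view = (%view, %$backup);   # later (higher-level) backups win
    }
    print "$_ => $view{$_}\n" for sort keys %view;

(A real merge also has to account for files deleted in an incremental, which
this toy version ignores.)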
So, there is really no way around doing full backups every now and then.
Now that we've covered why full backups make sense, it's up to you to agree
or disagree.
- Presuming you disagree, that means you are ultimately prepared to *risk
having backups* that *don't actually match your data*, because you
consider that risk small enough for your purposes.
Don't complain if that doesn't work out. *You have been warned.*
You can then probably change the implementation of a full backup
($Conf{SmbClientFullCmd} or $Conf{TarFullArgs} - probably more difficult
for rsync) so that it does the same as an incremental backup. You will need
to figure out a way to pass the correct timestamp though - $incrDate will
*not* work for a full backup.
Note that deleted files will *never* be accounted for with tar-type
transports. They will continue to exist in your backup views.
For rsync, you would probably be best off patching the source so that it
does not append '--ignore-times' for a full backup (that would apply to
all hosts, though!). You might also be able to define $Conf{RsyncClientCmd}
without making use of $argList.
Let me repeat that I do not consider any of this a good idea.
- As you hopefully agree, let's see how we can make full backups work even
in your case.
1. Split up your backup
If you can figure out which parts of your data will never change and
describe that adequately to BackupPC, create a separate host for that
part (see the ClientNameAlias setting). Do one full backup and no
further automatic backups - or perhaps an occasional incremental, just
to check that nothing has changed or to account for unforeseen changes;
re-transferring that changed data should be no issue.
For the small part of your data that does change, a "normal" backup
strategy should work. (A config sketch for this follows after this
list.)
2. Use multilevel incrementals to get a balance between retransmitting
data and merging views
If you can tolerate one full backup every three months, you might do a
monthly level 1 incremental, weekly level 2 incrementals and daily
level 3 incrementals, though 16/4/1-day intervals would probably be
better. (Again, see the sketch after this list.)
3. Use rsync (with checksum caching)
If you are rarely going to do full backups, you probably want the
advantages of the rsync transport (catching files with old timestamps
and deleted files - yes, even on rsync incrementals); a sketch of the
settings follows after this list.
If your bottleneck is network bandwidth and your hosts are fast enough
(meaning both the BackupPC server and the client host, both CPU-wise
and disk-wise), rsync might even speed up your full backups so much
that you can do them more often. Reading 1 TB from hard disk is still
going to take hours even in the best case, of course.
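Regarding 1., a minimal sketch of what a per-host config file for the
static part might look like. The host name, alias and path are made up, and
I'm quoting the option semantics from memory, so check the documentation
for your version:

    # hypothetical file pc/bigserver-static.pl; 'bigserver' is the real host
    $Conf{ClientNameAlias} = 'bigserver';        # both BackupPC hosts back up the same machine
    $Conf{BackupFilesOnly} = ['/data/archive'];  # only the part that (almost) never changes
    $Conf{FullPeriod}      = -1;                 # -1 should disable automatic backups;
                                                 # you can still start one manually

The "normal" host would then exclude /data/archive (e.g. via
$Conf{BackupFilesExclude}) and keep its usual schedule.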
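Regarding 2., a rough sketch of such a schedule, assuming a BackupPC
version with multilevel incremental support ($Conf{IncrLevels}, i.e. 3.0 or
later); the numbers are only illustrative and will need tuning:

    $Conf{FullPeriod} = 89.97;   # one full roughly every three months
    $Conf{IncrPeriod} = 0.97;    # attempt an incremental every day
    # levels of successive incrementals, cycling until the next full;
    # this 28-entry pattern approximates "level 1 monthly, level 2 weekly,
    # level 3 daily":
    $Conf{IncrLevels} = [1, (3) x 6, (2, (3) x 6) x 3];

You will probably also want to raise $Conf{IncrKeepCnt} so that enough of
the chain is kept around.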
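Regarding 3., switching to rsync and enabling checksum caching should, as
far as I recall, look roughly like this (the fixed seed value is the one
from the BackupPC documentation; caching also needs a sufficiently recent
File::RsyncP):

    $Conf{XferMethod} = 'rsync';
    $Conf{RsyncArgs} = [
        # ... keep the stock argument list from config.pl here ...
        '--checksum-seed=32761',   # enables block/file checksum caching
    ];
    $Conf{RsyncRestoreArgs} = [
        # ... keep the stock argument list here as well ...
        '--checksum-seed=32761',
    ];
    # optionally re-verify a small fraction of cached checksums on fulls:
    $Conf{RsyncCsumCacheVerifyProb} = 0.01;

With caching, the BackupPC server no longer has to re-read and re-checksum
every unchanged pool file during a full, which saves a lot of disk I/O on
the server side.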
What transport method are you currently using? You might want to find out
how long an incremental takes. How many files does your 1 TB of data
consist of? How much memory do the BackupPC server and the client machine
have?
> Also, a slightly unrelated question.
You've probably already noticed that it's actually very related.
> How does the CGI interface handle deleted files when viewing incremental
> backups for the SMB transport mechanism? Does it just wait for the
> next full backup so that the files disappear?
Yes, it must, as there is no way to notice the disappearance of files on an
incremental.
> Is this any different for the rsync mechanism?
Yes, as pointed out.
You wrote later:
> I assume also then that files deleted off of the server will not be
> deleted off the backup since the default rsync options don't include
> any of the delete options.
The delete options are not needed, because the remote rsync instance (on
the backed-up client) is not supposed to delete anything. The local side is
File::RsyncP, not rsync.
Regards,
Holger
P.S.: You don't want to do only full backups with 1 TB of data. That advice
was not good.