Holger Parplies wrote:
> Hi,
>
> Jason M. Kusar wrote on 30.04.2007 at 17:13:06 [[BackupPC-users] Filling
> Backups]:
>
>> I'm currently using BackupPC to back up our network. However, I have
>> one server that has over a terabyte of data. I am currently running a
>> full backup that has been running since Friday morning. I have no idea
>> how much more it has to go.
>>
>> My question is if there is a way to avoid ever having to do a full
>> backup again. Most of this data will never change. Is there a way to,
>> for example, have every 7th incremental filled to become a full backup
>> instead of actually performing a full backup?
>>
>
> no, not really. To answer this question, I have to go into a bit of detail.
>
> With conventional backup systems, the most important difference between full
> and incremental backups is probably storage requirement. You don't want to
> back up 1 TB of data every day if only a few MB actually change. That would
> obviously be a huge waste of tape (or maybe disk) space.
>
> With BackupPC, due to the pooling mechanism, this is not really a concern.
> Unchanged files take up virtually no space even in full backups (only the
> directory entries).
>
> The next consideration with a network based backup solution (vs. backup to
> local tape media) would be network bandwidth. With the rsync transport, the
> cost for a full backup is again virtually the same as for an incremental
> (only some additional block checksums) - at least in terms of network
> bandwidth.
>
> So, what further benefit can be gained, or, put differently, why have
> incremental backups at all?
>
> It turns out you can save a significant amount of time, CPU power and disk
> I/O bandwidth if you are less strict about determining what has changed
> since the last backup.
>
> - For tar type transports (tar and smb) you can base your decision on the
> timestamp(s) of files only. Instead of reading (and, in this case,
> transferring over the network) all of the data, you only need to scan
> file metadata (directories, inodes) to determine what has changed (see
> the config sketch after this list).
> Problem: you miss new files with old timestamps (e.g. moved from one
> directory to another), and there is no way to figure out which files have
> been deleted, because on the host where this decision would have to be
> made, there is no trace left of them.
>
> - For rsync type transports (rsync and rsyncd) you can base your decision
> on all of the file metadata and the file lists on server and client side.
> This way, you will even catch files that have changed size (but not
> timestamp), or were moved into their location. You will also notice
> deleted files (though I still need to understand how this information is
> stored in the backup - probably in the attrib files?).
> Still, there is an odd chance of missing changed files with identical
> timestamp and size (anyone who doesn't believe this, consider two files
> created in the same second with the same size/ownership/permissions and
> then swapped with 'mv' ...).
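>
> To make the tar case concrete: the stock configuration expresses exactly
> this timestamp-based decision. A minimal sketch from memory (check the
> config.pl shipped with your version):
>
>   # full backup: transfer everything in the file list
>   $Conf{TarFullArgs} = '$fileList+';
>   # incremental: let tar pick only files newer than the reference
>   # backup's start time
>   $Conf{TarIncrArgs} = '--newer=$incrDate $fileList+';
>
> tar decides from file timestamps alone here, which is exactly where the
> problems described above come from.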
>
>
> What all of this means is: incremental backups are an optimization with the
> - slight - possibility of getting things wrong. Note that the same is true
> in "conventional" backup systems, as they also rely on timestamps (or have
> you ever heard of a backup system that reads in the last full backup from
> tape in order to figure out which files need to be included in the new
> incremental? ;-). tar type transports are more prone to making mistakes, but
> even rsync type transports are not immune. Usually nothing bad will happen,
> and you get considerable savings. But it is considered a good idea to make
> sure once in a while that everything is really as it should be, i.e. do a
> full backup.
>
> In fact, the implementation of BackupPC relies on this by making a full
> backup the only absolute reference point that does not rely on another
> backup:
>
> 1. Each incremental is relative to the previous incremental of next lower
> level (with level 0 being a full backup). If you keep doing only level 1
> incremental backups, you will be re-transferring everything that has
> changed since the last full backup - more and more data from backup to
> backup.
> 2. If you keep doing incrementals of increasing level (a chain as in:
> day 0 - full backup, day 1 - level 1 incremental, ... day N - level N
> incremental), they will become more and more expensive to merge into a
> view of the latest backup. This needs to be done (at least) for viewing a
> backup in the web interface, for doing the next level incremental based
> on it, and for restoring the backup.
> Also, all backups will need to be kept, because other backups are based
> on them.
>
> So, there is really no way around doing full backups every now and then.
>
>
Ok, that helps explain quite a bit that I was unclear on. So, just to
make sure I'm absolutely clear: doing a full backup doesn't need to
transfer all the unchanged files because they still exist in the pool,
correct? But how does File::RsyncP tell that a file is unchanged without
actually transferring it to the server to compute the checksums? You
mentioned something about using rsync with checksum caching, but I
couldn't find any options for it.
> Now that we've covered why full backups make sense, it's up to you to agree
> or disagree.
>
> - Presuming you disagree, that means you are ultimately prepared to *risk
> having backups* that *don't actually match your data*, because you
> consider that risk small enough for your purposes.
>
> Don't complain if that doesn't work out. *You have been warned.*
>
> You can then probably change the implementation of a full backup
> ($Conf{SmbClientFullCmd} or $Conf{TarFullArgs} - probably more difficult
> for rsync) in that you make them do the same as an incremental backup. You
> will need to figure out a way to pass the correct timestamp though -
> $incrDate will *not* work for a full backup.
> Note that deleted files will *never* be accounted for with tar type
> transports. They will continue to exist in your backup views.
>
> For rsync, you would probably be best off to patch the source to not
> append '--ignore-times' for a full backup (would apply to all hosts
> though!). You might also be able to define $Conf{RsyncClientCmd} without
> making use of $argList.
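>
> For reference, the stock command looks something like this (from memory,
> check your config.pl):
>
>   $Conf{RsyncClientCmd} = '$sshPath -q -x -l root $host $rsyncPath $argList+';
>
> $argList is where BackupPC injects the extra arguments for a full backup
> (notably --ignore-times), so spelling out the argument list yourself
> instead of using $argList is what would defeat it.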
>
> Let me repeat that I do not consider any of this a good idea.
>
>
> - As you hopefully agree, let's see how we can make full backups work even
> in your case.
>
> 1. Split up your backup
> If you can figure out which parts of your data will never change and
> describe that adequately to BackupPC, create a separate host for that
> part (see the ClientNameAlias setting). Do one full backup and no
> further automatic backups. Or perhaps an occasional incremental, just
> to check that nothing has changed or account for unforeseen changes.
> Re-transferring changed data should be no issue.
> For the small part of your data that changes, a "normal" backup
> strategy should work.
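>
> As a sketch (host name and path are made up; check the documentation for
> ClientNameAlias and FullPeriod before relying on this):
>
>   # hosts file: add a second "host" that is really the same machine
>   #   bigserver-static   0    jason
>
>   # pc/bigserver-static/config.pl:
>   $Conf{ClientNameAlias} = 'bigserver';        # point at the real client
>   $Conf{BackupFilesOnly} = ['/data/archive'];  # only the static part
>   $Conf{FullPeriod} = -1;  # a negative value should disable automatic
>                            # backups; manual ones still work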
>
>
Unfortunately, this wouldn't work very well without major
re-organization of my data. I'd like to avoid this if possible.
> 2. Use multilevel incrementals to get a balance between retransmitting
> data and merging views
> If you can tolerate one full backup every three months, you might do a
> monthly level 1 incremental, weekly level 2 incrementals and daily
> level 3 incrementals, though 16/4/1 day intervals would probably be
> better.
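>
> A minimal sketch of the 16/4/1 variant, assuming BackupPC 3.x
> ($Conf{IncrLevels} did not exist before 3.0):
>
>   $Conf{FullPeriod} = 88.97;  # one full roughly every 3 months
>   $Conf{IncrPeriod} = 0.97;   # one incremental per day
>   # 16-day cycle: level 1 on day 1, level 2 every 4th day, level 3 daily
>   $Conf{IncrLevels} = [1, 3, 3, 3,
>                        2, 3, 3, 3,
>                        2, 3, 3, 3,
>                        2, 3, 3, 3];
>
> Make sure $Conf{IncrKeepCnt} is large enough that backups a chain depends
> on don't expire out from under it.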
>
>
I'll play around with this. Files are only rarely deleted or moved
(though it does happen occasionally), so infrequent fulls should be ok.
> 3. Use rsync (with checksum caching)
> If you are rarely going to do full backups, you probably want the
> advantages of rsync transport (catching files with old timestamps and
> deleted files - yes, even on rsync incrementals).
> If your bottleneck is network bandwidth and your hosts are fast enough
> (meaning both the BackupPC server and the client host, both CPU-wise
> and disk-wise), rsync might even speed up your full backups so much
> that you can do them more often. Reading 1 TB from hard disk is still
> going to take hours even in the best case, of course.
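>
> Checksum caching is enabled through the rsync arguments - a minimal
> sketch, assuming a recent rsync on the client and a recent File::RsyncP
> on the server (check the documentation for the exact version
> requirements):
>
>   # append to the stock lists rather than replacing them:
>   push @{$Conf{RsyncArgs}},        '--checksum-seed=32761';
>   push @{$Conf{RsyncRestoreArgs}}, '--checksum-seed=32761';
>   # optionally re-verify a fraction of cached checksums on each full:
>   $Conf{RsyncCsumCacheVerifyProb} = 0.01;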
>
>
I'm using rsync for my largest servers, but I'm not clear on how to
enable checksum caching. The servers sit on a gigabit network, but the
backup server's CPU sits at 100% while the backup is running. The server
I'm backing up is significantly faster and only uses about 15%. The
backup server has a gig of memory; the backed-up server has 2 gigs.
> What transport method are you currently using? You might want to try out how
> long an incremental takes you. How many files is your 1 TB of data made up
> of? How much memory do the BackupPC server and the client machine have?
>
>
I'm also using smb for my client workstations, but for the most part,
they have a small enough dataset that even a full backup takes less than
45 minutes.
>> Also, a slightly unrelated question.
>>
>
> You've probably already noticed that it's actually very related.
>
>
>> How does the CGI interface handle deleted files when viewing incremental
>> backups for the SMB transport mechanism? Does it just wait for the
>> next full backup so that the files disappear?
>>
>
> Yes, it must, as there is no way to notice the disappearance of files on an
> incremental.
>
>
>> Is this any different for the rsync mechanism?
>>
>
> Yes, as pointed out.
>
> You wrote later:
>
>> I assume also then that files deleted off of the server will not be
>> deleted off the backup since the default rsync options don't include
>> any of the delete options.
>>
>
> The delete options are not needed, because the remote rsync instance (on the
> backed up client) is not supposed to delete anything. The local side is
> File::RsyncP, not rsync.
>
> Regards,
> Holger
>
>
Thanks for the input. Things are starting to become more clear.
--Jason