I've heard that there's a new rsyncd client in the works, which might
alleviate the frequent and severe problems encountered with rsync and high
memory consumption on clients with many files. The "solution" now seems to
be to split the backup into numerous rsync sets, with each one consuming
less memory.
I've got another suggestion for dealing with the problem, which would work
with the stock rsync client (and potentially be applicable to tar and samba
backups as well). The suggestion is to dynamically build "include" and
"exclude" lists to pass to rsync, and to make multiple (serial) calls to
rsync to back up all the files.
In detail:

Before doing the backup, build a data structure representing all the
directories in the filesystem and the number of files per directory. In
building the data structure, any directories specified in the configuration
file "exclude" list would be skipped.
Then, apply an algorithm to the data, taking into account:

    - the amount of memory on the server
    - the amount of free memory on the server
    - the number of directories
    - the number of files

to try to roughly split up all the directories into sets, sized based on
the amount of memory in the server. The algorithm should be weighted to
group directories as high "up" in the tree as possible. For example, it's
better to back up all of "/var" than to combine "/var/spool/cron" and
"/usr/local/src" in one set and the remainder of "/var" in another backup
set, even if doing all of "/var" has slightly more files (and more memory
usage) than the alternative.
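To make that concrete, here's a minimal sketch of one possible greedy
grouping pass, building on the per-directory counts from the scan above.
The bytes-per-file constant and the helper names are placeholders of mine,
not anything in BackupPC, and a fuller version would also descend into any
single subtree that is too large on its own:

    import os

    def build_backup_sets(counts, free_mem_bytes, root="/", bytes_per_file=300):
        """Greedily group top-level subtrees into backup sets whose estimated
        rsync memory use stays under the available memory.  'bytes_per_file'
        is a placeholder for rsync's approximate per-file memory cost; the
        real constant would come from the empirical logging described below."""
        budget = free_mem_bytes // bytes_per_file          # max files per set

        def subtree_files(top):
            # Total file count of a whole subtree, so directories are grouped
            # as high "up" in the tree as possible.
            return sum(n for d, n in counts.items()
                       if d == top or d.startswith(top.rstrip("/") + "/"))

        tops = sorted((d for d in counts
                       if d != root and os.path.dirname(d) == root),
                      key=subtree_files, reverse=True)

        sets, current, used = [], [], 0
        for top in tops:
            size = subtree_files(top)
            if current and used + size > budget:
                sets.append(current)                       # start a new set
                current, used = [], 0
            current.append(top)
            used += size
        if current:
            sets.append(current)
        return sets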
In addition, the algorithm could be weighted to give a small preference to
combining directory trees from different physical devices, in order to
improve performance by reducing I/O wait times.
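If that weighting were wanted, the physical device of each subtree is easy
to obtain; again, just a sketch, nothing BackupPC-specific:

    import os

    def device_of(path):
        """Return the device ID (st_dev) a path lives on, so the grouping
        pass above could prefer to mix subtrees from different physical
        devices within one backup set."""
        return os.stat(path).st_dev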
The algorithm for determining the ideal number of files per backup (and, by
implication, which directories will be grouped together) doesn't need to be
very sophisticated. There's no need to turn this into a true "knapsack
problem" and attempt to reach optimal backup sets, as long as there's a
real improvement over backing up all files in a single rsync session. I
think that putting fairly simple logging into BackupPC, to record available
memory before a backup begins, the number of files backed up, and the time
the backup takes (possibly scaling for the speed of the network interface),
would generate enough data (across the diverse configurations of clients
where BackupPC is installed) to get some very good empirical constants for
the algorithm.
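As one possible shape for that logging (the field names and the
Linux-specific /proc/meminfo read are my assumptions, not anything BackupPC
does today):

    import time

    def free_memory_kb():
        """Read free memory from /proc/meminfo (a Linux-specific assumption)."""
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemFree:"):
                    return int(line.split()[1])
        return None

    def log_backup_stats(logfile, host, num_files, started, finished, free_kb):
        """Append one record per backup: enough data to fit an empirical
        files-per-unit-of-memory constant for the set-sizing algorithm."""
        with open(logfile, "a") as f:
            f.write("%s %s files=%d free_kb=%s elapsed_s=%.1f\n" %
                    (time.strftime("%Y-%m-%d %H:%M:%S"), host,
                     num_files, free_kb, finished - started))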
The time savings from doing smaller backups, which also cause less of an
impact on both the backup client and server, should be far greater than the
time required to get the filesystem data and build the set of individual
rsync commands (excludes and includes).
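For what it's worth, turning one backup set into rsync arguments could be
as simple as the sketch below; the exact filter syntax should be checked
against the rsync version in use, so treat this as an assumption rather
than a recipe:

    def rsync_args_for_set(directories):
        """Build --include/--exclude options so a single rsync call (rooted
        at /) transfers only the directories in one backup set: include every
        ancestor of each target (so rsync descends to it), include the
        target's subtree, then exclude everything else."""
        args = []
        for d in directories:
            parts = [p for p in d.strip("/").split("/") if p]
            d = "/" + "/".join(parts)                  # normalized path
            path = ""
            for p in parts[:-1]:
                path += "/" + p
                args.append("--include=%s/" % path)    # ancestor directories
            args.append("--include=%s/" % d)           # the directory itself
            args.append("--include=%s/**" % d)         # everything below it
        args.append("--exclude=*")                     # prune all other paths
        return args

    # e.g. rsync_args_for_set(["/var", "/usr/local/src"]) produces filters
    # that transfer /var and /usr/local/src but nothing else under /.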
Since BackupPC already does an excellent job of "filling" individual backups
into a single "virtual full" backup that can easily be browsed for restores,
it shouldn't matter to the users that the backup sets are dynamically
generated, as long as the user-specified includes and excludes are obeyed.
This scheme offers several advantages:

    - It's dynamic: it will automatically adjust for changes in filesystem
      layouts or the number of files, for the amount of physical memory,
      and even for the load (free memory) on the backup client and server.

    - It's maintenance-free on the part of users. There's no need to create
      multiple rsync "targets", make sure that they are identically named
      on the client and server, and try to balance backup sizes per target.

    - It would work with existing implementations of rsync, so any issues
      with the future rsync daemon that's supposed to be embedded in
      BackupPC would be avoided. Similarly, the dynamic backup set
      partitioning could also be applied to the embedded rsync daemon,
      when it's ready.
I'm very interested in hearing your feedback about this proposal.
Mark
----
Mark Bergman
[EMAIL PROTECTED]
Seeking a Unix or Linux sysadmin position local to Philadelphia or via
telecommuting
http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40merctech.com