Thanks for the replies. Firstly, I think I should reiterate a few things I mentioned in the first post.
I haven't actually used BackupPC yet; I've mainly read through its docs and am trying to judge how well it and its storage system would work in our environment. I'm asking questions on this list first, to get an idea of how well it handles the kind of issues I've run into so far (with things like hardlinks to huge filesystems), before I spend more time playing with BackupPC and looking into migrating our backups to it. And like I said before, this isn't a BackupPC-specific complaint, more a general problem with hardlink-based backup systems (as opposed to rdiffs, or various other schemes). So I'm checking how sysadmins typically handle these kinds of issues.

Also, I'm not too experienced with backup "best practices", methodologies, etc. Still learning, and seeing what works best. And heh, our (relatively small) company didn't even have a real backup system before, and I'm still the only person here who seems to take backups seriously >_>. Fortunately, the boss has started seeing the light (after a near disaster in the server room) and acquired some more hardware. But nobody besides me seems to have time to actually set things up and make sure they're running. And I'm not even one of the network admins/tech support; I'm actually a programmer, and I was never actually asked to work on the backups ^^; The actual network admins/tech support don't really know much about backups D: (or have time to work with them).

Anyway, hopefully the above gives you a better idea of my angle on this. I'm not trying to criticize BackupPC, but rather to figure out what kind of backup scheme is going to work here (and be easy to admin/diagnose/hack/etc), whether that is BackupPC or something else (that may or may not use hardlinks).

On Tue, Aug 18, 2009 at 5:35 PM, Les Mikesell <lesmikes...@gmail.com> wrote:
> Why not just exclude the _TOPDIR_ - or the mount point if this is on
> its own filesystem?

Because most of the interesting files on the backup server (at least in my case) are the files being backed up. I'm a lot more interested in being able to quickly find those files than random stuff under /etc, /usr, etc.

> There's not a good way to figure out which files might be in all of
> your backups and thus not help space-wise when you remove any
> instance(s) of it. But the per-host, per-run stats where you can see
> the rate of new files being picked up and how much they compress is
> very helpful.

Thanks for this info. At least with per-host stats, it's easier to narrow down where to run du if I need to, instead of running it over the entire backup partition.

A couple of random questions:

1) How well does BackupPC work when you manually make changes to the pool behind its back (like removing a host, or some of a host's history, via the command line)? Can you make it "resync/repair" its database?

2) Is there a recommended approach for "backing up" the BackupPC databases, in case they get corrupted and so on? Or is a simple rsync safe?

3) Is it possible to use BackupPC's logic on the command line, with a bunch of command-line arguments, without setting up config files? That would be awesome for scripting and so on, for people who want to use just parts of its logic (like the pooled system, for instance) rather than the entire backup system. I tend to prefer that kind of "unix tool" design.
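By "pooled system" I mean roughly the hash-and-hardlink idea sketched below. This is just a toy illustration of what I'd like to be able to reuse standalone - the paths are made up, and real pools (BackupPC's included) use smarter naming and collision handling than this:

    #!/bin/sh
    # Toy content-addressed pooling: hardlink identical files into a pool
    # keyed by checksum.  All paths are placeholders; this is NOT
    # BackupPC's actual pool layout or naming scheme.
    POOL=/backups/pool
    SNAP=/backups/pc/somehost/2009-08-18        # one snapshot tree
    find "$SNAP" -type f | while read -r f; do
        sum=$(md5sum "$f" | awk '{print $1}')
        if [ -e "$POOL/$sum" ]; then
            ln -f "$POOL/$sum" "$f"   # duplicate: relink to the pooled copy
        else
            ln "$f" "$POOL/$sum"      # new content: this file becomes the pooled copy
        fi
    done

Having that kind of piece available as its own small command, rather than only inside the full system, is what I meant by "unix tool" design.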
> Of course, but you do it by starting with a smaller number of runs
> than you expect to be able to hold. Then after you see that the space
> consumed is staying stable you can adjust the amount of history to
> keep.

Ah right. I think this is a fundamental difference in approach. With the backup systems I've used before, space usage keeps growing forever until you take steps to fix it, either manually or with some kind of scripting. So far I haven't added scripting, so I rely on du to know where to manually recover space.

Basically, I was using rdiff-backup for a long time. That tool keeps all the history until you run it with a command-line argument to prune the oldest revisions. And also, I don't see a great need to proactively recover space most of the time. The large majority of servers/users/etc have a relatively small amount of change, so it's kind of cool to be able to get *any* of the earlier daily snapshots for the last few years. Although, ironically, the servers with the largest amount of churn (and hard drive usage on the backup server) are the ones you'd actually want to keep old versions for (like yearlies, monthlies, etc). But with rdiff-backup that isn't really possible without some major repo surgery :-). You end up throwing away all the oldest versions when space runs low.

Also, I'm influenced by revision control tools like git/svn/etc. I don't like to throw away old versions unless it's really necessary. And if you have a lot of hard drive space on the backup server, you may as well actually make use of it to store as many versions as possible, and then only remove the oldest versions when needed. The above backup philosophy (based partly on rdiff-backup's limitations) has served me well so far, but I guess I need to unlearn some of it, particularly if I want to use a hardlink-based backup system.

> One other thing - backuppc only builds a complete tree of links for
> full backups which by default run once a week with incrementals done
> on the other days. Incremental runs build a tree of directories but
> only the new and changed files are populated, with a notation for
> deletions. The web browser and restore processes merge the backing
> full on the fly and the expire process knows not to remove fulls
> until the incrementals that depend on it have expired as well. That,
> and the file compression might take care of most of your problems.

Ah, very interesting info, thanks. I read the info on incrementals in the docs and mainly picked up that "rsync is a good thing" :-) A couple of questions, pardon my noobiness: if rsync is used, what is the difference between an incremental and a full backup? I.e., do "full" backups copy all the data over (if using rsync), or just the changed files? And what kind of disadvantage is there if you only do (rsync-based) incrementals and never make full backups?

On Tue, Aug 18, 2009 at 5:49 PM, Jon Craig <cannedspam.c...@gmail.com> wrote:
> A personal desire on your part to use a specific tool to get
> information that is presented in other ways hardly constitutes a
> problem with BackupPC.

Again, I'm not criticizing BackupPC specifically, and indeed it seems that BackupPC has ways to reduce the problem - specifically incremental backups, as opposed to a large number (hundreds/thousands) of "full" snapshot directories, each containing a huge number of hardlinks (possibly millions), for several such servers. My angle is that Linux sysadmins have certain tools they like to use, and saying they can't use them effectively because of the backup architecture is kind of problematic.
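To make the du problem concrete (the directory names and sizes below are made up): GNU du counts each inode only once per invocation and charges it to whichever path it encounters first, so over a set of hardlink snapshots the per-directory numbers depend mostly on argument order:

    # Two daily snapshots that are almost entirely hardlinks to the same files.
    du -sh /backups/host1/snap.2009-08-17 /backups/host1/snap.2009-08-18
    #    40G   /backups/host1/snap.2009-08-17
    #   120M   /backups/host1/snap.2009-08-18
    # Swap the argument order and the big number moves to the other snapshot;
    # du -l (--count-links) charges every link instead, which overstates usage.
    # Neither answer tells you how much space deleting one snapshot would free.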
I guess, though, that the philosophy behind rdiff-backup (keep every single version until you want to start removing the oldest) isn't really compatible with BackupPC, or with other schemes that keep an actual filesystem entry for every version of every file, even when there are no changes in those files. I probably need to think more about using a more traditional scheme (keep a fixed number of backups: X daily, Y weekly, Z monthly, etc), instead of "keep versions forever, until you need to start recovering hard drive space".

> The linking structure within BackupPC is the "magic" behind deduping
> files. That it creates a huge number of directory entries with a
> resulting smaller number of inode entries is the whole point.

Yeah, I like that. But the problem I see is this (from the BackupPC docs):

"Therefore, every file in the pool will have at least 2 hard links (one for the pool file and one for the backup file below __TOPDIR__/pc). Identical files from different backups or PCs will all be linked to the same file. When old backups are deleted, some files in the pool might only have one link. BackupPC_nightly checks the entire pool and removes all files that have only a single link, thereby recovering the storage for that file."

Therefore, if you want to keep tonnes of history (like every day for the past 3 years) for a server with lots of files, it sounds like you need to have a huge number of filesystem entries. I think if I wanted to use BackupPC and still be able to use du and friends effectively, I'd need to do some combination of:

1) Use incrementals for most of the backups, to limit the number of hardlinks created, as Les Mikesell described.

2) Stop trying to keep history for every single day for years (rather keep one for the last X days, last Y weeks, Z months, etc). This would also mean spending less time managing space. Although at the moment that only comes up every few weeks/months, and it had been pretty fast with du & xdiskusage, at least until I switched over from rdiff-backup to a "make a hardlink snapshot every day" process :-(.

> Use the status pages to determine where your space is going. It
> gives you information about the apparent size (full size if you
> weren't de-duping) and the unique size (that portion of each backup
> that was new). This information is a whole lot more useful than
> whatever you're gonna get from DU. DU takes so long because it's a
> dumb tool that does what it's told and you are in effect telling it
> to iterate across each server multiple times (1 per retained backup)
> for each server you backup. If you did this against the actual
> clients the time would be similar to doing it against BackupPC's
> topdir.

And furthermore, hardlink-based storage makes du output ambiguous even when the run time isn't an issue. That is another thing about hardlink-based backups that annoys me (compared to when I was using rdiff-backup), and one of the reasons I'm currently running my own very hackish "de-duping" script on our backup server.

It's nice that BackupPC maintains these stats separately, although it's kind of annoying (IMO) that you have to go through its frontend to see this info, rather than being able to tell from standard Linux commands (for scripting purposes and so on). It also bothers me that stats like that can potentially get out of sync with the hard drive (maybe you delete part of the pool by mistake). Is there a way to make BackupPC "repair" its database by re-scanning its pool, or some kind of recommended procedure for fixing problems like this?
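(For what it's worth, the only consistency check I can imagine doing by hand - and this is just a guess at a sanity check, not a documented repair procedure - is looking for pool files that nothing under the pc/ tree links to any more, since per the docs quoted above those should normally only exist between backup expiry and the next BackupPC_nightly run:)

    # __TOPDIR__ path is a placeholder - use whatever your install points at.
    TOPDIR=/var/lib/backuppc
    # Pool files with a single link are no longer referenced by any backup;
    # BackupPC_nightly is what normally finds and removes these.
    find "$TOPDIR/pool" "$TOPDIR/cpool" -type f -links 1 | head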
> As a side note are you letting available space dictate your retention
> policy? It sounds like you don't want to fund the retention policy
> you've specified, otherwise you wouldn't be out of disk space. Buy
> more disk or reduce your retention numbers for backups.

More like, there wasn't a backup or retention policy to begin with D:. I hacked together some scripts that use rdiff-backup and other tools, and added them to the backup server's crontab. And since we have a fairly large backup server (compared to the servers being backed up), I let the older backups build up for a while to take advantage of the space, and then free a chunk of space manually when the scripts email me about space issues. But now I can't "free a chunk of space manually" that easily any more, since du doesn't work :-(. At least, thanks to the discussion in this thread, I have a few more ideas for my own scripts, even if I don't use BackupPC in the end.

> Look at the Host Summary page. Those servers with the largest "Full
> Size" or a disproportionate number of retained fulls/incrementals are
> the hosts to focus pruning efforts on. Now select a candidate and

Ah, thanks. This is very useful info. So you can find which files/transfers/etc caused a given host to use a huge amount of storage.

> Voila', you've put your system on a diet, but beware, you do this
> once and management will expect you to keep solving their
> under-resourced backup infrastructure by doing it again and again.

Well, the good news is that nobody here seems to care about the backups much until the moment they're needed. The fact that we have them at all is kind of a bonus D:. At least I'm starting to get the boss (we're a pretty small company) on my side. It's just that nobody besides me has time to work on things like this.

Anyway, thanks again for the replies. This thread has been educational so far :-)

David.

PS: Random question: does BackupPC have tools for making offsite, offline backups? Like copying a subset of the recent BackupPC backups over to a set of external drives (in encrypted form) and then taking the drives home, or something like that. Or alternatively, are there recommended tools for this? I made a script for this, but I want to see how people here usually handle it.
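(In case it helps to see what I mean, my script does roughly the following - the device names, mount points and paths are made up, and if the source tree were a BackupPC-style hardlink pool you'd also need rsync's -H option, which can get very slow and memory-hungry with millions of links:)

    #!/bin/sh
    # Rough offsite-copy sketch: sync the most recent backups onto a
    # LUKS-encrypted external drive, then unmount it so it can be carried
    # offsite.  All device names and paths here are placeholders.
    set -e
    cryptsetup luksOpen /dev/sdx1 offsite      # drive was LUKS-formatted beforehand
    mount /dev/mapper/offsite /mnt/offsite
    rsync -aH --delete /backups/latest/ /mnt/offsite/latest/   # -H preserves hardlinks
    umount /mnt/offsite
    cryptsetup luksClose offsite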