Unfortunately, every backup option has some limitations or imperfections,
and hardlinks have their pros and cons like anything else.  There are
really only a few ways of doing incremental managed backups: hardlinks,
diff files, diff file lists, and SQL.  Hardlinks are nice because they are
inexpensive: browsing the directory contents of a hardlink-based backup
carries no extra overhead, since each generation looks like a complete
tree.  Diff files and diff file lists (the former stores only the changes
within each individual file, the latter stores full copies of only the
files that changed) require an algorithm to recurse into the other
directories that hold the real data and overlay each backup on the
previous one.
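
To make the hardlink approach concrete, here is a rough Python sketch of
one snapshot pass (not BackupPC's actual code; the unchanged-file check on
size and mtime is a simplification, and all paths are made up):

# Rough sketch of a hardlink-based incremental snapshot.  Not BackupPC's
# implementation; the size/mtime comparison is a simplification.
import os
import shutil

def snapshot(source, prev_snap, new_snap):
    """Copy source into new_snap, hardlinking unchanged files from prev_snap."""
    for root, dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        dest_dir = os.path.join(new_snap, rel)
        os.makedirs(dest_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            prev = os.path.join(prev_snap, rel, name)
            dest = os.path.join(dest_dir, name)
            st = os.stat(src)
            # If the file looks unchanged, hardlink it into the new
            # snapshot: browsing stays cheap, but every generation adds
            # another link (and another directory entry) per file.
            if os.path.exists(prev):
                pst = os.stat(prev)
                if pst.st_size == st.st_size and pst.st_mtime == st.st_mtime:
                    os.link(prev, dest)
                    continue
            shutil.copy2(src, dest)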

The only option that would really be more efficient than hardlinks is
storing files in SQL, along with an MD5 of each one, and then linking the
rows in SQL.  That is very similar to a hardlink, except the link is just
a row pointer.  It would be many times faster than creating hardlinks in a
filesystem, because SQL selects over a hierarchy indexed on the
significant data.  It would be like BackupPC only having one host with one
backup on it when you are looking at the web interface: all the other
hosts and backups are already excluded from the query.
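
Roughly, the row-pointer idea would look something like this (the schema,
table and column names here are invented purely for illustration, they are
not anything BackupPC ships):

# Hypothetical schema for the "row pointer" idea; every lookup is scoped
# by host and backup number, so other hosts/generations never get touched.
import sqlite3

con = sqlite3.connect("backups.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS content (
    md5  TEXT PRIMARY KEY,   -- one row per unique file body
    size INTEGER
);
CREATE TABLE IF NOT EXISTS files (
    host   TEXT,
    backup INTEGER,
    path   TEXT,
    md5    TEXT REFERENCES content(md5)   -- the "row pointer"
);
CREATE INDEX IF NOT EXISTS files_scope ON files(host, backup);
""")

# Listing one backup of one host only reads the rows for that host and
# backup, no matter how many other hosts or generations are stored.
rows = con.execute(
    "SELECT path, md5 FROM files WHERE host = ? AND backup = ?",
    ("client1", 42)).fetchall()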

SQL file storage for BackupPC has been discussed extensively on this
list, and suffice it to say that opinions are very split, for good reason.
SQL (MySQL specifically, but this applies to all of them) is much better
at some tasks than a traditional filesystem (searching for data is orders
of magnitude faster), but a filesystem is much better at simply storing
files.  A hybrid could take the pros of each: store all of the pointer
data in MySQL and store the actual files under their MD5 names on a
filesystem.  Simply MD5 a file, push the MD5 off to MySQL along with the
host, backup date, file path, and file name, and write the file to the
filesystem.  An incremental backup would MD5 a file and search the
database for that MD5; if found, write a pointer to the existing entry,
and if not, write a new entry with the file's MD5, the hostname, the file
path and name, and the backup number (or date).  All file contents would
just be stored under their MD5 names.  Recovering files would be less
transparent, but would only require an SQL query to pull the list of files
for a given hostname and backup number and then pull those files, renamed,
into a zip or tar file.
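
A very rough sketch of that flow, reusing the made-up schema above (the
pool path and function names are mine, and it reads whole files into
memory just to keep the example short):

# Hybrid sketch: pointer data in SQL, file bodies stored once on the
# filesystem under their MD5 names.  Paths and names are hypothetical.
import hashlib
import os
import shutil
import sqlite3
import tarfile

POOL = "/backup/pool"                       # bodies stored as MD5 names
con = sqlite3.connect("/backup/index.db")   # tables as in the sketch above

def store(host, backup, path):
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    pooled = os.path.join(POOL, digest)
    # Write the body only if this MD5 has never been seen before.
    if con.execute("SELECT 1 FROM content WHERE md5 = ?",
                   (digest,)).fetchone() is None:
        con.execute("INSERT INTO content(md5, size) VALUES (?, ?)",
                    (digest, os.path.getsize(path)))
        shutil.copyfile(path, pooled)
    # Either way, record the pointer row for this host/backup/path.
    con.execute("INSERT INTO files(host, backup, path, md5) VALUES (?, ?, ?, ?)",
                (host, backup, path, digest))
    con.commit()

def restore(host, backup, archive):
    # Pull the file list for one host/backup and repack it, renamed, as a
    # tar file.
    with tarfile.open(archive, "w:gz") as tar:
        for path, digest in con.execute(
                "SELECT path, md5 FROM files WHERE host = ? AND backup = ?",
                (host, backup)):
            tar.add(os.path.join(POOL, digest), arcname=path.lstrip("/"))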



On Mon, Aug 17, 2009 at 5:52 AM, David <wizza...@gmail.com> wrote:

> Hi there.
>
> Firstly, this isn't a backuppc-specific question, but it is of
> relevance to backup-pc users (due to backuppc architecture), so there
> might be people here with insight on the subject (or maybe someone can
> point me to a more relevant project or mailing list).
>
> My problem is as follows... with backup systems based on complete
> hardlink-based snapshots, you often end up with a large number of
> hardlinks. eg, at least one per server file, per backup generation.
>
> Now, this is fine most of the time... but there is a problem case that
> comes up because of this.
>
> If the servers you're backing up, themselves have a huge number of
> files (like, hundreds of thousands or millions even), that means that
> you end up making a huge number of hardlinks on your backup server,
> for each backup generation.
>
> Although inefficient in some ways (using up a large number of inode
> entries in the filesystem tables), this can work pretty nicely.
>
> Where the real problem comes in, is if admins want to use 'updatedb',
> or 'du' on the linux system. updatedb gets a *huge* database and uses
> up tonnes of cpu & ram  (so, I usually disable it). And 'du' can take
> days to run, and make multi-gb files.
>
> Here's a question for backuppc users (and people who use hardlink
> snapshot-based backups in general)... when your backup server, that
> has millions of hardlinks on it, is running low on space, how do you
> correct this?
>
> The most obvious thing is to find which host's backups are taking up
> the most space, and then remove some of the older generations.
>
> Normally the simplest method to do this, is to run a tool like 'du',
> and then perhaps view the output in xdiskusage. (One interesting thing
> about 'du', is that it's clever about hardlinks, so doesn't count the
> disk usage twice. I think it must keep a table in memory of visited
> inodes, which had a link count of 2 or greater).
>
> However, with a gazillion hardlinks, du takes forever to run, and has
> a massive output. In my case, about 3-4 days, and about 4-5 GB output
> file.
>
> My current setup is a basic hardlink snapshot-based backup scheme, but
> backuppc (due to its pool structure, where hosts have generations of
> hardlink snapshot dirs) would have the same problems.
>
> How do people solve the above problem?
>
> (I also imagine that running "du" to check disk usage of backuppc data
> is also complicated by the backuppc pool, but at least you can exclude
> the pool from the "du" scan to get more usable results).
>
> My current fix is an ugly hack, where I go through my snapshot backup
> generations (from oldest to newest), and remove all redundant hard
> links (ie, they point to the same inodes as the same hardlink in the
> next-most-recent generation). Then that info goes into a compressed
> text file that could be restored from later. And after that, compare
> the next 2-most-recent generations and so on.
>
> But yeah, that's a very ugly hack... I want to do it better and not
> re-invent the wheel. I'm sure this kind of problem has been solved
> before.
>
> fwiw, I was using rdiff-backup before. It's very du-friendly, since
> only the differences between each backup generation are stored (rather
> than a large number of hardlinks). But I had to stop using it, because
> with servers with a huge number of files it uses up a huge amount of
> memory + cpu, and takes a really long time. And the mailing list
> wasn't very helpful with trying to fix this, so I had to change to
> something new so that I could keep running backups (with history).
> That's when I changed over to a hardlink snapshots approach, but that
> has other problems, detailed above. And my current hack (removing all
> redundant hardlinks and empty dir structures) is kind of similar to
> rdiff-backup, but coming from another direction.
>
> Thanks in advance for ideas and advice.
>
> David.
>
>