Hi,
Mark Sopuch wrote on 07.06.2007 at 13:36:55 [Re: [BackupPC-users] Grouping
hosts and pool]:
> Jason M. Kusar wrote:
> > Mark Sopuch wrote:
> >> I'd like to group data (let's just say dept data) from certain hosts
> >> together (or actually to separate some from others) to a different
> >> filesystem and still keep the deduping pool in a common filesystem.
> >> [...]
> > Yes, hard links do not work across filesystems.
> > [...]
> [...] my concerns lie mainly with certain types of
> hosts (data) encroaching quite wildly into the shared allocated space
> under DATA/... thus leaving less room for the incoming data from other
> hosts. It's a space budgeting and control thing. [...] I am not sure how
> any other quota schemes would work to provide similar capability for soft
> and hard quota if they are in the same fs and usernames are not stamped
> around in DATA/... to differentiate such things to those other quota'ing
> systems. Sure I want to back everything up but I do not want the
> bulkiest least important thing blocking a smaller top priority backup
> getting space to write to when there's a mad run of new data.
>
> Hope I am being clear enough. Thanks again.
I believe you are being clear enough, but I doubt you have a clear enough
idea of what you actually want :-).
If multiple hosts/users share the same (de-duplicated) data, which one would
you want it to be accounted to? If a new low-priority backup creates a huge
amount of "new" data (in the sense that it was not in any of its previous
backups) which is, though, already in the pool (from backups of
high-priority hosts), should that backup fail to link to that data, because
it exceeds its quota? This could happen, for example, when a user downloads
the OpenOffice.org or X11 sources that someone else has also previously
downloaded.
What happens when the high-priority backups using the data expire and only
the low-priority backups remain? Should the low-priority user then be over
quota and have his existing backup removed? What happens to incremental
backups based on that backup?
You see, it's not a simple matter to combine de-duplication and quotas. The
whole point of de-duplication is to not use up disk space for the same data
multiple times. The whole point of quotas is to divide up disk space
according to fixed rules. While it may make sense to charge the full cost of
a shared file to each user's quota (counting it several times, so that the
sum of all used quotas can be many times larger than the disk space actually
in use; a 2 GB file shared by five hosts would, for instance, count as 10 GB
of quota while occupying only 2 GB on disk), that is probably not the way any
conventional quota system works, because such a system is concerned with
dividing up real physical disk space.
What you probably want is roughly what 'du $TOPDIR/pc/$hostname' gives you:
take into account de-duplication savings *within* the one host but not those
*between different* hosts. There is no way to achieve this with file
ownerships, because one inode can belong to only one user, regardless of which
link you traverse to access it.
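For a quick check, that amounts to something like this (just a sketch;
$TOPDIR is your BackupPC data directory and 'somehost' is a placeholder):

    # Space charged to this one host: each inode in the host's tree is
    # counted once, but data shared with other hosts is counted here too.
    du -sh "$TOPDIR/pc/somehost"
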
You might be relieved to hear that BackupPC never creates a temporary copy
of data it can find in the pool (so a 1 GB backup does *not* first use up 1 GB
of additional space and then delete duplicate data). You need to be aware,
however, that files not already in the pool are added to the pool by
BackupPC_link, which is run after the backup. That means they are not yet
available for de-duplication during the backup itself. If one backup
includes ten instances of an identical new 1 GB file, it will first store
10 * 1 GB. BackupPC_link will then resolve the duplicates. After
BackupPC_link, only 1 * 1 GB will be used.
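If you are curious how much of one backup is not (yet) hard-linked into the
pool, here is a rough sketch, assuming GNU find/du, the usual
$TOPDIR/pc/<host>/<number> layout, and that unpooled files have a link count
of 1 ('somehost' and '123' are placeholders):

    # Total size of files in this backup tree with no second hard link,
    # i.e. files not shared with the pool (yet).
    find "$TOPDIR/pc/somehost/123" -type f -links 1 -print0 \
        | du -ch --files0-from=- | tail -n 1

Run before BackupPC_link, that shows the temporary extra space used by the
current backup; run afterwards, the figure should be much smaller.
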
The only simple thing I can think of at the moment is to check disk usage with
'du' *after the backup* (after the link phase, if you want to be correct) and
then react to an over-quota situation (a rough sketch follows the list of
problems below).
Problem 1: What do you do? Delete the complete backup? Or try to 'trim' it
down to what would fit? How? Make sure there's no subsequent
backup running yet that is using the data ...
Problem 2: DumpPostUserCmd runs before BackupPC_link, so the calculated disk
usage would not be strictly correct if you do it from there.
Problem 3: The corrective measure is taken after the fact. A high priority
backup may already have been skipped due to DfMaxUsagePct being
exceeded.
Problem 4: I haven't got a large pool where I could test, but I would expect
           'du' to run into the same problems that copying a pool does: a
           huge number of files with more than one hard link and inodes
           spread out all over the disk. Expect it to be slow, and test
           first whether it works at all.
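With those caveats in mind, here is a minimal sketch of such an
after-the-fact check (the path, the way the host name is passed in and the
10 GB limit are all made up for illustration; this is not an existing
BackupPC feature):

    #!/bin/sh
    # Rough post-backup check: warn if one host's pc/ tree exceeds an
    # arbitrary per-host limit. All names and numbers are examples.
    TOPDIR=/var/lib/backuppc            # adjust to your installation
    HOST=$1                             # host name to check
    LIMIT_KB=$((10 * 1024 * 1024))      # example limit: 10 GB

    USED_KB=$(du -sk "$TOPDIR/pc/$HOST" | awk '{ print $1 }')
    if [ "$USED_KB" -gt "$LIMIT_KB" ]; then
        echo "$HOST: ${USED_KB} kB used, limit is ${LIMIT_KB} kB" >&2
        # React here: mail the admin, flag the host, ...
        # Automatic deleting/trimming is deliberately left out (Problem 1).
    fi
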
As a side note, I'm not sure how well BackupPC handles full pool file
systems. If there were no problems at all, DfMaxUsagePct would not be
needed. A quota system would, in fact, deny BackupPC access to more space on
the disk once the quota was reached, just as if the disk had run full. For
this reason, I doubt it is a good idea to use a quota system that is not
implemented within BackupPC itself. You can, of course, extend BackupPC to
include whatever quota
mechanism you want to implement, but I doubt that qualifies as "simple" :-).
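For completeness: checking by hand how full the pool file system is, which
is roughly the figure DfMaxUsagePct is compared against, is trivial (the
mount point is an example):

    # Usage percentage of the file system holding the pool.
    df -P /var/lib/backuppc | awk 'NR == 2 { sub("%", "", $5); print $5 }'
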
If you don't expect much duplicate data between hosts (or groups of hosts),
the best approach is probably really to run independent instances of
BackupPC.
Regards,
Holger