Jeffrey,

> Were there previous topics that you posted to the mailing list that I
> missed? :P

They might have arrived out of order.

> Hopefully, the code will expose building-block subroutines for
> reconstructing the current backup from the reverse time deltas so that
> people can create their own extensions for accessing the contents of
> specific backups (e.g., like the backuppc fuse file system).

Yes, it's pretty easy to reconstruct a backup view.
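
Roughly, the idea (just a sketch with hypothetical helper names --
backups_from, read_attrib and backup_is_filled are not the real 4.x
API) is to merge a backup's attrib entries with those of the newer
backups until a filled backup is reached:

    # Sketch only: view a directory as of backup $num by merging reverse
    # deltas forward in time.  backups_from(), read_attrib() and
    # backup_is_filled() are hypothetical helpers, not the real 4.x API.
    sub view_dir {
        my ($host, $num, $dir) = @_;
        my %view;
        for my $bk (backups_from($host, $num)) {      # $num, $num+1, ...
            my $attrib = read_attrib($host, $bk, $dir);
            for my $name (keys %$attrib) {
                # the entry closest to $num wins; newer backups only fill gaps
                $view{$name} = $attrib->{$name} unless exists $view{$name};
            }
            last if backup_is_filled($host, $bk);     # filled = complete view
        }
        # delete-type delta entries mean the file didn't exist in backup $num
        delete @view{ grep { $view{$_}{type} eq 'delete' } keys %view };
        return \%view;
    }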

> By extension, would it be possible to write all the routines in a
> checksum-agnostic fashion, so that the choice of MD5 or sha256sum or
> whatever could be made by the user just by supplying the right perl
> calling routine or wrapper? This would make it possible for users to
> use whatever checksum they feel comfortable with -- potentially
> trading off speed for reliability.

Let me think about this.  I want to match the full-file checksum
in rsync because it allows some clever things to be done.  For
example, with some hacks, it would allow a new client file to be
matched against the pool even on its very first backup, if the file
is already in the pool and you use the --checksum option to rsync
(this requires some changes to the server-side rsync).
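
(For reference, that full-file checksum is just the MD5 of the entire
uncompressed file contents -- the same value recent rsync protocol
versions use for --checksum.  A minimal way to compute it in Perl:)

    # Compute the whole-file MD5 digest, the same value rsync's --checksum
    # uses with recent protocol versions.
    use Digest::MD5;

    sub file_md5_digest {
        my ($path) = @_;
        open(my $fh, '<', $path) or die "can't read $path: $!";
        binmode($fh);
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close($fh);
        return $digest;      # 32 hex characters
    }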

> 1. Would it be possible to allow for simple user-extensions to add other
>    fields. This would make it easier and more robust to extend should the
>    need or desirability of adding other file attributes arise. 

Good idea.  Let me think about that.

> 2. Does the attrib file also presumably contain a pool file name so
>    that you can determine the pool file location?

The digest directly allows the pool file name to be constructed; the
name also includes an extension in case of collisions.
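
For illustration only (the exact 4.x layout may differ from this
sketch), the hex digest can be split into directory levels, with a
numeric extension appended when two different files collide on the
same digest:

    # Sketch only -- the real 4.x pool layout may differ.  The first two
    # bytes of the hex digest pick two directory levels, and a collision
    # gets an extension appended to the file name.
    sub digest_to_pool_path {
        my ($topdir, $hexdigest, $ext) = @_;     # $ext > 0 only on collision
        my $d1   = substr($hexdigest, 0, 2);
        my $d2   = substr($hexdigest, 2, 2);
        my $name = $ext ? sprintf("%s_%d", $hexdigest, $ext) : $hexdigest;
        return "$topdir/pool/$d1/$d2/$name";
    }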

> Will there be a routine to convert legacy 3.x stored backups to 4.x
> format?

The pool effectively gets migrated as each pool file is needed by 4.x,
but the old backups themselves are not converted.

> In some ways having to keep around all the 3.x hard-links would defeat
> a lot of the easy copying benefits of 4.x or would require using a
> separate partition. 

It's a good point.  However, such a utility would probably have a very
long running time.  It wouldn't be reversible, and I would be very
concerned about cases where it was killed or failed before finishing.

I was thinking I would add a utility that lists all the 4.x directories
(below $TOPDIR) that should be backed up.  Those wouldn't have hardlinks.
That could be used with a user-supplied script to back up or replicate
all the 4.x backups.

> Couple of questions:

> 1. Does that mean that you are limited to 2^14 client hard links?

No.  It's 2^14 files, each of which could, in theory (subject to
memory), hold a large number of entries.
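
One way to picture it (sketch only; the real layout may differ): the
inode number just selects which of the 2^14 files to look in, and each
of those files holds any number of inode entries:

    # Sketch only, not the real 4.x layout: the inode number selects one of
    # the 2^14 inode-attrib files; each file can hold many entries, so the
    # number of hard-linked files is not limited to 2^14.
    sub inode_bucket {
        my ($inode) = @_;
        return $inode % (2 ** 14);    # 0 .. 16383
    }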

> 2. Does the hardlink attrib file contain a list of the full file paths
>    of all the similarly hard-linked files?

No.  Just like a regular file system, the reverse lookup isn't easy.

>    This would be helpful for programs like my BackupPC_deleteFile
>    since if you want to delete one hard linked file, it would be
>    necessary presumably to decrement the nlinks count in all the
>    attrib files of the other linked files.

No, that's not necessary.  The convention is that if nlinks >= 2
in a file entry, all that means is that the real data is in the inode
entry (and the true link count could actually be 1).
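
In other words (hypothetical field and helper names, just to show the
convention):

    # Sketch of the convention: nlinks >= 2 in a directory entry only means
    # "the authoritative attributes live in the inode entry".
    # lookup_inode_entry() and the field names are hypothetical.
    sub real_attributes {
        my ($dir_entry) = @_;
        return lookup_inode_entry($dir_entry->{inode})
            if $dir_entry->{nlinks} && $dir_entry->{nlinks} >= 2;
        return $dir_entry;
    }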

>    Plus in general, it is nice
>    to know what else is hard-linked without having to do an exhaustive
>    search through the entire backup.
> 
>    In fact, I would imagine that the attribYY file would only need to
>    contain such a list and little if any other information, since that
>    is all that is required presumably to restore the hard-links since
>    the actual inode properties are stored (redundantly) in each of the
>    individual attrib file representations.

True.  But then you would have to replicate and update all the file
attributes in that case.  So any operation (like chmod, etc.) would
require every copy of the file attributes to be changed.

> Other questions/suggestions:
> 1. Are the pool files now labeled with the full file md5sums?
>    If so, are you able to get that right out of the rsync checksums
>    (for protocol >=30 or so) or do they need to be computed
>    separately?

The path name of the pool file gives the md5 sum.

>    How is the (unlikely) event of an md5sum collision handled?

Chains are still used for collisions.  They are obviously unlikely
in typical use, but I have been testing it with the now well-known
files that have collisions.

>    Is it always still necessary to compare actual file contents when
>    adding a new file just in case the md5sums collide?

Yes.

>    If so, would it be better to just use a "better" checksum so that
>    you wouldn't need to worry about collisions and wouldn't have to
>    always do a disk IO and CPU consuming file content comparison to
>    check for the unlikely event of collision?

Good point, but as I mentioned above I would like to take advantage
of the fact that it is the same full-file checksum as rsync.

> 2. If you are changing the appended rsync digest format for cpool
>    files using rsync, I think it might be helpful to also store the
>    uncompressed filesize in the digest. There are several use cases
>    (including verifying rsync checksums where the filesize is required
>    to determine the blocksize) where I have needed to decompress the
>    entire file just to find out its size (and since I am in the pool
>    tree I don't have access to the attrib file to know its size).

Currently 4.x won't use the appended checksums.  I'll explain how
I've implemented rsync on the server side in another email.  I could
add that later, but it is more complex.

>    Similarly, it might be nice to *always* have the md5sum checksum
>    (or other) appended to the file even when not using the rsync
>    transfer method. This would help with validating file
>    integrity. Even if the md5sum is used for the pool file name, it
>    may be nice to allow for an alternative more reliable checksum to
>    be stored in the file envelope itself.

Interesting idea.

> 3. In the absence of pool hard links, how do you know when a pool file
>    can be deleted? Is there some table where the counts are
>    incremented/decremented as new backups are added or deleted?

Yes - see my other emails (assuming I sent them all out).
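
(The details are in those emails, but purely as an illustration of the
idea -- not the 4.x implementation -- a digest-keyed reference count
behaves like this:)

    # Illustration only, not the 4.x implementation: per-pool-file reference
    # counts keyed by digest, adjusted as backups are added or deleted.
    my %refCnt;

    sub backup_added   { $refCnt{$_}++ for @_ }   # args: digests in the backup
    sub backup_deleted { $refCnt{$_}-- for @_ }
    sub deletable_pool_files {
        return grep { $refCnt{$_} <= 0 } keys %refCnt;
    }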

> 4. Do you have any idea whether 4.0 will be more or less resource
>    intensive (e.g., disk IO, network, cpu) than 3.x when it comes to:
>       - Backing up
>       - Reconstructing deltas in the web interface (or fuse filesystem)
>       - Restoring data

I hope it will be a lot more efficient, but I don't have any data
yet.  There are several areas where efficiency will be much better.
For example, with reverse-deltas, a full or incremental backup with
no changes shouldn't need any significant disk writes.  In contrast,
3.x has to create a directory tree and, in the case of a full, make
a complete set of hardlinks.

Also, rsync on the server side will be based on a native C rsync.
I'll send another email about that.

> Most importantly, 4.0 sounds great and exciting!

I hope to get some time to work on it again.  Unfortunately I haven't
made any progress in the last 4 months.  Work has been very busy.

Craig
