Craig Barratt wrote at about 00:33:13 -0800 on Wednesday, March 2, 2011:
> The next topic is the attribute file format and how backups
> are stored.
Were there previous topics that you posted to the mailing list that I
missed? :P

> In 4.x the only information that appears in a backup tree
> (ie: a single backup stored below $TOPDIR/pc/HOST/NNN) are
> the attribute files, one per directory.
>
> In contrast, a filled backup in 3.x includes a complete set
> of files using hardlinks, in addition to the attribute file.
> Also in 3.x a full directory tree had to be created even
> for incremental backups.
>
> In 4.x it is no longer necessary for a complete backup tree of
> directories to be created. Only sufficient directories to store
> the attribute files necessary for the reverse time deltas are
> needed.

Hopefully, the code will expose building-block subroutines for
reconstructing the current backup from the reverse time deltas, so that
people can create their own extensions for accessing the contents of
specific backups (e.g., like the BackupPC fuse filesystem).

> In 4.x a new attribute file format is used. A new magic number
> identifies the new format. For each file in the directory, the
> 4.x attributes additionally store:
>
> - the file's digest (ie: MD5 full file digest, with possible
>   extension for collisions)

By extension, would it be possible to write all the routines in a
checksum-agnostic fashion, so that the choice of MD5 or sha256sum or
whatever could be made by the user simply by supplying the right perl
routine or wrapper? This would make it possible for users to use
whatever checksum they feel comfortable with -- potentially trading
off speed for reliability.
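For example, something along these lines is what I have in mind -- just
a sketch, of course; %DigestFactory and file_digest are names I made
up, not anything in the 4.x code:

    #!/usr/bin/perl
    # Toy illustration of a pluggable, checksum-agnostic digest layer.
    # The algorithm name would come from a config setting; everything
    # downstream just calls file_digest() and never names MD5 or SHA.
    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    my %DigestFactory = (
        'md5'    => sub { Digest::MD5->new },
        'sha256' => sub { Digest::SHA->new(256) },
    );

    sub file_digest {
        my ($algo, $path) = @_;
        my $make = $DigestFactory{$algo}
            or die "unknown digest algorithm: $algo";
        my $d = $make->();
        open(my $fh, '<', $path) or die "can't open $path: $!";
        binmode($fh);
        $d->addfile($fh);       # stream the whole file through the digest
        close($fh);
        return $d->hexdigest;
    }

    print file_digest('sha256', '/etc/hostname'), "\n";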
> - an inode number (inode)
>
> - an inode link count (nlinks)
>
> - extended attributes (xattr)

1. Would it be possible to allow for simple user extensions that add
   other fields? This would make it easier and more robust to extend,
   should the need or desirability of adding other file attributes
   arise.

2. Does the attrib file also presumably contain a pool file name, so
   that you can determine the pool file location?

> The inode number is generated locally during the backup process
> (ie: it is not related to the inode number on the client).
>
> The 3.x BPC_FTYPE_HARDLINK file type has been eliminated (other
> than to support legacy 3.x backups). Client hardlinks are stored
> in a more faithful manner that allows the inode reference count
> to be accurate, and also allows attribute changes on any of the
> file's hardlink instances to correctly be mirrored among all the
> instances.

Will there be a routine to convert legacy 3.x stored backups to the 4.x
format? In some ways, having to keep around all the 3.x hard links
would defeat a lot of the easy copying benefits of 4.x, or would
require using a separate partition.

> For client hardlinks the inode link count in the attribute file is
> greater than 1. In that case, the inode number (locally generated on
> the server, not the real client inode) is used to access a secondary
> attribute file that stores the real attributes of hardlinked files.
> These attribute files are indexed by inode number. This provides the
> right semantics - multiple hardlinks really do refer to a single file
> and corresponding attributes.
>
> Inode attributes (for client hardlinks) are stored in an additional
> tree of directories below each backup directory. The attribute hash
> key is the inode number. 128 directories are used to store up to 128
> attrib files. The lower 14 bits of the inode number are used to
> determine the directory (bits 13 to 7) and the file name within that
> directory (bits 0-6). These two 7 bit values are encoded in hex
> (XX and YY):
>
>    $TOPDIR/pc/HOST/NNN/inode/XX/attribYY
>
> where NNN is the backup number.

Couple of questions:

1. Does that mean that you are limited to 2^14 client hard links?

2. Does the hardlink attrib file contain a list of the full file paths
   of all the similarly hard-linked files? This would be helpful for
   programs like my BackupPC_deleteFile, since if you want to delete
   one hard-linked file, it would presumably be necessary to decrement
   the nlinks count in the attrib files of all the other linked files.
   Plus, in general, it is nice to know what else is hard-linked
   without having to do an exhaustive search through the entire backup.
   In fact, I would imagine that the attribYY file would only need to
   contain such a list and little if any other information, since that
   is presumably all that is required to restore the hard links; the
   actual inode properties are stored (redundantly) in each of the
   individual attrib file representations.
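Just to check my reading of that layout, here is how I imagine the path
computation would go (my own sketch -- inode_attrib_path is not a real
4.x routine):

    #!/usr/bin/perl
    # Map a locally generated inode number to its attrib file path,
    # per the layout described above (low 14 bits, split 7/7, in hex).
    use strict;
    use warnings;
    use File::Spec;

    sub inode_attrib_path {
        my ($topdir, $host, $bkupnum, $inode) = @_;
        my $dir  = ($inode >> 7) & 0x7f;   # bits 13..7 -> directory XX
        my $file = $inode & 0x7f;          # bits 0..6  -> attribYY
        return File::Spec->catfile($topdir, 'pc', $host, $bkupnum,
                                   'inode',
                                   sprintf('%02x', $dir),
                                   sprintf('attrib%02x', $file));
    }

    # inode 1234 -> dir 0x09, file 0x52
    print inode_attrib_path('/var/lib/backuppc', 'myhost', 123, 1234),
          "\n";  # /var/lib/backuppc/pc/myhost/123/inode/09/attrib52

(If the attribute hash key is the full inode number, then I'd guess
inodes beyond 2^14 simply share attrib files rather than collide --
which is really what question 1 above is asking.)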
Other questions/suggestions:

1. Are the pool files now labeled with the full-file md5sums? If so,
   are you able to get that right out of the rsync checksums (for
   protocol >= 30 or so), or do they need to be computed separately?
   How is the (unlikely) event of an md5sum collision handled? Is it
   always still necessary to compare actual file contents when adding
   a new file, just in case the md5sums collide? If so, would it be
   better to just use a "better" checksum, so that you wouldn't need
   to worry about collisions and wouldn't always have to do a disk-IO-
   and CPU-consuming file content comparison to check for the unlikely
   event of a collision?

2. If you are changing the appended rsync digest format for cpool
   files, I think it might be helpful to also store the uncompressed
   filesize in the digest. There are several use cases (including
   verifying rsync checksums, where the filesize is required to
   determine the blocksize) where I have needed to decompress the
   entire file just to find out its size (and since I am in the pool
   tree, I don't have access to the attrib file to know its size).
   Similarly, it might be nice to *always* have the md5sum checksum
   (or other) appended to the file, even when not using the rsync
   transfer method. This would help with validating file integrity.
   Even if the md5sum is used for the pool file name, it may be nice
   to allow for an alternative, more reliable checksum to be stored in
   the file envelope itself.

3. In the absence of pool hard links, how do you know when a pool file
   can be deleted? Is there some table where the counts are
   incremented/decremented as new backups are added or deleted? (See
   the P.S. at the end of this message for the kind of thing I mean.)

4. Do you have any idea whether 4.0 will be more or less resource
   intensive (e.g., disk IO, network, cpu) than 3.x when it comes to:
   - Backing up
   - Reconstructing deltas in the web interface (or fuse filesystem)
   - Restoring data

Most importantly, 4.0 sounds great and exciting!
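P.S. To make question 3 above concrete, this is the kind of count table
I was imagining -- an in-memory toy only; presumably a real
implementation would persist the counts and update them as backups are
added, expired, or deleted (the names here are mine, not the 4.x
code's):

    #!/usr/bin/perl
    # Toy pool reference counting: one count per full-file digest,
    # bumped on every add and dropped on every delete; a pool file
    # becomes deletable once its count reaches zero.
    use strict;
    use warnings;

    my %poolRefCnt;   # digest -> number of references from backups

    sub pool_ref {
        my ($digest, $delta) = @_;   # +1 on add, -1 on delete
        $poolRefCnt{$digest} += $delta;
        if ($poolRefCnt{$digest} <= 0) {
            delete $poolRefCnt{$digest};
            print "pool file $digest is now deletable\n";
        }
    }

    pool_ref('d41d8cd98f00b204e9800998ecf8427e', +1);  # file added
    pool_ref('d41d8cd98f00b204e9800998ecf8427e', -1);  # backup removed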