Craig Barratt wrote at about 00:33:13 -0800 on Wednesday, March 2, 2011:
> The next topic is the attribute file format and how backups
> are stored.
Were there previous topics that you posted to the mailing list that I
missed? :P

> In 4.x the only information that appears in a backup tree
> (ie: a single backup stored below $TOPDIR/pc/HOST/NNN) are
> the attribute files, one per directory.
>
> In contrast, a filled backup in 3.x includes a complete set
> of files using hardlinks, in addition to the attribute file.
> Also in 3.x a full directory tree had to be created even
> for incremental backups.
>
> In 4.x it is no longer necessary for a complete backup tree of
> directories to be created. Only sufficient directories to store
> the attribute files necessary for the reverse time deltas are
> needed.

Hopefully, the code will expose building-block subroutines for
reconstructing the current backup from the reverse time deltas, so that
people can create their own extensions for accessing the contents of
specific backups (e.g., like the BackupPC fuse filesystem).

> In 4.x a new attribute file format is used. A new magic number
> identifies the new format. For each file in the directory, the
> 4.x attributes additionally store:
>
> - the file's digest (ie: MD5 full file digest, with possible
>   extension for collisions)

By extension, would it be possible to write all the routines in a
checksum-agnostic fashion, so that the choice of MD5 or sha256sum or
whatever could be made by the user simply by supplying the right perl
routine or wrapper? This would make it possible for users to use
whatever checksum they feel comfortable with -- potentially trading
off speed for reliability.
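For example, something along these lines is what I have in mind -- just
a sketch, of course; %DigestFactory and file_digest are names I made
up, not anything in the 4.x code:

    #!/usr/bin/perl
    # Toy illustration of a pluggable, checksum-agnostic digest layer.
    # The algorithm name would come from a config setting; everything
    # downstream just calls file_digest() and never names MD5 or SHA.
    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    my %DigestFactory = (
        'md5'    => sub { Digest::MD5->new },
        'sha256' => sub { Digest::SHA->new(256) },
    );

    sub file_digest {
        my ($algo, $path) = @_;
        my $make = $DigestFactory{$algo}
            or die "unknown digest algorithm: $algo";
        my $d = $make->();
        open(my $fh, '<', $path) or die "can't open $path: $!";
        binmode($fh);
        $d->addfile($fh);       # stream the whole file through the digest
        close($fh);
        return $d->hexdigest;
    }

    print file_digest('sha256', '/etc/hostname'), "\n";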
> - an inode number (inode)
>
> - an inode link count (nlinks)
>
> - extended attributes (xattr)

1. Would it be possible to allow for simple user extensions that add
   other fields? This would make it easier and more robust to extend,
   should the need or desirability of adding other file attributes
   arise.

2. Does the attrib file also presumably contain a pool file name, so
   that you can determine the pool file location?

> The inode number is generated locally during the backup process
> (ie: it is not related to the inode number on the client).
>
> The 3.x BPC_FTYPE_HARDLINK file type has been eliminated (other
> than to support legacy 3.x backups). Client hardlinks are stored
> in a more faithful manner that allows the inode reference count
> to be accurate, and also allows attribute changes on any of the
> file's hardlink instances to correctly be mirrored among all the
> instances.

Will there be a routine to convert legacy 3.x stored backups to the 4.x
format? In some ways, having to keep around all the 3.x hard links
would defeat a lot of the easy copying benefits of 4.x, or would
require using a separate partition.

> For client hardlinks the inode link count in the attribute file is
> greater than 1. In that case, the inode number (locally generated on
> the server, not the real client inode) is used to access a secondary
> attribute file that stores the real attributes of hardlinked files.
> These attribute files are indexed by inode number. This provides the
> right semantics - multiple hardlinks really do refer to a single file
> and corresponding attributes.
>
> Inode attributes (for client hardlinks) are stored in an additional
> tree of directories below each backup directory. The attribute hash
> key is the inode number. 128 directories are used to store up to 128
> attrib files. The lower 14 bits of the inode number are used to
> determine the directory (bits 13 to 7) and the file name within that
> directory (bits 0-6). These two 7 bit values are encoded in hex
> (XX and YY):
>
>    $TOPDIR/pc/HOST/NNN/inode/XX/attribYY
>
> where NNN is the backup number.

Couple of questions:

1. Does that mean that you are limited to 2^14 client hard links?

2. Does the hardlink attrib file contain a list of the full file paths
   of all the similarly hard-linked files? This would be helpful for
   programs like my BackupPC_deleteFile, since if you want to delete
   one hard-linked file, it would presumably be necessary to decrement
   the nlinks count in the attrib files of all the other linked files.
   Plus, in general, it is nice to know what else is hard-linked
   without having to do an exhaustive search through the entire backup.
   In fact, I would imagine that the attribYY file would only need to
   contain such a list and little if any other information, since that
   is presumably all that is required to restore the hard links; the
   actual inode properties are stored (redundantly) in each of the
   individual attrib file representations.
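Just to check my reading of that layout, here is how I imagine the path
computation would go (my own sketch -- inode_attrib_path is not a real
4.x routine):

    #!/usr/bin/perl
    # Map a locally generated inode number to its attrib file path,
    # per the layout described above (low 14 bits, split 7/7, in hex).
    use strict;
    use warnings;
    use File::Spec;

    sub inode_attrib_path {
        my ($topdir, $host, $bkupnum, $inode) = @_;
        my $dir  = ($inode >> 7) & 0x7f;   # bits 13..7 -> directory XX
        my $file = $inode & 0x7f;          # bits 0..6  -> attribYY
        return File::Spec->catfile($topdir, 'pc', $host, $bkupnum,
                                   'inode',
                                   sprintf('%02x', $dir),
                                   sprintf('attrib%02x', $file));
    }

    # inode 1234 -> dir 0x09, file 0x52
    print inode_attrib_path('/var/lib/backuppc', 'myhost', 123, 1234),
          "\n";  # /var/lib/backuppc/pc/myhost/123/inode/09/attrib52

(If the attribute hash key is the full inode number, then I'd guess
inodes beyond 2^14 simply share attrib files rather than collide --
which is really what question 1 above is asking.)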
Other questions/suggestions:

1. Are the pool files now labeled with the full-file md5sums? If so,
   are you able to get that right out of the rsync checksums (for
   protocol >= 30 or so), or do they need to be computed separately?
   How is the (unlikely) event of an md5sum collision handled? Is it
   always still necessary to compare actual file contents when adding
   a new file, just in case the md5sums collide? If so, would it be
   better to just use a "better" checksum, so that you wouldn't need
   to worry about collisions and wouldn't always have to do a disk-IO-
   and CPU-consuming file content comparison to check for the unlikely
   event of a collision?

2. If you are changing the appended rsync digest format for cpool
   files, I think it might be helpful to also store the uncompressed
   filesize in the digest. There are several use cases (including
   verifying rsync checksums, where the filesize is required to
   determine the blocksize) where I have needed to decompress the
   entire file just to find out its size (and since I am in the pool
   tree, I don't have access to the attrib file to know its size).
   Similarly, it might be nice to *always* have the md5sum checksum
   (or other) appended to the file, even when not using the rsync
   transfer method. This would help with validating file integrity.
   Even if the md5sum is used for the pool file name, it may be nice
   to allow for an alternative, more reliable checksum to be stored in
   the file envelope itself.

3. In the absence of pool hard links, how do you know when a pool file
   can be deleted? Is there some table where the counts are
   incremented/decremented as new backups are added or deleted? (See
   the P.S. at the end of this message for the kind of thing I mean.)

4. Do you have any idea whether 4.0 will be more or less resource
   intensive (e.g., disk IO, network, cpu) than 3.x when it comes to:
   - Backing up
   - Reconstructing deltas in the web interface (or fuse filesystem)
   - Restoring data

Most importantly, 4.0 sounds great and exciting!
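P.S. To make question 3 above concrete, this is the kind of count table
I was imagining -- an in-memory toy only; presumably a real
implementation would persist the counts and update them as backups are
added, expired, or deleted (the names here are mine, not the 4.x
code's):

    #!/usr/bin/perl
    # Toy pool reference counting: one count per full-file digest,
    # bumped on every add and dropped on every delete; a pool file
    # becomes deletable once its count reaches zero.
    use strict;
    use warnings;

    my %poolRefCnt;   # digest -> number of references from backups

    sub pool_ref {
        my ($digest, $delta) = @_;   # +1 on add, -1 on delete
        $poolRefCnt{$digest} += $delta;
        if ($poolRefCnt{$digest} <= 0) {
            delete $poolRefCnt{$digest};
            print "pool file $digest is now deletable\n";
        }
    }

    pool_ref('d41d8cd98f00b204e9800998ecf8427e', +1);  # file added
    pool_ref('d41d8cd98f00b204e9800998ecf8427e', -1);  # backup removed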