Craig Barratt wrote at about 23:24:21 -0800 on Wednesday, March 2, 2011:

> > By extension, would it be possible to write all the routines in checksum-agnostic fashion so that the choice of MD5 or sha256sum or whatever could be made by the user by just supplying the right perl calling routine or wrapper. This would make it possible for users to use whatever checksum they feel comfortable with -- potentially trading off speed for reliability.
>
> Let me think about this. I want to match the full-file checksum in rsync because it allows some clever things to be done. For example, with some hacks, it would allow a new client file to be matched even the first time if it is in the pool and you use the --checksum option to rsync (this requires some changes to the server-side rsync).

I expanded upon my case for doing this in another email, but let me add some detail here too. First, the one-time cost of computing a SHA-256 or SHA-512 checksum is trivial relative to the cost of reading that file across the network and writing it to disk. In contrast, the cost of having to read, decompress, and byte-compare the files each time they are encountered because of collision potential is significant, and it is the gift that keeps on giving...

Second, while I exclusively use rsync, I think that as long as multiple transfer types are supported, the basic design of BackupPC should not be driven by the fact that rsync now happens to use md5sums (and before that used md4sums, etc.).
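To illustrate the kind of checksum-agnostic hook I have in mind (the $Conf{PoolDigestAlgo} setting and the poolDigest() routine below are made-up names for illustration, not existing BackupPC code), the stock Digest factory module already makes the algorithm a one-line configuration choice:

    #!/usr/bin/perl
    # Minimal sketch of a checksum-agnostic digest wrapper.  $Conf{PoolDigestAlgo}
    # and poolDigest() are hypothetical names, not existing BackupPC config or APIs.
    use strict;
    use warnings;
    use Digest;    # factory module; dispatches to Digest::MD5, Digest::SHA, etc.

    our %Conf = (PoolDigestAlgo => "SHA-256");   # user-selectable: "MD5", "SHA-256", "SHA-512", ...

    sub poolDigest
    {
        my($path) = @_;
        open(my $fh, "<", $path) or die "can't open $path: $!";
        binmode($fh);
        my $d = Digest->new($Conf{PoolDigestAlgo});
        $d->addfile($fh);          # stream the file through whichever digest was chosen
        close($fh);
        return $d->hexdigest;      # e.g., used as (part of) the pool file name
    }

    print poolDigest($ARGV[0]), "\n";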
Regarding your point about the clever things that could be done using rsync checksums, I would suggest that a more rsync-agnostic alternative would be to add a parameter to the attrib file that stores the md5sum whenever the rsync transfer method is used. Though I guess this would not help for the first-time hack.

Alternatively, if you want the first-time hack to work, then you could make the pool file name equal to <md5sum>_<SHA-256sum>, which would still be smaller than a SHA-512sum, and I would wager that we are unlikely ever to start seeing lots of files with simultaneous collisions of the md5 and the SHA-256 checksums. In a sense, the SHA-256 checksum would act like a unique chain suffix, and since it would always be there you would never have to actually decompress and compare the files to see if a chain is necessary. Plus you would then have two essentially independent checksums built into the file name.
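To make the cost argument concrete, here is a minimal sketch (the poolFileName() helper and the <md5sum>_<SHA-256sum> naming are hypothetical, not existing BackupPC code) showing that both digests can be computed in a single pass while the data is being read anyway:

    #!/usr/bin/perl
    # Illustrative sketch only: compute a hypothetical <md5sum>_<SHA-256sum>
    # pool name in one pass over the (uncompressed) file data.
    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    sub poolFileName
    {
        my($path) = @_;
        open(my $fh, "<", $path) or die "can't open $path: $!";
        binmode($fh);
        my $md5 = Digest::MD5->new;
        my $sha = Digest::SHA->new(256);
        while ( read($fh, my $buf, 1 << 20) ) {   # 1MB chunks: read once, feed both digests
            $md5->add($buf);
            $sha->add($buf);
        }
        close($fh);
        return $md5->hexdigest . "_" . $sha->hexdigest;
    }

    print poolFileName($ARGV[0]), "\n";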
> > Will there be a routine to convert legacy 3.x stored backups to 4.x format?
>
> The pool effectively gets migrated as each pool file is needed for 4.x. But the old backups are not.
>
> > In some ways having to keep around all the 3.x hard-links would defeat a lot of the easy copying benefits of 4.x or would require using a separate partition.
>
> It's a good point. However, such a utility would probably have a very long running time. It wouldn't be reversible, and I would be very concerned about cases where it was killed or failed before finishing.

I would be a strong proponent of having such a conversion routine, since I can't imagine wanting a mixed environment where the 3.x data either remains unchanged forever or slowly but never completely migrates over to 4.x as individual pool files get converted. This mixed case would be worse from my perspective than just staying with 3.x.

I don't mind the speed, since as I pointed out in my other posting, the migration can be done in the background as 4.x runs. I would also make a partition backup of my old 3.x system just in case. Regarding the killed-or-failed concern, at most I imagine one file would be corrupted, but this is no worse than the potential for single-file corruption in your scheme, where files individually get migrated as they reappear in the 4.x context. Even that could be avoided with an approach that only removes old 3.x pc tree backups after each individual backup has been fully converted to 4.x. The pool files would only be cleaned up after their nlinks drops to 1 (as with the 3.x BackupPC_nightly).

The proposed approach would look like this:

1. Iterate through each host/backup in reverse numerical order (since deltas in 4.x are reverse-time).

2. Recurse through the shares in each backup and create a new 4.x filled or unfilled backup equivalent to the 3.x backup. This involves:
   - Creating new 4.x directories and attrib files where necessary, based on the filled/unfilled condition and the reverse deltas.
   - Copying (or maybe hard linking) the old 3.x pool files to a new 4.x-named file if the file is not already present in the 4.x pool (as determined by the full-file checksum). [Note: hard linking is probably not a good idea for rsync since the rsync digests won't be the same.]
   - After each individual backup is converted, deleting it from the 3.x pc tree (this works since in 3.x you can always delete the most recent backup, and we are going in reverse numerical order).
   - Optionally, once in a while running the 3.x version of BackupPC_nightly on the 3.x pool to remove orphans as 3.x backups are deleted. Alternatively, waiting until the end and just deleting the entire pool.
   - Optionally, speeding up the process by creating an inode-sequenced lookup pool (analogous to my code in BackupPC_copyPcPool), where each file is named by the inode of a corresponding pool file and contains the md5sum of that pool file. The md5sum (or ideally a better checksum, per my previous post) would then be used to see whether an encountered backup file is already in the 4.x pool and, if not, what it should be named. This lookup pool allows the md5sum to be calculated just once and then looked up as needed by inode number (a rough sketch follows below).
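Roughly, the lookup pool I have in mind would work something like this (the directory layout and routine names are made up for illustration and only loosely modeled on what BackupPC_copyPcPool does; this is not actual BackupPC code):

    #!/usr/bin/perl
    # Rough sketch of an inode-keyed digest cache for the 3.x -> 4.x conversion.
    use strict;
    use warnings;
    use File::Path qw(make_path);
    use Digest::MD5;

    my $LookupDir = "/var/lib/backuppc/inode-lookup";   # hypothetical location

    # Compute and record the digest of a 3.x pool file, keyed by inode number.
    # NOTE: a real conversion would digest the *uncompressed* contents (that is
    # what the 4.x pool name is based on); decompression is elided here.
    sub saveDigestByInode
    {
        my($file) = @_;
        my $inode = (stat($file))[1];
        open(my $in, "<", $file) or die "can't open $file: $!";
        binmode($in);
        my $digest = Digest::MD5->new->addfile($in)->hexdigest;
        close($in);
        make_path($LookupDir);
        open(my $out, ">", "$LookupDir/$inode") or die "can't write $LookupDir/$inode: $!";
        print $out $digest;
        close($out);
        return $digest;
    }

    # During conversion: since every 3.x pc-tree file is a hard link to a pool
    # file, any path whose inode has already been seen gets its digest from the
    # lookup pool instead of being read and checksummed again.
    sub digestForBackupFile
    {
        my($file) = @_;
        my $inode = (stat($file))[1];
        if ( open(my $fh, "<", "$LookupDir/$inode") ) {
            my $digest = <$fh>;
            close($fh);
            return $digest;
        }
        return saveDigestByInode($file);
    }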
Also, I would feel much more comfortable monitoring a single process focused just on this conversion, with defined start and stop points, than an ongoing process that haphazardly and incompletely renames 3.x pool files to 4.x as they are re-encountered.

> I was thinking I would add a utility that lists all the 4.x directories (below $TOPDIR) that should be backed up. Those wouldn't have hardlinks. That could be used with a user-supplied script to backup or replicate all the 4.x backups.

> > Couple of questions:
> >
> > 1. Does that mean that you are limited to 2^14 client hard links?
>
> No. It's 2^14 files, each of which could, in theory subject to memory, hold a large number of entries.

That's what I meant. But what if you have more than 2^14 different files that have hard links? Say, I (stupidly) want to use BackupPC 4.x to back up an old BackupPC 3.x archive, which will surely have more than 2^14 such files.

> > 2. Does the hardlink attrib file contain a list of the full file paths of all the similarly hard-linked files?
>
> No. Just like a regular file system the reverse lookup isn't easy.
>
> > This would be helpful for programs like my BackupPC_deleteFile, since if you want to delete one hard-linked file, it would presumably be necessary to decrement the nlinks count in all the attrib files of the other linked files.
>
> No, that's not necessary. The convention is that if nlinks >= 2 in a file entry, all that means is the real data is in the inode entry (and the nlinks could actually be 1).

Yes, but it gets messy when the last hard link is removed and nlinks is now 1, yet you don't know which file is the last hard-linked one. True, it would still work, but it would look like nlinks > 1 when in reality there is just a single link.

> > Plus in general, it is nice to know what else is hard-linked without having to do an exhaustive search through the entire backup.
> >
> > In fact, I would imagine that the attribYY file would only need to contain such a list and little if any other information, since that is presumably all that is required to restore the hard links, given that the actual inode properties are stored (redundantly) in each of the individual attrib file representations.
>
> True. But you have to replicate and update all the file attributes in that case. So any operation (like chmod etc.) would require all file attributes to be changed.

Good point. But do you ever do that in practice? I think, though, that I need to better understand how you propose handling hard links, since I'm sure I am missing something in the details.

> > Is it always still necessary to compare actual file contents when adding a new file just in case the md5sums collide?
>
> Yes.

Yuck... I still think it is better to compute a strong checksum once, which adds only minimal cpu overhead on top of the rate-limiting file transfer and disk write speeds, than to have to do the SLOW operations of disk reading, decompressing, and file comparison each time. I know I am repeating myself, but I think this is key.

> > If so, would it be better to just use a "better" checksum so that you wouldn't need to worry about collisions and wouldn't have to always do a disk IO and CPU consuming file content comparison to check for the unlikely event of collision?
>
> Good point, but as I mentioned above I would like to take advantage of the fact that it is the same full-file checksum as rsync.

See my above suggestion of using a hybrid file name: <md5sum>_<SHA-256sum>

> > 2. If you are changing the appended rsync digest format for cpool files using rsync, I think it might be helpful to also store the uncompressed filesize in the digest. There are several use cases (including verifying rsync checksums, where the filesize is required to determine the blocksize) where I have needed to decompress the entire file just to find out its size (and since I am in the pool tree I don't have access to the attrib file to know its size).
>
> Currently 4.x won't use the appended checksums. I'll explain how I've implemented rsync on the server side in another email. I could add that later, but it is more complex.

I imagine that knowing the full-file md5sum will save you when files are unchanged, but the absence of the block digests will prevent you from taking advantage of some of the power of the rsync delta functionality (on the other hand, I'm not quite sure how that even works in 3.x, since the block digests presumably won't match if the file size changes, resulting in a different blocksize).

> > Similarly, it might be nice to *always* have the md5sum checksum (or other) appended to the file even when not using the rsync transfer method. This would help with validating file integrity. Even if the md5sum is used for the pool file name, it may be nice to allow for an alternative, more reliable checksum to be stored in the file envelope itself.
>
> Interesting idea.

Of course, this wouldn't be necessary if the file name becomes <md5sum>_<SHA-256sum>. If this convention is used, it might be nice to store the two as separate attributes in the attrib file, since it is easier (i.e. faster) to join the two parts than to use a regex to decompose them.
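For example (the digestMD5/digestSHA256 attribute names below are hypothetical, not an existing attrib format; the values shown are just the well-known digests of an empty file):

    # Hypothetical attrib entry carrying the two digests as separate attributes.
    my $attrib = {
        digestMD5    => "d41d8cd98f00b204e9800998ecf8427e",
        digestSHA256 => "e3b0c44298fc1c149afbf4c8996fb924"
                      . "27ae41e4649b934ca495991b7852b855",
    };

    # Forming the pool file name is a trivial join...
    my $poolName = join("_", $attrib->{digestMD5}, $attrib->{digestSHA256});

    # ...whereas recovering the pieces from the name alone needs a split
    # (cheap, but an extra parsing step you would rather not repeat everywhere).
    my($md5, $sha256) = split(/_/, $poolName, 2);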
> > 4. Do you have any idea whether 4.0 will be more or less resource intensive (e.g., disk IO, network, cpu) than 3.x when it comes to:
> >    - Backing up
> >    - Reconstructing deltas in the web interface (or fuse filesystem)
> >    - Restoring data
>
> I hope it will be a lot more efficient, but I don't have any data yet. There are several areas where efficiency will be much better. For example, with reverse-deltas, a full or incremental backup with no changes shouldn't need any significant disk writes. In contrast, 3.x has to create a directory tree and, in the case of a full, make a complete set of hardlinks.

As per my previous post, my concern is what happens on low-memory machines (my 65MB RAM NAS example) with large poolCnt files.

> Also, rsync on the server side will be based on a native C rsync. I'll send another email about that.

> > Most importantly, 4.0 sounds great and exciting!
>
> I hope to get some time to work on it again. Unfortunately I haven't made any progress in the last 4 months. Work has been very busy.

My vote would be to take your time and get it right -- this is a once-in-a-blue-moon opportunity to make significant changes to the formats and algorithms. 3.x is really solid, and for me it will be worth the wait to have a robust, well-thought-out, and extensible 4.x implementation. Once you release 4.x, it will be much harder to make significant changes, and I would imagine you don't expect to embark on another significant rewrite for a long, long time.