All in all it sounds AWESOME - can't wait to see it! My inline comments below...

Craig Barratt wrote at about 15:01:04 -0800 on Monday, March 1, 2010:

> Jeffrey,
>
> Thanks for the suggestions. I've since decided to eliminate
> hardlinks altogether in 4.x. This is an aggressive design step,
> but if successful it will resolve many of the annoying issues
> with the current architecture. (To be clear, hardlinks are
> still needed in certain cases, since they provide an atomic
> way of implementing certain file system operations. But they
> will no longer be used for permanent storage.)
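As an aside on why hardlinks stay useful for those transient atomic operations - here is a rough sketch (illustration only, not BackupPC's actual code; the function name is invented) of the race-free "insert into pool if absent" primitive that link(2) gives you, since it either creates the new name or fails with EEXIST, never half-succeeds:

```python
# Illustration only: link(2) is atomic, so it can arbitrate
# concurrent pool adds without any locking.
import errno
import os

def atomic_pool_add(temp_file, pool_path):
    """Try to publish temp_file under pool_path; exactly one writer wins."""
    try:
        os.link(temp_file, pool_path)   # atomic: succeeds or raises EEXIST
        return True                     # we created the pool entry
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False                # another writer got there first
        raise
    finally:
        os.unlink(temp_file)            # the temporary name is no longer needed
```

Whichever writer loses the race simply discards its temporary copy and reuses the existing pool entry.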
I'm sure that wasn't an easy decision, but given that probably more than half the recent posts here involve issues related to hard links, I think it is the right long-term decision. Hopefully, it will also broaden the usability of BackupPC to filesystems and OSes that don't support hard links, since it seems like you will be eliminating just about all the filesystem-specific requirements.

> The plan is to move to full-file MD5 digests, as we previously
> discussed. The attrib file will include extended attributes and the
> file digest (extended if necessary with the chain number). So the pc
> backup tree will just contain attrib files. Each file entry in the
> attrib file is its digest (plus chain number) which points to the real
> file in the pool.

A couple of thoughts/suggestions/questions:

Hopefully you can code this in an abstracted way so that other digests could be substituted for MD5 down the road if the need exists, e.g., the various SHA digests. This would make it more robust and extensible, and would address the concerns of people who think the 1-in-10^40-plus chance of a collision is too large with MD5.

Also, hopefully your handling of extended attributes will be abstracted too, so that a broader notion of filesystem attributes and properties could be supported later. For example, native NTFS has a rich set of ACLs that is broader even than what the current cygwin/rsync ACL implementation supports, which appears to cover only the POSIX notion of ACLs. Ideally, there would be a set of constructors allowing the definition and backup of such attributes, along with constructors for interfacing with the different transport methods.

> Chain renumbering has been eliminated, and I am
> planning to eliminate BackupPC_link (by solving the race conditions
> to allow pool adds to occur during backups).
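To sketch what the digest abstraction suggested above might look like (all names here - PoolDigest, the "algorithm" parameter - are invented for illustration, not anything in BackupPC):

```python
# Hypothetical sketch: hide the digest choice behind one configuration
# knob, so MD5 could later be swapped for a SHA-2 variant without
# touching the pool logic.
import hashlib

class PoolDigest:
    def __init__(self, algorithm="md5"):
        # hashlib.new() accepts any algorithm the runtime knows about,
        # e.g. "md5", "sha256", "sha512".
        self.algorithm = algorithm

    def file_digest(self, path, chunk_size=1 << 20):
        """Full-file digest, streamed so large files don't fill memory."""
        h = hashlib.new(self.algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

# Moving the pool to a stronger hash would then be a one-line change:
digest = PoolDigest("sha512")
```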
Good idea - any thought about getting rid of the notion of chains entirely by using an even stronger hash function such as SHA-512? The chance of a collision would then be so vanishingly small that you would be far more likely to hit other types of failure. And given current processors, the cost of computing hashes is a lot less than that of disk reads/writes and network bandwidth.

> That means the pool file format does not need to be changed. Whether
> you access the files via the pool or the pc tree you know the digest
> (either from the file name in the pool or the attrib file in the
> pc tree).

Ahhhh - so are you saying that the partial-file md5sums (plus chain number) will still be used for naming pool entries, rather than the full md5sum (or an alternative hash)? If you are worried about preserving the old pool, you could of course create a parallel new pool, named by a new full-file hash scheme and hard-linked to the existing pool files. This wouldn't require any additional storage - it would just add a single hard link per file. Then the old pool could be expired as the old backups expire (or are converted).

> Backups will be stored as reverse deltas, so only the most recent is
> complete, and all the prior backups are just the deltas to re-create
> the prior backups. The prior backups will no longer need to have
> complete directory trees - they will only be deep enough to represent
> the necessary changes from the next more recent backup. That means
> the storage will be decoupled from whether the backup itself is full
> or incremental. And all new backups will be relative to the most
> recent (ie: IncrLevels will disappear). There are several advantages
> here, mainly around efficiency since the most recent backup is the
> one that is used most often (for new backups or restores). Plus
> the most recent backup will be modified in place, rather than being
> rewritten every time with hardlinks. That should improve performance
> too.
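If I'm picturing the reverse-delta scheme right, at the whole-file level it amounts to something like this toy sketch (names and the DELETED marker are invented; the real on-disk format will surely differ):

```python
# Toy illustration of reverse deltas: only the newest backup is complete;
# each older backup stores just the entries that differed from the backup
# after it. A DELETED marker records files absent in the older backup.
DELETED = object()

def reconstruct(latest, reverse_deltas):
    """Walk backward from the newest backup, applying each delta in turn."""
    snapshot = dict(latest)
    for delta in reverse_deltas:          # newest-to-oldest order
        for name, entry in delta.items():
            if entry is DELETED:
                snapshot.pop(name, None)  # file didn't exist back then
            else:
                snapshot[name] = entry    # older content of this file
    return snapshot

latest = {"a.txt": "v3", "b.txt": "v2"}
deltas = [{"a.txt": "v2"},                    # backup n-1
          {"a.txt": "v1", "b.txt": DELETED}]  # backup n-2
# reconstruct(latest, deltas[:1]) -> {"a.txt": "v2", "b.txt": "v2"}
# reconstruct(latest, deltas)     -> {"a.txt": "v1"}
```

Restoring the latest backup costs nothing extra, while restoring backup n-k costs k delta applications - hence the appeal of an occasional "intermediate" full tree if k gets large.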
I think storing backups as reverse deltas is a *good* idea, since the whole distinction between incremental and full has become a bit vague when using rsync anyway. Hopefully, you will still have the ability to verify checksums (occasionally) to refresh integrity.

Also, it might be helpful to be able to "manually" create an "intermediate" full tree, either as a way of adding redundancy or in case the delta re-creation starts taking too long. Perhaps this could be specified as a parameter (e.g., start a new tree every X backups) or triggered manually.

> That leaves the database question. Rather than use an external
> database my plan is to keep track of the reference count changes
> and update the reference counts only daily, since that's how
> often the information is needed (for cleaning). There are some
> open design issues around integrity and race conditions, and I
> will need an fsck-type utility to handle the case when there is
> a non-clean shutdown.

Given the past flame-wars, I think your approach is a good compromise. The elimination of hard links helps separate the data from the database (which in your case is really just the filesystem tree of attrib files). That said, as you code, it might be helpful to write the attrib and tree access code in an abstracted way that would make it easier to move to an independent database, in case such an approach becomes advantageous in the future. In fact, if I am understanding it correctly, your new approach is conceptually not that far from a pure database approach - it is really just a (flat) database broken up into multiple pieces distributed across the pc tree.

> Despite the significant changes in storage, I'm trying to make 4.x
> generally backward compatible. A 3.x pool will gracefully migrate to
> an MD5 4.x pool (ie: pool files will be migrated when used in new
> backups) and old 3.x backups will be browsable/restorable.
> However,
> one likely design decision is that it will be required that the first
> 4.x backups will have to be brand new fulls.

Sounds good. Will there also be a routine to migrate a 3.x pool directly and immediately to the 4.x format? Since some of us keep old backups essentially forever, a graceful migration would never fully get there - so it would be nice to have an offline manual way of converting old backups to the new format, and then to "throw away" the old 3.x pc tree once one is comfortable that everything has migrated safely.

In summary, this all sounds AWESOME - can't wait to see it!

_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/