This describes a possible new feature for dirvish, which might save backup disk space and network time.
The dirvish "branch" option is useful for those who run many machines with the same distribution or similar images. When the initial image for a branch is made, it may be hard-linked to another existing branch, saving backup space. Over time, however, the branch images diverge unnecessarily, especially with automatic upgrades enabled. For example, machines alpha, beta, and gamma may all start out with identical copies of gimp; if they are set up as branch images, they will all point their backup images at the same gimp files through hard links, and each evening's backup will make three more hard links to each of those files. However, if gimp is updated, new gimp files will appear in each of the branch backup images, not linked to the others. They are all the same files, but they are not hard-linked, wasting backup disk space.

Rsync hard linking is performed using the --link-dest option. When dirvish makes a new daily image, it uses --link-dest pointing at the previous daily image. When "dirvish --init" is used to make a branch, it uses --link-dest pointing at the image it is branching from. After the branch is made, though, and after files are upgraded, there is no further connection between the upgraded files in those images.

Since version 2.6.4, rsync has allowed multiple --link-dest target directories. If a file is not found in the first --link-dest target, rsync will search the next target, then the next, and so forth. This is more work for rsync (I don't know how much more!), but it creates more opportunities for hard linking and space savings. It also saves network transfer time.

If we wanted to incorporate this behavior into a future version of dirvish, the most obvious (and incorrect!) thing to do would be to give branch vaults two --link-dest targets: the first pointing at the branch's previous image, the second at the branch image it originally branched from.
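As a concrete sketch of that two-target idea, here is roughly what the generated rsync command could look like. The vault paths, image names, and variable names are all illustrative, not dirvish internals, and the script only prints the command rather than running it:

```shell
#!/bin/sh
# Illustrative only: the paths and image names below are made up.
VAULT=/backup/gamma
PREV_IMAGE=$VAULT/image-prev            # the branch's previous daily image
BRANCH_ORIGIN=/backup/alpha/image-orig  # the image it originally branched from
NEW_IMAGE=$VAULT/image-new

# rsync >= 2.6.4 accepts several --link-dest options and checks them in
# order, hard-linking each destination file to the first target that
# already contains an identical copy.
CMD="rsync -a --link-dest=$PREV_IMAGE --link-dest=$BRANCH_ORIGIN gamma:/ $NEW_IMAGE/tree"
echo "$CMD"
```

Files untouched since yesterday link against the first target as before; files that changed here but match the origin branch link against the second, instead of being transferred again.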
Thus, if beta and gamma were originally branches of alpha, and alpha was updated, and beta and gamma were subsequently updated, then they would all share the hard links. However, alpha might get updated, or backed up, after beta and gamma. This is the case for my machines: my "alpha" machine for a new distro or a new dirvish backup disk is a low-priority test machine which can tolerate long initialization times. My high-priority machines are branches off of that (they initialize much faster that way). During normal operation, though, I back up the high-priority machines first, because I need them back in production soonest.

A more robust (and more time-consuming) way is to do "all to all" --link-dest: every branch checks every other branch before deciding to create a new file. That would maximize hard-link possibilities, but checking all those other N-1 branches may take a long time, possibly resulting in an order N^2 slowdown for every changing file. Since most files don't change, that may not be much slowdown in absolute terms, and it would save network bandwidth, so it may still be a win in time savings. A lot depends on the behavior of rsync.

A possible compromise is to use two --link-dest targets, and point the second at the previously backed-up image in the vault. Thus, if three machines are backed up on day N in the order beta(N), alpha(N), gamma(N), then branch image gamma(N) will point its first --link-dest at image gamma(N-1), and its second --link-dest at image alpha(N). beta(N+1) will point at beta(N) and then gamma(N). This would require keeping track of the last successful backup in a vault; the easy way is to leave a symlink after each successful backup, and always use that symlink as the second --link-dest target.
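A minimal sketch of the bookkeeping that compromise would need, assuming all branch images live in one vault directory. The symlink name "last-good" and the helper function are my inventions, not existing dirvish behavior, and the rsync call is left as a comment:

```shell
#!/bin/sh
# Sketch of "link to the previous image in the vault" bookkeeping.
VAULT=$(mktemp -d)   # stands in for a vault holding several branches

backup_image() {
    # $1 = image name to create, e.g. "beta-N"
    prev_in_vault=$(readlink "$VAULT/last-good" 2>/dev/null)
    mkdir -p "$VAULT/$1/tree"
    # A real run would be roughly:
    #   rsync -a --link-dest=<this branch's previous image> \
    #         ${prev_in_vault:+--link-dest="$prev_in_vault"} host:/ "$VAULT/$1/tree"
    ln -sfn "$VAULT/$1" "$VAULT/last-good"   # record this as the newest image
}

# Day N, backed up in the order beta, alpha, gamma:
backup_image beta-N     # second --link-dest: whatever last-good held before
backup_image alpha-N    # second --link-dest: beta-N
backup_image gamma-N    # second --link-dest: alpha-N
echo "last-good -> $(readlink "$VAULT/last-good")"
```

Updating the symlink only after a backup completes means a failed image never becomes a --link-dest target for the next machine.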
This might miss some file-linking opportunities, especially if updates on some machines are running at the same time as backups, and it might also fail to connect identical files with slightly differing change times if the checksum option is not used.

This "link-to-previous-image-in-the-vault" behavior would be relatively easy to implement as a new option. Perhaps you have better ideas. Perhaps it isn't worth it, because it uses extra disk accesses to save some disk space and network bandwidth, and that may not be a good tradeoff. What do you think?

Keith

--
Keith Lofstrom          [EMAIL PROTECTED]          Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
_______________________________________________
Dirvish mailing list
Dirvish@dirvish.org
http://www.dirvish.org/mailman/listinfo/dirvish