Craig Barratt wrote at about 23:24:21 -0800 on Wednesday, March 2, 2011:
 
 > > By extension, would it be possible to write all the routines in
 > > checksum agnostic fashion so that the choice of MD5 or sha256sum or
 > > whatever could be made by the user by just supplying the right perl
 > > calling routine or wrapper. This would make it possible for users to
 > > use whatever checksum they feel comfortable with -- potentially
 > > trading off speed for reliability
 > 
 > Let me think about this.  I want to match the full-file checksum
 > in rsync because it allows some clever things to be done.  For
 > example, with some hacks, it would allow a new client file to
 > be matched even the first time if it is in the pool and you use
 > the --checksum option to rsync (this requires some changes to
 > the server-side rsync).

I expanded upon my case for doing this in another email, but let me
add some detail here too.

First, the one-time cost of doing a SHA-256 or SHA-512 checksum is
going to be trivial relative to the cost of reading that file across
the network and writing it to disk. In contrast, the cost of having to
read, decompress, and byte-compare the files each time they are
encountered because of collision potential is significant, and it is
the gift that keeps on giving...

Second, while I exclusively use rsync, I think that as long as
multiple transfer types are supported, the basic design of BackupPC
should not be driven by the fact that rsync now happens to use
md5sums (and before used md4sums, etc.).
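
To make the checksum-agnostic idea concrete, here is a minimal sketch
of what a user-selectable pool digest could look like. The
$Conf{PoolDigest} name and the wrapper itself are just assumptions
for illustration, not an existing or proposed BackupPC interface:

    # Sketch of a checksum-agnostic pool digest.  $Conf{PoolDigest}
    # and this wrapper are hypothetical, not real BackupPC code.
    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    my %digestFactory = (
        md5    => sub { Digest::MD5->new },
        sha256 => sub { Digest::SHA->new(256) },
        sha512 => sub { Digest::SHA->new(512) },
    );

    sub poolDigest {
        my ($file, $algo) = @_;
        my $new = $digestFactory{$algo}
            or die "unknown pool digest algorithm: $algo";
        my $digest = $new->();
        open(my $fh, "<", $file) or die "can't read $file: $!";
        binmode($fh);
        $digest->addfile($fh);       # one streaming pass over the file
        close($fh);
        return $digest->hexdigest;   # would become the pool file name
    }

    # e.g.:  my $name = poolDigest($path, $Conf{PoolDigest} || "md5");

Everything above the transfer layer would then just treat the digest
as an opaque hex string of whatever length the chosen algorithm
produces.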

Regarding your point about clever things that could be done using
rsync checksums, I would suggest that a more rsync-agnostic
alternative would be to add a parameter to the attrib file that stores
the md5sum whenever the rsync transfer method is used. Though I guess
this would not help for the first-time hack.

Alternatively, if you want the first-time hack to work, then you could
make the pool file name equal to <md5sum>_<SHA-256sum>, which would
still be smaller than a SHA-512 sum, and I would wager that we are
unlikely ever to start seeing lots of files with simultaneous
collisions of the md5 and the SHA-256 checksums. In a sense, the
SHA-256 checksum would act like a unique chain suffix, and since it
would always be there you would never have to actually decompress and
compare the files to see if a chain is necessary. Plus you would then
have two essentially independent checksums built into the file name.
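
As a rough sketch of the hybrid name idea (illustration only, not
proposed BackupPC code), both digests can be computed in a single
pass over the data as it is read or received, so the extra cost is
essentially just CPU:

    # Sketch: compute MD5 and SHA-256 in one pass and build the
    # proposed <md5sum>_<SHA-256sum> pool file name.
    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    sub hybridPoolName {
        my ($file) = @_;
        my $md5 = Digest::MD5->new;
        my $sha = Digest::SHA->new(256);
        open(my $fh, "<", $file) or die "can't read $file: $!";
        binmode($fh);
        while (read($fh, my $buf, 1 << 20)) {   # 1 MB chunks
            $md5->add($buf);
            $sha->add($buf);
        }
        close($fh);
        return $md5->hexdigest . "_" . $sha->hexdigest;
    }

The resulting name is 32 + 1 + 64 = 97 characters, versus 128 for a
bare SHA-512 hex digest.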

 > > Will there be a routine to convert legacy 3.x stored backups to 4.x
 > > format?
 > 
 > The pool effectively gets migrated as each pool file is needed for 4.x.
 > But the old backups are not.
 > 
 > > In some ways having to keep around all the 3.x hard-links would defeat
 > > a lot of the easy copying benefits of 4.x or would require using a
 > > separate partition. 
 > 
 > It's a good point.  However, such a utility would probably have a very
 > long running time.  It wouldn't be reversible, and I would be very
 > concerned about cases where it was killed or failed before finishing.
 > 

I would be a strong proponent of having such a conversion routine,
since I can't imagine wanting a mixed environment where the 3.x
backups either remain unchanged forever or slowly, but never
completely, migrate over to 4.x as individual pool files get
converted. This mixed case would be worse from my perspective than
just staying with 3.x.

I don't mind the long running time since, as I pointed out in my other
posting, the migration can be done in the background while 4.x runs. I
would also make a partition backup of my old 3.x system just in case.
Regarding the killed-or-failed concern, at most I imagine one file
would be corrupted, which is no worse than the potential for
single-file corruption in your approach, where files get migrated
individually as they reappear in the 4.x context. Even that could be
avoided with an approach that only removes old 3.x pc-tree backups
after each individual backup has been fully converted to 4.x. The pool
files would only be cleaned up after their nlinks drops to 1 (as with
the 3.x BackupPC_nightly).

The proposed approach would look like this (a rough sketch in code
follows the list):

1. Iterate through each host/backup in reverse numerical order (since
   deltas in 4.x are reverse-time).
2. Recurse through the shares in each backup and create a new 4.x
   filled or unfilled backup equivalent to the 3.x backup. This
   involves:
   - Creating new 4.x directories and attrib files where necessary,
     based on the filled/unfilled condition and the reverse deltas.
   - Copying (or maybe hard linking) each old 3.x pool file to a new
     4.x-named file if the file is not already present in the 4.x
     pool, as determined by the full-file checksum. (Note: hard
     linking is probably not a good idea for rsync since the rsync
     digests won't be the same.)
   - After each individual backup is converted, deleting it from the
     3.x pc tree (this works since in 3.x you can always delete the
     most recent backup and we are going in reverse numerical order).
   - Optionally, once in a while running the 3.x BackupPC_nightly on
     the 3.x pool to remove orphans as 3.x backups are deleted.
     Alternatively, waiting until the end and just deleting the
     entire pool.
   - Optionally, speeding up the process by creating an
     inode-sequenced pool (analogous to my code in
     BackupPC_copyPcPool) where each file is named by the inode of a
     corresponding pool file and contains the md5sum of that pool
     file. The md5sum (or ideally a better checksum, per my previous
     post) would then be used to see if the encountered backup file
     is already in the 4.x pool and, if not, what it should be named.
     This lookup pool allows the md5sum to be calculated just once
     and then looked up as needed based on the inode number.
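
Here is the rough sketch referred to above. The $TopDir path is an
assumption, the real conversion work is reduced to comments, and
nothing here is existing BackupPC code; only the ordering of the loop
and the inode-keyed md5sum cache are fleshed out:

    # Rough sketch of the proposed 3.x -> 4.x migration loop.
    use strict;
    use warnings;
    use Digest::MD5;

    my $TopDir = "/var/lib/backuppc";    # assumed location

    # Lookup cache: inode number -> md5sum of the corresponding 3.x
    # pool file, so each pool file is digested only once no matter
    # how many pc-tree hardlinks point at it.  (For a compressed pool
    # the digest would need to be of the uncompressed content;
    # omitted here.)
    my %md5ByInode;

    sub poolFileMd5 {
        my ($path) = @_;
        my $inode = (stat($path))[1];
        return $md5ByInode{$inode} if defined $md5ByInode{$inode};
        open(my $fh, "<", $path) or die "can't read $path: $!";
        binmode($fh);
        my $md5 = Digest::MD5->new;
        $md5->addfile($fh);
        close($fh);
        return $md5ByInode{$inode} = $md5->hexdigest;
    }

    for my $hostDir (glob("$TopDir/pc/*")) {
        opendir(my $dh, $hostDir) or next;
        my @backupNums = sort { $b <=> $a } grep { /^\d+$/ } readdir($dh);
        closedir($dh);
        for my $bkupNum (@backupNums) {      # reverse numerical order
            # 1. Recurse through "$hostDir/$bkupNum", creating the 4.x
            #    directories, attrib files, and reverse deltas.
            # 2. For each file, use poolFileMd5() to decide whether it
            #    is already in the 4.x pool and, if not, what to name it.
            # 3. Delete "$hostDir/$bkupNum" from the 3.x pc tree; safe
            #    because 3.x always allows removing the newest backup.
        }
        # Optionally run the 3.x BackupPC_nightly now and then to reap
        # 3.x pool files whose link count has dropped to 1.
    }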

Also, I would feel much more comfortable monitoring a single process
focused just on this conversion, with defined start and stop points,
than an ongoing process that haphazardly and incompletely renames 3.x
pool files to 4.x as they are re-encountered.


 > I was thinking I would add a utility that lists all the 4.x directories
 > (below $TOPDIR) that should be backed up.  Those wouldn't have hardlinks.
 > That could be used with a user-supplied script to backup or replicate
 > all the 4.x backups.
 > 
 > > Couple of questions:
 > 
 > > 1. Does that mean that you are limited to 2^14 client hard links?
 > 
 > No.  It's 2^14 files, each of which could, in theory subject
 > to memory, hold a large number of entries.

That's what I meant. But what if you have more than 2^14 different
files that have hard links? Say I (stupidly) want to use BackupPC 4.x
to back up an old BackupPC 3.x archive, which will surely have more
than 2^14 hard-linked files.

 > 
 > > 2. Does the hardlink attrib file contain a list of the full file paths
 > >    of all the similarly hard-linked files?
 > 
 > No.  Just like a regular file system, the reverse lookup isn't easy.
 > 
 > >    This would be helpful for programs like my BackupPC_deleteFile
 > >    since if you want to delete one hard linked file, it would be
 > >    necessary presumably to decrement the nlinks count in all the
 > >    attrib files of the other linked files.
 > 
 > No, that's not necessary.  The convention is that if nlinks >= 2
 > in a file entry, all that means is the real data is in the inode
 > entry (and the nlinks could actually be 1).

Yes, but it gets messy when the last hard link is removed and nlinks
drops to 1, yet you don't know which is the last hard-linked file.
True, it would still work, but it would look like nlinks > 1 when in
reality there is just a single link.

 > 
 > >    Plus in general, it is nice
 > >    to know what else is hard-linked without having to do an exhaustive
 > >    search through the entire backup.
 > > 
 > >    In fact, I would imagine that the attribYY file would only need to
 > >    contain such a list and little if any other information since that
 > >    is all that is required presumably to restore the hard-links since
 > >    the actual inode properties are stored (redundantly) in each of the
 > >    individual attrib file representations.
 > 
 > True.  But you have to replicate and update all the file attributes
 > in that case.  So any operation (like chmod etc) would require all
 > file attributes to be changed.

Good point. But do you ever do that in practice?
I think, though, that I need to better understand how you propose
handling hard links, since I'm sure I am missing something in the details.

 
 > >    Is it always still necessary to compare actual file contents when
 > >    adding a new file just in case the md5sums collide?
 > 
 > Yes.

Yuck... I still think it is better to compute a strong checksum once,
which adds minimal CPU overhead on top of the rate-limiting file
transfer and disk write speeds, than to have to do the SLOW operations
of disk reading, decompressing, and file comparison each time. I know
I am repeating myself, but I think this is key.

 > 
 > >    If so, would it be better to just use a "better" checksum so that
 > >    you wouldn't need to worry about collisions and wouldn't have to
 > >    always do a disk IO and CPU consuming file content comparison to
 > >    check for the unlikely event of collision?
 > 
 > Good point, but as I mentioned above I would like to take advantage
 > of the fact that it is the same full-file checksum as rsync.

See my above suggestion of using a hybrid file name:
        <md5sum>_<SHA-256sum>

 > 
 > > 2. If you are changing the appended rsync digest format for cpool
 > >    files using rsync, I think it might be helpful to also store the
 > >    uncompressed filesize in the digest. There are several use cases
 > >    (including verifying rsync checksums where the filesize is required
 > >    to determine the blocksize) where I have needed to decompress the
 > >    entire file just to find out its size (and since I am in the pool
 > >    tree I don't have access to the attrib file to know its size).
 > 
 > Currently 4.x won't use the appended checksums.  I'll explain how
 > I'm implementing rsync on the server side in another email.  I could
 > add that for later, but it is more complex.

I imagine that knowing the full-file md5sum will save you when the
files are unchanged, but that the absence of the block digests will
prevent you from taking advantage of some of the power of the rsync
delta functionality. (On the other hand, I'm not quite sure how that
even works in 3.x, since the block digests presumably won't match if
the file size changes enough to result in a different block size.)
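
Coming back to my suggestion above about storing the uncompressed file
size, the storage side could be as simple as a small fixed trailer.
This is a purely hypothetical sketch; the 8-byte layout is invented
for illustration and is not the actual cpool or rsync-digest format:

    # Hypothetical sketch: keep the uncompressed size as an 8-byte
    # little-endian trailer at the end of a pool file, so the size
    # can be read without inflating the whole file.  Invented layout,
    # not the real BackupPC format.  (pack "Q<" needs a 64-bit perl.)
    use strict;
    use warnings;

    sub appendSizeTrailer {
        my ($poolFile, $uncompressedSize) = @_;
        open(my $fh, ">>", $poolFile) or die "append $poolFile: $!";
        binmode($fh);
        print $fh pack("Q<", $uncompressedSize);
        close($fh);
    }

    sub readSizeTrailer {
        my ($poolFile) = @_;
        open(my $fh, "<", $poolFile) or die "read $poolFile: $!";
        binmode($fh);
        seek($fh, -8, 2) or die "seek $poolFile: $!";   # 2 == SEEK_END
        read($fh, my $buf, 8) == 8 or die "short read on $poolFile";
        close($fh);
        return unpack("Q<", $buf);
    }

That would let a pool-walking tool (or an rsync checksum verifier that
needs the file size to determine the block size) learn the length with
one seek instead of a full decompress.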

 > 
 > >    Similarly, it might be nice to *always* have the md5sum checksum
 > >    (or other) appended to the file even when not using the rsync
 > >    transfer method. This would help with validating file
 > >    integrity. Even if the md5sum is used for the pool file name, it
 > >    may be nice to allow for an alternative more reliable checksum to
 > >    be stored in the file envelope itself.
 > 
 > Interesting idea.

Of course this wouldn't be necessary if the file name becomes
   <md5sum>_<SHA-256sum>
If this convention is used, it might be nice to store the two as
separate attributes in the attrib file, since it is easier
(i.e. faster) to join the two parts than to use a regex to decompose
them.
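
For example (the field names are made up for illustration), the attrib
entry could carry the two digests as separate values, and the pool
file name would then just be a cheap join:

    # Hypothetical attrib fields; the hex values shown are the MD5 and
    # SHA-256 digests of the empty string, used purely as placeholders.
    use strict;
    use warnings;

    my $attrib = {
        digestMD5    => "d41d8cd98f00b204e9800998ecf8427e",
        digestSHA256 => "e3b0c44298fc1c149afbf4c8996fb924"
                      . "27ae41e4649b934ca495991b7852b855",
    };
    my $poolName = join("_", $attrib->{digestMD5}, $attrib->{digestSHA256});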
 > 
 > > 4. Do you have any idea whether 4.0 will be more or less resource
 > >    intensive (e.g., disk IO, network, cpu) than 3.x when it comes to:
 > >            - Backing up
 > >            - Reconstructing deltas in the web interface (or fuse
 > >              filesystem)
 > >            - Restoring data
 > 
 > I hope it will be a lot more efficient, but I don't have any data
 > yet.  There are several areas where efficiency will be much better.
 > For example, with reverse-deltas, a full or incremental backup with
 > no changes shouldn't need any significant disk writes.  In contrast
 > 3.x has to create a directory tree and in the case of a full make
 > a complete set of hardlinks.
 > 
As per my previous post, my concern is what happens on low memory
machines (my 65MB RAM NAS example) with large poolCnt files.

 > Also, rsync on the server side will be based on a native C rsync.
 > I'll send another email about that.



 > > Most importantly, 4.0 sounds great and exciting!
 > 
 > I hope to get some time to work on it again.  Unfortunately I haven't
 > made any progress in the last 4 months.  Work has been very busy.

My vote would be to take your time to get it right -- since this is a
once in a blue moon opportunity to make significant changes to the
formats and algorithms. 3.x is really solid and for me it will be
worth the wait to have a robust, well-thought out, and extensible 4.x
implementation. Once you release 4.x, it will be much harder to make
significant changes and I would imagine that you don't expect to
embark on another significant rewrite for a long, long time.
