Rsync's method of taking advantage of information common to two machines to speed the transfer of additional information is interesting philosophically, and I suspect that additional performance improvements might be possible if the rsync algorithm were exploited more fully. For instance, consider the following speculations about behavior that could be added as an option to rsync:
Chris Shoemaker wrote:
> [...] where log.4 appears to be a missing file but is really just a
> renamed log.3. And, log.3, log.2 and log.1, will probably be
> retransmitted in full (there's a problem for another day, but this is
> why I was thinking of a hashtable of all files --checksums). [...]

If rsync is going to make a table of file checksums beforehand, it could compute the rsync-algorithm hashes of all the blocks of all the receiver's files in the transfer at the very beginning and send all the hashes to the sender. That way, rsync can efficiently handle not only renamed or moved files but also files that were split, joined, or otherwise rearranged. The disadvantage is that, in a very large transfer, there will be lots of block hashes, so the sender will need a lot of memory and a hash table with a lot of buckets so lookup is efficient.

Traditionally, if two files associated with the same path in the transfer pass rsync's quick check, they are considered identical for rsync's purposes. Consider this: after the sender constructs a nice, organized hash table from the gigantic list of receiver block hashes, it dumps the hash table into a cache file, noting which receiver file each block hash came from /and/ that file's size, mtime, and checksum (if the checksum was ever computed) on the receiver at the time of the transfer. At the beginning of future transfers, the sender reads the cache into memory in bulk and sends the expected file metadata from the cache along with the file list. If a receiver file matches the corresponding cache clump according to the quick check, then the sender already has the file's block hashes in memory and the receiver doesn't need to do anything! If the file does not match, the sender discards its cache clump, the receiver computes the hashes, and the sender stores them in the table.
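To make the idea concrete, here is a minimal Python sketch of the two pieces described above: a block hash table spanning all of the receiver's files, and a cache of "clumps" that are reused whenever a file passes a quick check on size and mtime. None of this is rsync code. The names (CacheEntry, refresh_cache, build_table, find_block) are hypothetical, both sides are simulated in one process, zlib.adler32 merely stands in for rsync's rolling weak checksum, SHA-256 stands in for its strong checksum, and the block size is tiny for illustration.

```python
import hashlib
import zlib
from dataclasses import dataclass

BLOCK_SIZE = 4  # unrealistically small, for illustration only


def weak_sum(block: bytes) -> int:
    # Stand-in for rsync's rolling weak checksum (not rolled here).
    return zlib.adler32(block)


def strong_sum(block: bytes) -> bytes:
    # Stand-in for rsync's strong per-block checksum.
    return hashlib.sha256(block).digest()


@dataclass
class CacheEntry:
    """One cache 'clump': file metadata plus that file's block hashes."""
    size: int
    mtime: int
    blocks: list  # [(weak, strong, offset), ...]


def hash_blocks(data: bytes) -> list:
    out = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        out.append((weak_sum(block), strong_sum(block), offset))
    return out


def refresh_cache(cache: dict, receiver_files: dict) -> list:
    """Update cache in place; return the paths that had to be rehashed.

    receiver_files maps path -> (mtime, data). In a real transfer only
    the metadata would cross the wire unless the quick check fails.
    """
    rehashed = []
    for path, (mtime, data) in receiver_files.items():
        entry = cache.get(path)
        # Quick check: same size and mtime means the cached clump is reused.
        if entry is None or (entry.size, entry.mtime) != (len(data), mtime):
            cache[path] = CacheEntry(len(data), mtime, hash_blocks(data))
            rehashed.append(path)
    for path in set(cache) - set(receiver_files):
        del cache[path]  # file no longer exists on the receiver
    return rehashed


def build_table(cache: dict) -> dict:
    """Global table: weak sum -> [(strong sum, receiver path, offset)]."""
    table = {}
    for path, entry in cache.items():
        for weak, strong, offset in entry.blocks:
            table.setdefault(weak, []).append((strong, path, offset))
    return table


def find_block(table: dict, block: bytes):
    """Return (receiver path, offset) holding this block, or None."""
    strong = None
    for cand_strong, path, offset in table.get(weak_sum(block), []):
        if strong is None:
            strong = strong_sum(block)  # computed only on a weak hit
        if cand_strong == strong:
            return path, offset
    return None
```

With a table like this, the sender's log.4 would be looked up block by block and found inside the receiver's log.3, whatever the file names are. Since the table is keyed by block rather than by path, the same lookup would cover split or joined files.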
Along these lines, it might even be possible to use the rsync algorithm itself to synchronize the file lists or block hash lists of the two sides before the transfer of real data begins.

-- 
Matt McCutchen, ``hashproduct''
[EMAIL PROTECTED] -- http://mysite.verizon.net/hashproduct/
