I recently read through the rsync technical report and discovered that
rsync doesn't use the algorithm that I expected. I expected an algorithm
which does a binary search for the differences resulting in a network
utilization of O(min(log(F)*C,F)) where F is the filesize and C is the
number of changed bytes. The rsync algorithm appears to actually be
O(F+C) although the O notation is misleading here because the constants
differ greatly.
The rsync algorithm has several advantages. It is robust against
insertions and avoids incurring high latency by using a two-round
protocol. The binary search algorithm would have a log(F) round protocol,
would require more computation, and would fail to cope with insertions.
This is particularly salient to me because I'm pondering running rsync on
files of size 20GB with a few MB of changes. The standard rsync algorithm
will do poorly here - resulting in a predicted minimum of around 200MB of
network utilization for the default block size. Since it happens that I'm
not going to be doing insertions into the file, the binary search
algorithm would actually do quite well, resulting in utilization around
1KB for a one-byte change in the file.
I've tried altering the block size in rsync. This helps quite a bit. With
a block size of 64KB, I get a network utilization for identical copies
which is around x10 better. (side note: on an identical file sync, I
observed that network utilization _increased_ when I set block-size=1024.
I have no explanation for this. Do you?) But 20MB would be still
noticeably worse then 1KB, especially over slow links.
So my question is: has anyone thought about a method for mixing binary
search and the rsync algorithm? It seems like it might be possible.
First of all, instead of using a binary search, use a n-ary search for
some reasonably large n. And then, try to make the n-ary search robust
against insertion using the same hash table tricks appearing in the rsync
algorithm.
Has anyone thought about this?
-John