John Langford [[EMAIL PROTECTED]] writes:

> This is particularly salient to me because I'm pondering running
> rsync on files of size 20GB with a few MB of changes. The standard
> rsync algorithm will do poorly here - resulting in a predicted
> minimum of around 200MB of network utilization for the default block
> size.  Since it happens that I'm not going to be doing insertions
> into the file, the binary search algorithm would actually do quite
> well, resulting in utilization around 1KB for a one-byte change in
> the file.
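
Just to put arithmetic next to that estimate: whatever combination of
default block size and per-block signature size you had in mind, the
calculation is simply blocks x signature bytes.  The two constants
below are my own guesses for illustration, not values from the rsync
source:

    file_size = 20 * 2**30        # 20GB file
    block_size = 2 * 2**10        # assume roughly 2KB blocks
    sig_bytes = 20                # assume ~20 bytes of checksums/block

    blocks = file_size // block_size
    print(blocks * sig_bytes / 2**20, "MB of signatures")   # 200.0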

Can you expand on this further?  It seems to me that to locate that
one-byte change you still need access to the remote side, so you must
somehow communicate information about the other "blocks" in the file
to determine that only that one byte changed.

Unless you mean that the communication occurs in real time as the
binary search is being performed, so that both sides end up zeroing
in on the single portion of the file that has the one-byte change.

But then what if the change is more widespread and shows up in both
halves of the binary search?  To me a binary search implies zeroing in
on a single source of change, but you'd need to allow for the
possibility that multiple branches of that search have to be resolved
to cover the total set of changes.  And that's not limited to
insertion changes - any change that crossed a binary division during
the search would fail to resolve to a single branch to continue the
division.
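
To make what I'm picturing concrete, here's a rough sketch (Python,
purely illustrative - remote_digest and everything else here is a
placeholder of mine, not anything from rsync) of that kind of
interactive search.  Note that it has to recurse into *every* half
whose checksum differs, not just one, and that it assumes the two
copies are the same length (i.e. no insertions):

    import hashlib

    def differing_ranges(remote_digest, local_file, offset, length,
                         min_chunk=64 * 1024):
        # remote_digest(offset, length) stands in for a round trip
        # returning the remote side's strong checksum of that range.
        # Returns (offset, length) ranges that still differ once they
        # are min_chunk or smaller.
        local_file.seek(offset)
        local = hashlib.md5(local_file.read(length)).digest()
        if local == remote_digest(offset, length):
            return []                    # this whole range matches
        if length <= min_chunk:
            return [(offset, length)]    # small enough - just send it
        half = length // 2
        # A change straddling the midpoint, or several separate
        # changes, makes *both* recursive calls return ranges.
        left = differing_ranges(remote_digest, local_file,
                                offset, half, min_chunk)
        right = differing_ranges(remote_digest, local_file,
                                 offset + half, length - half,
                                 min_chunk)
        return left + right

Every call to remote_digest there is a network round trip, so on a
high-latency link the depth of that recursion costs you something
even when the bandwidth saving is large.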

In any event, for this to work, don't you have to keep computing a
heavy checksum on each binary division of the file in order to detect
the difference?  I suppose if CPU were no problem and bandwidth very
limited it might be worth it, but an MD4 over 20GB is going to take a
while (and then 10GB, 5GB, etc...), so it might actually be more
time-efficient to just transmit the data, even on a slow link.
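
Descending one path of that search you end up reading and hashing
roughly twice the file size in total - a quick check of the geometric
series; hashing *both* halves at each level to see which side differs
adds about another file's worth on top of that:

    size_gb, total = 20.0, 0.0
    while size_gb >= 0.001:        # stop around the 1MB level
        total += size_gb           # 20 + 10 + 5 + ...
        size_gb /= 2
    print(round(total, 1), "GB hashed")   # 40.0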

For a 20GB file (assuming large 64K blocks, and with compression
enabled), that's probably about 2MB of checksum data being
transmitted, which is about 15 minutes on a poor-quality analog
connection (~2200cps).  On my setup, with a ~400MHz PIII across a
100bT network connection, it would probably take me about 50 minutes
to compute that checksum.  I'm sure there are faster CPUs, but it's
close enough that it might be more wall-clock efficient overall to
just use the current scheme, even with smaller blocksizes - although
if your changes tend to be non-insertion, you may as well use the
largest blocksize you can.
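
The arithmetic behind those numbers, for anyone who wants to plug in
their own assumptions (the ~20 bytes of signature per block and the
~7MB/s checksum rate are my guesses; the rest is as above):

    file_size = 20 * 2**30             # 20GB
    block_size = 64 * 2**10            # 64K blocks
    sig_bytes = 20                     # assumed checksum bytes/block

    raw = file_size // block_size * sig_bytes
    print(raw / 2**20, "MB of checksums before compression")  # 6.25

    transfer = 2 * 2**20 / 2200.0      # ~2MB compressed at ~2200cps
    print(transfer / 60, "minutes to transmit")               # ~16

    hash_rate = 7 * 2**20              # assume ~7MB/s to read + MD4
    print(file_size / hash_rate / 60, "minutes to checksum")  # ~49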

> So my question is: has anyone thought about a method for mixing
> binary search and the rsync algorithm?  It seems like it might be
> possible.  First of all, instead of using a binary search, use a
> n-ary search for some reasonably large n.  And then, try to make the
> n-ary search robust against insertion using the same hash table
> tricks appearing in the rsync algorithm.
> 
> Has anyone thought about this?

It certainly sounds interesting, although I would think some testing
(even if only theoretical analysis) would be useful to determine just
how much overhead the additional checksums create, traded off against
the potential for less network traffic.
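
For anyone who hasn't read the rsync paper, the "hash table tricks" in
question are (as I understand them) the two-level match: a cheap
rolling checksum that can be slid a byte at a time, used to look up
candidate blocks in a hash table of the other side's block signatures,
confirmed by a strong checksum.  A toy version of the lookup side -
not the actual rsync code, and this weak sum isn't implemented to
actually roll:

    import hashlib

    def weak_sum(data):
        # Stand-in for rsync's rolling checksum (the real one updates
        # in O(1) as the window slides one byte; this one doesn't).
        a = sum(data) & 0xFFFF
        b = sum((len(data) - i) * c
                for i, c in enumerate(data)) & 0xFFFF
        return (b << 16) | a

    def find_matches(old_blocks, new_data, block_size):
        # old_blocks maps weak_sum -> (block_index, strong_digest) for
        # each block of the old file.  Returns (new_offset, old_index)
        # pairs for every block of old data we can reuse.
        matches, i = [], 0
        while i + block_size <= len(new_data):
            window = new_data[i:i + block_size]
            hit = old_blocks.get(weak_sum(window))
            if hit and hashlib.md5(window).digest() == hit[1]:
                matches.append((i, hit[0]))
                i += block_size      # jump past the matched block
            else:
                i += 1               # slide one byte - this is what
                                     # makes it robust to insertions
        return matches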

I know one performance improvement I'd like to see would be for the
receiver to communicate the blocksize for each file to the sender
immediately, as it starts working on that file, so that both receiver
and sender could compute their checksums on that file using that
blocksize in parallel.  For some of my larger files (500-700MB) that's
a significant time issue in my process; the two checksum passes
unfortunately currently occur in series, separated by the time it
takes to transmit the receiver's checksums over an analog line.
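
Roughly what I mean, with completely made-up numbers for one of those
larger files (none of these figures are measurements):

    # Today the two passes are serialized across the slow link; if the
    # sender knew the blocksize up front, it could be checksumming its
    # copy while the receiver's checksums are still in transit.
    recv_hash = 120.0    # receiver checksums its copy (seconds)
    transmit = 300.0     # checksums cross the analog link
    send_hash = 100.0    # sender's pass over its copy

    serial = recv_hash + transmit + send_hash
    overlapped = max(recv_hash + transmit, send_hash)
    print(serial, "s in series vs.", overlapped, "s overlapped")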

-- David

/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: [EMAIL PROTECTED]  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/
