> Can you expand on this further?  It seems to me that to determine that

The recursive algorithm, which does a binary search for changes, works
like this:

1. If the size is smaller than 128 bytes,
   then send it over;
   else each side computes a strong checksum:
        if they are the same, do nothing;
        if they differ, subdivide and go to 1.

This algorithm has network utilization of log(filesize) * 'number of
changes' assuming changes are only bit flips and not insertions or
deletions.  It handles multiple changes reasonably gracefully.
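
To make that concrete, here is a rough sketch of the scheme in Python.
Everything is simplified for illustration: both copies are shown in memory,
the checksum exchange over the wire is glossed over, and MD5 stands in for
the strong checksum (rsync uses MD4).

    import hashlib

    THRESHOLD = 128   # below this size, just send the bytes

    def strong_sum(data):
        # stand-in for the strong checksum (rsync uses MD4; MD5 is used here
        # because it is always available in hashlib)
        return hashlib.md5(data).digest()

    def sync_region(local, remote, lo, hi, send):
        # Recursively narrow down the changed parts of the range [lo, hi).
        # For illustration both copies are in memory; in reality each side
        # only sees its own copy and the checksums cross the network.
        # 'send' is called for every range that actually has to be sent.
        size = hi - lo
        if size <= THRESHOLD:
            send(lo, hi)                  # small enough: just send it over
            return
        if strong_sum(local[lo:hi]) == strong_sum(remote[lo:hi]):
            return                        # same checksum: do nothing
        mid = lo + size // 2              # they differ: subdivide and recurse
        sync_region(local, remote, lo, mid, send)
        sync_region(local, remote, mid, hi, send)

With a single changed byte only one branch of the recursion stays live, so
there are about two checksum comparisons per level of subdivision, which is
where the log(filesize) * 'number of changes' figure comes from.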

> In any event, for this to work, don't you have to keep computing a
> heavy checksum on each binary division of the file in order to detect
> the difference?  I suppose if CPU was no problem and bandwidth very
> limited that it might be worth it, but a 20GB MD4 is going to take a
> while (and then a 10GB, 5GB, etc...), so it might actually be more
> time efficient to just transmit the data even on a slow link.

The computational cost is O(file size) if changes are rare and O(file size
* log(file size)) if they are not.  The rsync algorithm also has to run the
strong checksum over O(file size) data when there are few changes, because
the weak checksum gets many hits that each have to be confirmed.
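
One way to see where those bounds come from is to count the bytes one side
hashes in the recursive scheme above.  This is only a counting model, with
the same in-place-changes assumption as before:

    def hashed_bytes(size, change_offsets, threshold=128):
        # Count the bytes one side runs the strong checksum over while
        # locating the changes.  A region is subdivided further only if it
        # still contains at least one changed offset.
        total = 0
        def walk(lo, hi):
            nonlocal total
            if hi - lo <= threshold:
                return                    # sent outright, no more hashing
            total += hi - lo              # strong checksum over this region
            if not any(lo <= c < hi for c in change_offsets):
                return                    # checksums match: stop here
            mid = lo + (hi - lo) // 2
            walk(lo, mid)
            walk(mid, hi)
        walk(0, size)
        return total

With one change the total comes out to roughly 3x the file size (the whole
file, then both halves, then ever smaller pieces), so O(file size); with a
change in every block each level re-hashes essentially the whole file,
which is the O(file size * log(file size)) case.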

> For a 20GB file (assuming large 64K blocks, and with compression
> enabled), that's probably about 2MB of data being transmitted, which

I get about 20MB by extrapolation - and compression is not possible here,
since checksum data is essentially random.
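
For reference, the arithmetic behind these estimates looks like the helper
below.  The 20 bytes per block (a 4-byte rolling checksum plus a 16-byte
strong checksum) is my assumption; the real per-block signature size
depends on the rsync version and options, so the exact figure will vary.

    def signature_traffic(file_size, block_size, bytes_per_block=20):
        # Checksum data sent for one file: one signature per block.  The 20
        # bytes per block is an assumption, not rsync's actual wire format.
        blocks = -(-file_size // block_size)   # ceiling division
        return blocks * bytes_per_block

    # e.g. signature_traffic(20 * 2**30, 64 * 2**10) for a 20GB file with
    # 64K blocks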

> more wall-clock efficient overall to just use the current scheme, even
> with smaller blocksizes - although if your changes tend to be
> non-insertion, you may as well use the largest blocksize you can.

That's not quite right - even with non-insertion/non-deletion changes there
is a tradeoff between block size and network efficiency as I understand
it.  The tradeoff kicks in because unmatched blocks are sent whole across
the network: larger blocks mean fewer checksums to send, but every block
touched by a change gets retransmitted in full.
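
Written out, the tradeoff is the sum of two terms pulling in opposite
directions.  This is my own simplification (one full block resent per
change, weak/strong checksum details ignored):

    def estimated_traffic(file_size, block_size, changed_blocks,
                          bytes_per_block=20):
        # Signature traffic shrinks as blocks get larger, but every block
        # containing a change is resent whole, so the literal-data term
        # grows with block size.  The best block size sits in between.
        blocks = -(-file_size // block_size)
        signature = blocks * bytes_per_block    # one signature per block
        literal = changed_blocks * block_size   # unmatched blocks sent whole
        return signature + literal

So even without insertions or deletions, the largest possible block size
only wins when almost nothing in the file has changed.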

> It certainly sounds interesting, although I would think some testing
> (even if theoretical) could be useful to determine just how much
> additional overhead the additional checksums might create as traded
> off versus the potential for less network traffic.

I think in the "few changes" realm the amount of extra checksum work is
only about 2x.  I am going to think about combining the algorithms a bit
more.

> I know that one performance improvement that I'd like to see would be
> for the receiver to communicate the blocksize for each file as it
> works on it immediately to the sender, so that both receiver and
> sender could compute the checksums on the file using that blocksize in
> parallel.  For some of my larger files (500-700MB) that's a

That does seem pretty clearly beneficial and I can't think of a downside
other than programming complexity.

-John

