John Langford [[EMAIL PROTECTED]] writes:
> > For a 20GB file (assuming large 64K blocks, and with compression
> > enabled), that's probably about 2MB of data being transmitted, which
>
> I get about 20MB by extrapolation - and compression is not possible here.
If I'm counting correctly, 20GB at 64K blocks should yield about
300,000-330,000 blocks (depending on how you define GB). The bulk of
the data being transmitted from receiver to sender is a 20-byte chunk
containing the weak and strong checksums for each block. So worst case
(no compression at all) you'd have to transmit 6-6.6MB of data - how
do you arrive at 20MB?
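Just to show my arithmetic (a quick Python back-of-the-envelope sketch,
assuming the same ~20 bytes of checksum metadata per block I used above):

    per_block = 20                          # weak + strong checksum bytes per block, as above
    block = 64 * 1024                       # 64K block size
    for size in (20 * 10**9, 20 * 2**30):   # 20GB, decimal and binary definitions
        blocks = -(-size // block)          # ceiling division
        print(blocks, "blocks ->", blocks * per_block / 1e6, "MB of checksum data")

That prints roughly 305,000-328,000 blocks and 6.1-6.6MB of checksum data,
which is where my 6-6.6MB figure comes from.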
The compression (the -z option to rsync, or even modem compression
like V.42bis) does have some impact here in my experience. I agree the
MD4 checksums themselves should look nicely random, but each one is
computed independently per block, so a file with repeated content
yields repeated checksums and the stream as a whole can still see some
reuse. At least with my 100-700MB files, I seem to get more along the
lines of 6-10 bytes per block (e.g., a 100MB file with a 1024-byte
block size transmits about 600-700K of information from the receiver
to the sender). I haven't measured it scientifically, but I've watched
my transmission counts during testing enough to notice it.
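As a rough illustration of that "reuse" point (not a measurement of rsync
itself - MD5 and adler32 stand in for MD4 and the rolling checksum here,
and zlib stands in for -z): if a file contains duplicate blocks, the
per-block checksums repeat too, and a generic compressor can exploit that
even though each individual digest looks random.

    import hashlib, os, random, zlib

    # ~100MB worth of 1024-byte blocks drawn from a pool of 1000 distinct
    # blocks, so many blocks (and hence their checksums) repeat.
    pool = [os.urandom(1024) for _ in range(1000)]
    sums = [zlib.adler32(b).to_bytes(4, "big") + hashlib.md5(b).digest() for b in pool]
    stream = b"".join(random.choice(sums) for _ in range(100 * 1024))
    print("raw:", len(stream), "bytes  compressed:", len(zlib.compress(stream, 9)))

How much real files compress obviously depends on how much repeated
content they actually have, but it shows why the stream isn't incompressible.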
> > more wall-clock efficient overall to just use the current scheme, even
> > with smaller blocksizes - although if your changes tend to be
> > non-insertion, you may as well use the largest blocksize you can.
>
> That's not quite right - even with noninsertion/nondeletion changes there
> is a tradeoff between block size and network efficiency as I understand
> it. The tradeoff kicks in because unmatched blocks are sent whole across
> the network.
That's sort of what I was getting at. For example, if your file is
only ever appended to, you might as well use the largest block size
you can: it minimizes the metadata exchanged in order to establish
that the bulk of the file is identical, and the new data has to be
transmitted in full anyway, so you may as well move it in as large a
chunk as possible.
Yes, this does mean you may resend a small amount of old data in the
block that straddles the boundary between old and new data, but that
waste is probably in the noise. And yes, if your newly appended data
would match old data already in the file, but only at a smaller block
size, using the larger one may prevent that match. So knowing how the
file in question grows, and what kind of data it contains, may of
course qualify things.
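To put rough numbers on that tradeoff (purely hypothetical figures: a
20GB file with, say, 50MB appended, and the same ~20 bytes of checksum
metadata per block as above):

    # Rough model of an append-only update: checksum metadata shrinks as the
    # block size grows, while the literal data stays roughly the appended
    # bytes plus at most one block of old data at the overlap point.
    old_size = 20 * 2**30                    # existing file on the receiver
    appended = 50 * 2**20                    # hypothetical 50MB of appended data
    per_block = 20                           # checksum bytes per block, as above
    for block in (1024, 8192, 64 * 1024):
        meta = per_block * -(-old_size // block)
        literal = appended + block           # worst case: one extra block resent
        print("%6d: %9d KB of checksums, %8d KB literal" % (block, meta // 1024, literal // 1024))

At a 1K block size the checksum metadata (hundreds of MB) dwarfs the
appended data; at 64K it drops to a few MB while the literal side barely
grows, which is why the largest workable block size wins for this pattern.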
-- David
/-----------------------------------------------------------------------\
\ David Bolen \ E-mail: [EMAIL PROTECTED] /
| FitLinxx, Inc. \ Phone: (203) 708-5192 |
/ 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \
\-----------------------------------------------------------------------/