John Langford [[EMAIL PROTECTED]] writes:

> > For a 20GB file (assuming large 64K blocks, and with compression
> > enabled), that's probably about 2MB of data being transmitted, which
> 
> I get about 20MB by extrapolation - and compression is not possible here.


If I'm counting correctly, 20GB at 64K blocks should yield about
300,000-330,000 blocks (depending on how you define GB).  The bulk of
the data being transmitted from receiver to sender is a 20-byte chunk
containing the weak and strong checksum for each block.  So worst case
(no compression at all) you'd have to transmit 6-6.6MB of data - how
do you arrive at 20MB?
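
Just to show my arithmetic (a rough sketch; the 20 bytes/block figure is
the weak-plus-strong checksum size mentioned above, and I'm ignoring any
protocol framing):

    # Checksum volume for a 20GB file at a 64K block size.
    SUM_BYTES_PER_BLOCK = 20          # 4-byte weak + 16-byte strong checksum
    block_size = 64 * 1024

    for total in (20 * 10**9, 20 * 2**30):       # "20GB", decimal vs. binary
        blocks = total // block_size
        print(blocks, "blocks,", blocks * SUM_BYTES_PER_BLOCK / 1e6, "MB")

    # -> roughly 305,000-328,000 blocks, i.e. about 6.1-6.6MB of checksums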

The compression (the -z option to rsync, or even modem compression
like V.42bis) does have some impact here in my experience.  Although I
agree the MD4 chunks themselves should be nicely random, since each
chunk is computed independently, the stream as a whole may still see
some reuse.  At least with my 100-700MB files, I seem to get more
along the lines of 6-10 bytes per block (e.g., a 100MB file with a
1024-byte block size transmits about 600-700K of information from the
receiver to the sender).  I haven't measured it scientifically, but
I've watched my transmission counts during testing enough to notice
it.
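
For what it's worth, the per-block figure above is just backed out from
the observed byte counts, along these lines (a sketch; the 600-700K
numbers are what I happened to see on the wire, not a measurement of the
checksum payload alone):

    # Effective bytes/block implied by an observed receiver->sender count.
    file_size = 100 * 2**20                     # ~100MB file
    block_size = 1024
    blocks = file_size // block_size            # ~102,400 blocks

    for sent in (600 * 1024, 700 * 1024):       # ~600-700K observed
        print(sent / blocks, "bytes/block")     # ~6-7 bytes per block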

> > more wall-clock efficient overall to just use the current scheme, even
> > with smaller blocksizes - although if your changes tend to be
> > non-insertion, you may as well use the largest blocksize you can.
> 
> That's not quite right - even with noninsertion/nondeletion changes there
> is a tradeoff between block size and network efficiency as I understand
> it.  The tradeoff kicks in because unmatched blocks are sent whole across
> the network.

That's sort of what I was getting at.  For example, if your file is
only ever appended to, you might as well use the largest block size
you can, since that minimizes the meta-data exchanged to deduce that
the bulk of the file is identical; the new data has to be transmitted
in full anyway, so you may as well do it with as large a chunk as
possible.

Yes, this does mean you may send a partial block's worth of old data
at the overlap point between old and new data, but that waste is
probably in the noise.  And yes, if your newly appended data would
match old data already in the file but only at a smaller block size,
using the larger one may prevent that match.  So knowing how the file
in question grows and what data it contains may of course qualify
things.
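
If it helps, here's the kind of rough cost model I have in mind (purely
a sketch, not rsync's actual accounting: it assumes 20 bytes of checksum
per block as above, non-insertion changes so block boundaries stay
aligned, and that every block touched by a change is resent whole,
ignoring compression and framing):

    def estimate_bytes(file_size, block_size, changed_regions, sum_bytes=20):
        """Rough two-way traffic estimate for one file.

        changed_regions: list of (offset, length) spans that differ on
        the sending side.
        """
        blocks = -(-file_size // block_size)        # ceiling division
        checksum_bytes = blocks * sum_bytes         # receiver -> sender

        # Any block overlapping a changed region gets sent whole.
        dirty = set()
        for offset, length in changed_regions:
            first = offset // block_size
            last = (offset + length - 1) // block_size
            dirty.update(range(first, last + 1))
        literal_bytes = len(dirty) * block_size     # sender -> receiver

        return checksum_bytes + literal_bytes

    # e.g. a 20GB file with a single 1MB change in the middle:
    for bs in (8 * 1024, 64 * 1024, 512 * 1024):
        print(bs, estimate_bytes(20 * 2**30, bs, [(10 * 2**30, 2**20)]))

    # Small blocks inflate the checksum cost; very large blocks inflate
    # the literal cost once changes appear - that's the tradeoff.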

-- David

/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: [EMAIL PROTECTED]  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/
