Re: Extremely poor rsync performance on very large files (near 100GB and larger)

2007-10-07 Thread Wayne Davison
On Mon, Jan 08, 2007 at 10:16:01AM -0800, Wayne Davison wrote:
 And one final thought that occurred to me:  it would also be possible
 for the sender to segment a really large file into several chunks,
 handling each one without overlap, all without the generator or the
 receiver knowing that it was happening.

I have a patch that implements this:

http://rsync.samba.org/ftp/unpacked/rsync/patches/segment_large_hash.diff

I'm wondering if anyone has any feedback on such a method being included
in rsync.

..wayne..


Re: Extremely poor rsync performance on very large files (near 100GB and larger)

2007-10-07 Thread Matt McCutchen
On 10/7/07, Wayne Davison [EMAIL PROTECTED] wrote:
 On Mon, Jan 08, 2007 at 10:16:01AM -0800, Wayne Davison wrote:
  And one final thought that occurred to me:  it would also be possible
  for the sender to segment a really large file into several chunks,
  handling each one without overlap, all without the generator or the
  receiver knowing that it was happening.

 I have a patch that implements this:

 http://rsync.samba.org/ftp/unpacked/rsync/patches/segment_large_hash.diff

I like better performance, but I'm not entirely happy with a fixed
upper limit on the distance that data can migrate and still be matched
by the delta-transfer algorithm: if someone is copying an image of an
entire hard disk and rearranges the partitions within the disk, rsync
will needlessly retransmit all the partition data.  An alternative
would be to use several different block sizes spaced by a factor of 16
or so and have a separate hash table for each.  Each hash table would
hold checksums for a sliding window of 8/10*TABLESIZE blocks around
the current position.  This way, small blocks could be matched across
small distances without overloading the hash table, and large blocks
could still be matched across large distances.
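Roughly, the shape would be something like this (the level count, block
sizes, and table size below are made-up numbers for illustration, not a
worked-out design):

    /* Illustrative sketch only: three block sizes spaced by a factor of 16,
     * each with its own fixed-size hash table covering a sliding window of
     * blocks around the sender's current position.  None of these names or
     * numbers come from rsync itself. */
    #include <stdint.h>

    #define NLEVELS    3
    #define TABLESIZE  65536
    #define WINDOW     (TABLESIZE * 8 / 10)     /* blocks kept per table */

    struct level {
        int64_t  block_len;            /* block size at this level         */
        int32_t  slot[TABLESIZE];      /* block index per slot, -1 = empty */
        uint32_t weak[WINDOW];         /* weak sums of the windowed blocks */
        int64_t  first_block;          /* oldest block still in the window */
    };

    static struct level levels[NLEVELS];

    static void init_levels(int64_t smallest_block)
    {
        for (int i = 0; i < NLEVELS; i++) {
            levels[i].block_len   = smallest_block << (4 * i);  /* 16x apart */
            levels[i].first_block = 0;
            for (int j = 0; j < TABLESIZE; j++)
                levels[i].slot[j] = -1;
        }
    }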

Matt


Re: Extremely poor rsync performance on very large files (near 100GB and larger)

2007-01-12 Thread Shachar Shemesh
Evan Harris wrote:
 Would it make more sense just to make rsync pick a more sane blocksize
 for very large files?  I say that without knowing how rsync selects
 the blocksize, but I'm assuming that if a 65k entry hash table is
 getting overloaded, it must be using something way too small.
rsync picks a block size that is the square root of the file size. As I
didn't write this code, I can safely say that it seems like a very good
compromise between too-small block sizes (too many blocks in the hash
table) and too-large block sizes (a decreased chance of finding matches).
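As a back-of-the-envelope illustration (plain arithmetic only, not rsync's
actual sizing code), that rule gives the following for a 100GB file:

    /* Back-of-the-envelope only; this is not rsync's actual sizing code. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double file_len  = 100e9;                 /* 100 GB file        */
        double block_len = sqrt(file_len);        /* ~316 kB per block  */
        double nblocks   = file_len / block_len;  /* ~316,000 blocks    */

        printf("block size     : %.0f kB\n", block_len / 1000);
        printf("blocks         : %.0f\n", nblocks);
        printf("blocks per slot: %.1f (65536-slot table)\n", nblocks / 65536);
        return 0;
    }

That is roughly five entries per slot on average for a 100GB file, which is
why the fixed table gets overloaded.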
 Should it be scaling the blocksize with a power-of-2 algorithm rather
 than the hash table (based on filesize)?
If Wayne intends to make the hash size a power of 2, maybe selecting
block sizes that are smaller will make sense. We'll see how 3.0 comes along.
 I haven't tested to see if that would work.  Will -B accept a value of
 something large like 16meg?
It should. That's about 10 times the block size you need in order to not
overflow the hash table, though, so a block size of 2MB would seem more
appropriate to me for a file size of 100GB.
 At my data rates, that's about half a second of network bandwidth,
 and seems entirely reasonable.
 Evan
I would just like to note that since I submitted the large hash table
patch, I have seen no feedback from anyone actually testing it. If you can
compile a patched rsync and report how it goes, that would be very
valuable to me.

Shachar


Re: Extremely poor rsync performance on very large files (near 100GB and larger)

2007-01-08 Thread Wayne Davison
On Mon, Jan 08, 2007 at 01:37:45AM -0600, Evan Harris wrote:
 I've been playing with rsync and very large files approaching and
 surpassing 100GB, and have found that rsync has excessively poor
 performance on these very large files, and the performance appears to
 degrade the larger the file gets.

Yes, this is caused by the current hashing algorithm that the sender
uses to find matches for moved data.  The current hash table has a fixed
size of 65536 slots, and can get overloaded for really large files.

There is a diff in the patches dir that makes rsync work better with
large files: dynamic_hash.diff.  This makes the size of the hash table
depend on how many blocks there are in the transfer.  It does speed up
the transfer of large files significantly, but since it introduces a mod
(%) operation on a per-byte basis, it slows down the transfer of
normal-sized files significantly.

I'm going to be checking into using a hash algorithm with a table that
is always a power of 2 in size as an alternative implementation of this
dynamic hash algorithm.  That will hopefully not bloat the CPU time for
normal-sized files.  Alternately, the hashing algorithm could be made to
vary depending on the file's size.  I'm hoping to have this improved in
the upcoming 3.0.0 release.
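To illustrate the difference (this is only a sketch, not the code in
dynamic_hash.diff or what I plan to commit):

    /* Illustrative only -- not the actual dynamic_hash.diff code. */
    #include <stdint.h>

    /* Table sized exactly to the block count: every lookup pays for a
     * modulo, and the lookup runs for each byte offset being scanned. */
    static inline uint32_t index_mod(uint32_t weak_sum, uint32_t table_size)
    {
        return weak_sum % table_size;
    }

    /* Table rounded up to a power of two: the modulo becomes a mask,
     * which is much cheaper on a per-byte basis. */
    static inline uint32_t index_mask(uint32_t weak_sum, uint32_t table_size_pow2)
    {
        return weak_sum & (table_size_pow2 - 1);
    }

Rounding the block count up to the next power of two wastes at most a
factor of two in table slots, which seems a small price for turning the
per-byte division into a mask.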

And one final thought that occurred to me:  it would also be possible
for the sender to segment a really large file into several chunks,
handling each one without overlap, all without the generator or the
receiver knowing that it was happening.  The upside is that huge files
could be handled this way, but the downside is that the incremental-sync
algorithm would not find matches spanning the chunks.  It would be
interesting to test this and see if the rsync algorithm would be better
served by using a larger number of smaller blocks while segmenting the
file, rather than a smaller number of much larger blocks while
considering the file as a whole.
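In rough pseudo-C, the sender-side loop would be something like the
following (the segment size and the helper name are placeholders, not
taken from rsync or from any existing patch):

    /* Placeholder sketch of the idea; SEGMENT_LEN and match_segment()
     * are hypothetical names, not from rsync or any existing patch. */
    #include <stdint.h>
    #include <stdio.h>

    #define SEGMENT_LEN ((int64_t)8 << 30)      /* e.g. 8 GB per segment */

    static void match_segment(int64_t offset, int64_t len)
    {
        /* the normal checksum/match pass would run over just
         * [offset, offset + len), with its own modest hash table */
        printf("segment at %lld, length %lld\n",
               (long long)offset, (long long)len);
    }

    static void send_large_file(int64_t file_len)
    {
        for (int64_t off = 0; off < file_len; off += SEGMENT_LEN) {
            int64_t len = file_len - off;
            if (len > SEGMENT_LEN)
                len = SEGMENT_LEN;
            match_segment(off, len);    /* matches cannot span segments */
        }
    }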

..wayne..


Re: Extremely poor rsync performance on very large files (near 100GB and larger)

2007-01-08 Thread Evan Harris
On Mon, 8 Jan 2007, Wayne Davison wrote:

 On Mon, Jan 08, 2007 at 01:37:45AM -0600, Evan Harris wrote:

  I've been playing with rsync and very large files approaching and
  surpassing 100GB, and have found that rsync has excessively poor
  performance on these very large files, and the performance appears to
  degrade the larger the file gets.

 Yes, this is caused by the current hashing algorithm that the sender
 uses to find matches for moved data.  The current hash table has a fixed
 size of 65536 slots, and can get overloaded for really large files.
 ...

Would it make more sense just to make rsync pick a more sane blocksize for
very large files?  I say that without knowing how rsync selects the
blocksize, but I'm assuming that if a 65k entry hash table is getting
overloaded, it must be using something way too small.  Should it be scaling
the blocksize with a power-of-2 algorithm rather than the hash table (based
on filesize)?

I know that may result in more network traffic as a bigger block containing
a difference will be considered changed and need to be sent instead of
smaller blocks, but in some circumstances wasting a little more network
bandwidth may be wholly warranted.  Then maybe the hash table size doesn't
matter, since there are fewer blocks to check.

I haven't tested to see if that would work.  Will -B accept a value of
something large like 16meg?  At my data rates, that's about half a second
of network bandwidth, and seems entirely reasonable.
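Quick arithmetic to sanity-check that (the 32 MB/s figure below is only
what "16 MB in about half a second" implies, not a measured rate):

    /* Rough arithmetic only; the 32 MB/s figure is just what "16 MB in
     * about half a second" implies, not a measured rate. */
    #include <stdio.h>

    int main(void)
    {
        double file_len  = 100e9;       /* 100 GB file           */
        double block_len = 16e6;        /* ~16 MB block size     */
        double rate      = 32e6;        /* ~32 MB/s implied rate */

        printf("blocks        : %.0f (vs. 65536 hash slots)\n",
               file_len / block_len);
        printf("time per block: %.2f s\n", block_len / rate);
        return 0;
    }

That works out to about 6250 blocks for a 100GB file, comfortably below the
65536-slot table.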


Evan