Shachar-
 
I agree with what you are saying about trying to identify opportunities to improve the algorithm - sorry it seemed like a troll - that was certainly not my intention.  And certainly, my apologies for getting the identities mixed up!
 
As a quick FYI, the block size absolutely has an impact on the effectiveness of the checksum table - larger blocks mean fewer blocks, which means fewer hash collisions.
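(To put rough, purely illustrative numbers on it: a 1GB file cut into 4KB blocks gives about 256K blocks, while 64KB blocks give about 16K blocks.  Against a fixed 65536-bucket table, that is an average chain length of roughly 4 versus 0.25.)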
 
That said, I completely agree that for very large files the number of buckets in the current implementation is not optimal.  Perhaps having a mode for really large files would be appropriate.
 
One caution on increasing the size of the hash:  The current implementation gives 16 bits of spread, so taking that value modulo the desired bucket count works fine.  However, if you start from the 32 bit rolling hash and mod that, you will have problems.  The rolling checksum has two distinct 16 bit parts, and the modulo will only pull information from the low order bits, which will likely give you considerably less spread than even the 16 bits the current implementation provides.
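 
A quick sketch of what I mean (illustrative C, not rsync's actual source - the names and details are mine):
 
    #include <stdint.h>

    /* Illustration only: the rolling checksum is two 16 bit running
     * sums packed into one 32 bit word as (s2 << 16) | s1. */
    static uint32_t pack_rolling(uint16_t s1, uint16_t s2)
    {
        return ((uint32_t)s2 << 16) | s1;
    }

    /* With a power-of-two bucket count of 65536 or less, the modulo
     * reduces to masking the low bits, so only s1 ever selects the
     * bucket and s2 contributes nothing. */
    static uint32_t bucket_naive(uint32_t packed, uint32_t nbuckets)
    {
        return packed % nbuckets;  /* == packed & (nbuckets - 1) here */
    }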
 
I'd recommend using a real 32 bit hashing function to mix the two rolling checksum components, instead of just appending one to the other...
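 
For example (just a sketch - any decent 32 bit integer mixer would do; this one happens to be the MurmurHash3 finalizer):
 
    /* Sketch: run the packed checksum through a real mixer so that the
     * bucket index depends on both halves of the rolling checksum. */
    static uint32_t mix32(uint32_t h)
    {
        h ^= h >> 16;
        h *= 0x85ebca6b;
        h ^= h >> 13;
        h *= 0xc2b2ae35;
        h ^= h >> 16;
        return h;
    }

    /* bucket = mix32(pack_rolling(s1, s2)) % nbuckets;
     * (pack_rolling() as in the sketch above) */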
 
Just some comments that may help!
 
- K
 
 
 
Original Message
--------------------
> Kevin Day wrote:

> Shachar-
>  
> True enough - with one additional thought - if the block size is set
> to be the square root of the file size, then the load factor on the
> hash table becomes dynamic in and of itself (bigger block size = fewer
> master table entries = fewer hash collisions).

And I'm suggesting making it static, by adjusting the hash table's size
according to the number of blocks. Just do
"hashtablesize=(numblocks/8+1)*10;", and you should be set.

> In the case of relatively low bandwidth connections, you will get a MUCH
> bigger performance improvement by messing with the block size than with
> the size of the hash table, because the hash table isn't sent over the
> wire - the block table IS sent over the wire, so reducing its size can
> have a big impact on performance if your file isn't changing much.

True, but irrelevant. The hash table performance does not come at the
cost of extra bandwidth, so I see no reason not to optimize both.

> In Andrew's original thesis, he looked at several very large

...

> The problem you face, of course, is

...

Kevin, I think you are confusing a couple of things:
1. It's not me with the big files. It's Lars. I can't run tests on files
I don't have. I am merely trying to figure out what stopped Lars so that
rsync can be better.
2. The size of each block has nothing to do with the question of hash
table size. Once you've chosen the number of blocks your file will have,
however you chose it, there is an unrelated question of what hash table
size you should use. Using a 65536-bucket hash table on a 500GB file
divided into 64KB blocks (as Lars is doing) means you have, on average,
125 entries per bucket. Regardless of whether using this block size is
smart or not, rsync could handle it better. That's what I'm talking
about.

> Trying to increase the size of the hash table may just not be worth it
> - are you certain that the performance hit you are experiencing

I'm not. Lars is.

> is caused by processing on the recipient side, and not data transfer
> of the block table?  In my testing (which is actually with my own
> implementation of the algorithm, so I may have optimizations, or lack
> thereof, compared to the rsync you are running), the block table
> transfer is the biggest cause of elapsed time for big files that don't
> change much.

Don't forget that Lars manages to transfer the whole file in 17 hours. I
doubt that transferring a little information about each block takes
longer than transferring the 64KB block itself (which is Lars' block size).
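(Ballpark, assuming a few dozen bytes of checksum data per block: ~8
million blocks come to a couple of hundred megabytes, a small fraction
of the 500GB that already goes over the wire in those 17 hours.)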

> It may be that the square root of file size for block size isn't
> appropriate for files as big as you are working with...

It certainly is, as I don't work with such large files. Lars is. While
we're at it, he is not using the square root. He is using 64KB blocks.


Kevin, I'm trying to make rsync better. Lars' problem is an opportunity
to find a potential bottleneck. Trying to solve his use of possibly
non-optimal values won't help rsync, though it will help him. Let's keep
this part of the thread on its subject - is the hash table size optimal?
I don't see how modifying the block size is going to change that
question.

         Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html


--------------------
