Shachar-

I agree with what you are saying about trying to identify opportunities
to improve the algorithm - sorry it seemed like a troll - that was
certainly not my intention. And certainly, my apologies for getting the
identities mixed up!
As a quick FYI, the block size absolutely has an impact on the
effectiveness of the checksum table - a larger block size means fewer
blocks, which means fewer hash collisions. That said, however, I
completely agree that for very large files, the number of buckets in
the current implementation is not optimal. Perhaps having a mode for
really large files would be appropriate.
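(Rough numbers, just to illustrate: doubling the block size halves the
number of blocks, and with a fixed bucket count that also halves the
average number of entries per bucket.)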
One caution on increasing the size of the hash: the current
implementation gives 16 bits of spread, so modding that value with the
desired bucket count would work fine. However, if you choose to start
with the 32 bit rolling hash and mod that, you will have problems. The
rolling checksum has two distinct parts, and modding will only pull
info from the low order bits, which will likely get you considerably
less than even the 16 bits that the current implementation gives.
I'd recommend using a real 32 bit hashing function to mix the two
rolling checksum components, instead of just appending one to the
other...
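Something along these lines is what I have in mind - a rough sketch
only, where the bucket-picking helper and the multiplier constant are
just illustrations, not anything the current code actually uses:

    #include <stdint.h>

    /* s1/s2 are the two halves of the rolling checksum, each with
     * roughly 16 bits of useful spread.                             */
    static unsigned pick_bucket(uint32_t s1, uint32_t s2,
                                unsigned num_buckets)
    {
        /* Naive: append the halves and mod.  With a power-of-two
         * bucket count this keeps only the low bits, i.e. mostly s1:
         *
         *     uint32_t weak = (s2 << 16) | (s1 & 0xffff);
         *     return weak % num_buckets;
         */

        /* Better: mix the two halves first, so all 32 bits of the
         * rolling checksum contribute to the bucket choice.         */
        uint32_t mixed = (s1 & 0xffff) ^ (s2 * 0x9E3779B1u);
        mixed ^= mixed >> 16;
        return mixed % num_buckets;
    }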
Just some comments that may help!
- K
Original Message
--------------------
Kevin Day wrote:
> Shachar-
>
> True enough - with one additional thought - if the block size is set
> to be the square root of the file size, then the load factor on the
> hash table becomes dynamic in and of itself (bigger block size = less
> master table entries = fewer hash collisions).
And I'm suggesting making it static, by adjusting the hash table's
size according to the number of blocks. Just do
"hashtablesize=(numblocks/8+1)*10;", and you should be set.
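(For illustration: that formula gives roughly 1.25 buckets per block,
so the average chain length stays around 0.8 no matter how the block
count was chosen - e.g. the ~8 million blocks in Lars' case would get
a table of about 10 million buckets instead of 65536.)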
> In the case of relatively low bandwidth connections, you will get
> MUCH better performance improvement by messing with the block size
> than the size of the hash table, because the hash table isn't sent
> over the wire - the block table IS sent over the wire, so reducing
> its size can have a big impact on performance if your file isn't
> changing much.
True, but irrelevant. The hash table performance does not come at the
cost of extra bandwidth, so I see no reason not to optimize both.
> In Andrew's original thesis, he looked at several very large
...
> The problem you face, of course, is
...
Kevin,

I think you are confusing a couple of things:

1. It's not me with the big files. It's Lars. I can't run tests on
files I don't have. I am merely trying to figure out what stopped Lars
so that rsync can be better.

2. The size of each block has nothing to do with the question of hash
table size. Once you've chosen the number of blocks your file will
have, in whatever way you did, there is an unrelated question of what
hash table size you should use. Using a 65536-bucket hash table on a
500GB file divided into 64KB blocks (as Lars is using) means you have,
on average, 125 collisions per bucket. Regardless of the question of
whether using this size for blocks is smart or not, rsync could handle
it better. That's what I'm talking about.
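(The arithmetic: 500GB at 64KB per block is roughly 8.2 million
blocks, and 8.2 million blocks spread over 65536 buckets works out to
about 125 entries per bucket.)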
> Trying to increase the size of the hash table may just not be worth
> it - are you certain that the performance hit you are experiencing
I'm not. Lars is.
> is caused by processing on the recipient side, and not data transfer
> of the block table? In my testing (which is actually with my own
> implementation of the algorithm, so I may have optimizations, or lack
> thereof, compared to the rsync you are running), the block table
> transfer is the biggest cause of elapsed time for big files that
> don't change much.
Don't forget that Lars manages to transfer the whole file in 17 hours.
I doubt that transferring some information about each block takes
longer than transferring the 64K of the block itself (as in Lars'
case).
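(Even assuming a few tens of bytes of checksum data per block - a
rough guess, not the exact wire format - the block table comes to well
under 0.1% of the 64KB blocks it describes.)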
> It may be that the square root of file size for block size isn't
> appropriate for files as big as you are working with...
It certainly is, as I don't work with such large files. Lars is.
While we're at it, he is not using square root. He is using 64KB
blocks.
Kevin, I'm trying to make rsync better. Lars' problem is an
opportunity to find a potential bottleneck. Trying to solve his use of
possibly non-optimal values won't help rsync, though it will help him.
Let's keep this part of the thread on its subject - is the hash table
size optimal? I don't see how modifying the block sizes is going to
change this question.
Shachar
--
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html