Hi, all - 

I've been digging into the question of using rsync to sync up compressed data, 
and read with interest the conversations regarding the rsync friendly patch 
available for gzip (I belive Rusty provided this patch).  For anyone 
interested, the message thread is archived at 
http://lists.samba.org/archive/rsync/2002-June/thread.html.  The broad outline 
of this approach were mentioned in Andrew Tridgell's PhD thesis, and explained 
in more depth in a transcript of a talk he gave.

My initial thought on reading this was "no way this will work" - then I dug in, 
and convinced myself that it would indeed work.  After testing Rusty's patch, I 
am extremely convinced that the approach works - very elegent.

At that point, I started to consider the question raised in the discussion 
threads by Kevin Easton - primarily:  are the checksum and compression block 
trigger selected by Rusty optimal?

To refresh memories:  Rusty's checksum algorithm is to compute the sum of the 
previous 4096 bytes of source data, and his block trigger kicks in when the 
checksum is divisible by 4096 (i.e. SUM mod 4096 == 0).

Because the value of 4096 prior bytes seemed fairly arbitrary, and because 
modding by the same value as the window size seemed somewhat arbitrary, I 
decided to do some parametric tests with a couple of different checksum 
strategies, window sizes and block trigger mask sizes.

The following is pretty long - I apologize for making my first post on the 
topic so extensive - but the results are interesting, and I figured I'd lay out 
everything that I have so far to see if anyone a) is interested or b) has any 
ideas of directions to pursue...

Test Description
==========

Here are some definitions:

SUM = the value of the rolling checksum at a given point in the input stream
Window size = the number of bytes in the input stream that are considered for 
the rolling check sum
Block mask size = The size of value used in checking the current SUM to see if 
the current compression block should be wrapped up
Compression block size = The number of bytes from the input stream between 
checksum triggers

I decided to test using two forms of rolling checksum:

1.  Rusty's strategy of just adding the values from the input stream
2.  Andrew's rolling checksum that is used in the rsync algorithm itself


For starters (I'm still working on processing tests on a larger data set), I 
chose a 28 MB database file (dBase file) - fixed record size, lot's of 
compression opportunities.

I then constructed a parametric study to see the impact of window size, block 
mask size and algorithm based on the database file as an input file.  The 
output of the tests is the average compression block size, and the standard 
deviation of the compression block size.

For the initial tests here (more follow!), this is run in a test harness that 
just computes effective block sizes based on the two checksum algorithms - it 
is NOT running the patched zlib/gzip code directly (I do that testing later 
on).  The results should help us understand the dependencies between different 
parameters, which will help determine how to pursue future testing.

My thinking here is as follows:

Average compressed block size
------------------------------------
The average compression block size is going to pretty much dictate the overall 
effectiveness of the zlib/gzip underlying compression algorithm.  Ideally, we 
want to have a control that we can tune to adjust this value so we get decent 
performance from zlib, but not have blocks that are too big (which would reduce 
the effectiveness of the rsync operation itself).  The zlib algorithm's 
performance is non-linear with respect to the compression block size, so 
finding this balance point is particularly important from an optimization 
perspective.  In general, I find that block sizes around 4000 - 5000 seem to be 
about optimal - anything below that, and the compression drops off.  Anything 
above, and the compression doesn't improve much.

Std deviation of compressed block size
--------------------------------------------
I initially thought that the standard deviation of the compression block size 
would be important because of the non-linear behavior of the zlib compression 
algorithm with respect to block size.  If our standard deviation is too high, 
then we might have a great average block size, but most blocks will be 
significantly above or below the average, so the effective performance of both 
the compression and rsync pieces of the problem will be far from optimal.

Another way of putting this:  If I say that there isn't a good deal of 
improvement in compression ratio for compression block sizes above 4000 bytes, 
then I would optimally choose my checksum algorithm and parameters so that 4000 
= (Xavg - Xsigma).  If I have a small Xsigma, that allows me to have a smaller 
Xavg.  And smaller Xavg *should* translate into better rsync performance.  I 
have NOT tested this yet, but it is certainly an obvious next step.



There is one more consideration:  If a small change is made to a file, what are 
the chances of that change messing up 2 checksum blocks instead of 1?  In 
straghtforward rsync, messing up one or two blocks is not that big of a deal.  
Once the data is compressed, I'm thinking that it would tend to be amplified.  
I haven't done any statistics on this, and I could be compeltely off base - but 
my initial intuition is that having a smaller window size would lead to better 
ultimate rsync performance...  I would be extremely interested in anyone else's 
thoughts on this point...


Results
=====

So - on to the testing:

For those of you who like making pretty graphs (actually, graphing this really 
helps to hilight the trends), here are some numbers from my tests (I apologize 
in advance for the formatting on these tables - fighting with formatting in an 
ASCII file is no fun - please bear with me!):

The following data is for window size = 1298

[Block Mask Size] [Avg Block Size (R )] [Std Dev Block Size (R)] [Ratio (R )] 
[Avg Block Size (S )] [Std Dev Block Size (S)] [Ratio (S)]
1024 2378 1173 0.49 2551 1393 0.55
1524 2767 1644 0.59 3049 1946 0.64
2024 3163 2039 0.64 3708 2639 0.71
2524 3783 2680 0.71 4354 3219 0.74
3024 4226 3151 0.75 4950 3928 0.79
3524 4535 3464 0.76 5598 4594 0.82
4024 5030 3824 0.76 6257 5190 0.83


Discussion of results
=============
My initial results (though far from complete), show some interesting things:

1.  For small window sizes (< 1000 bytes), the behavior of the simple addition 
checksum is quite nasty.  The block size standard deviation is huge and the 
average block size is all over the place.  As the window size grows above ~1000 
bytes, things settly out and behave themselves much better.

In contrast, the rsync rolling checksum was very well behaved, even down to 
window sizes of 25.  The standard deviations for the rsync rolling checksum are 
a bit high (150% of the average block size), but taper off to a constant value 
of ~100% of the average block size when the window is > 100 bytes.

For the rest of the analysis, I've selected 1298 as the window size, although I 
could have used a value as low as 100 for the RSync checksum and still gotten 
the same results.

2.  Once you have sufficient window size, it stops becoming a parameter for 
consideration, and the Block Mask Size becomes the "dial" that allows you to 
control both the average compression block size and the standard deviation.  
The correlation between the two is pretty much linear, with the RSync style 
checksum having a more controlled result.

3.  In general, the Rsync checksum had a lower average block size and a lower 
block size standard deviation for a given block mask.  The overall trend of the 
ratio between the average and standard deviation is pretty much independent of 
algorithm selected, but there is an approx 5-10% increase in standard deviation 
as a percentage of average block size for the simple chechsum calculation 
relative to the rsync checksum.

4.  The large scale behavior of the ratio between average and standard 
deviation is definitely dependent on Block Mask Size.  Higher mask values 
result in larger standard deviation - but is non-linear, and appears to taper 
off to about 85% for the simple checksum and 75% for the rsync checksum.

Given that we want to lower the standard deviation, this result implies that 
the block mask size should be selected to be as low as possible.

5.  Earlier, I mentioned that the value that probably impacts the performance 
of the compression algorithm the most is going to be (Xavg - Xstddev).  If we 
look at that value as a function of BlockMaskSize, we find something quite 
interesting - there is *almost* no dependency on Block Mask Size!  The effects 
of the increase in standard deviation pretty much wipe out any advantage of the 
increased average.  For example, for the RSync checksum, varying the block mask 
size from 1024 to 4024 causes the (Xavg - Xstddev) value to vary between 1200 
to 1070 - not much!

If we look at (Xavg - Xstddev) as a function of window size, we find that there 
is no dependency at all once we go above some value of window size (the value 
of that cutoff is dependent on the checksum algorithm selected)

The conclusion here (it turns out that this conclusion is wrong - see below) is 
that changing the window size and block mask size probably will not have a 
significant effect on the performance of the compression algorithm, as long as 
the window size is above the point where the results are statistically 
unpredictable.  For the RSync checksum, this point is approximately 100 bytes.  
For the simple checksum, this point is approximately 500 bytes.  

Here's the data crunched to show this (values marked #VALUE! are where the std 
deviation was too big to calculate using the 64 bit integer math in my test 
application):

[WinSize] [Xavg-Xstdev (R )] [Xavg-Xstdev (S )]
25 791.3509857 #VALUE!
30 854.7655263 #VALUE!
36 847.0310926 #VALUE!
43 963.7640702 #VALUE!
51 1013.541078 #VALUE!
61 997.480154 1301.475001
73 1036.707541 1624.38203
87 1068.007719 #VALUE!
104 1082.094327 #VALUE!
124 1113.082926 1084.712182
148 1113.28381 1137.304064
177 1106.720093 2215.686195
212 1116.509118 962.0990605
254 1113.412682 #VALUE!
304 1132.536022 1019.862558
364 1118.921432 991.9188742
436 1104.036368 1062.314435
523 1100.639225 1098.862742
627 1112.34673 1024.872196
752 1127.374869 1104.576236
902 1107.316805 1093.34639
1082 1078.665537 1104.471305
1298 1129.472404 1079.824598
1557 1106.018974 1060.80879
1868 1117.813407 1108.882505
2241 1120.976969 1101.694318
2689 1114.43462 1143.649274
3226 1106.447702 1121.075456
3871 1122.427955 1076.250929



This leads to one obvious test:  Try Rusty's algorithm, and vary the block size 
and window size , and see how the compression ratio is effected.  If the 
conclusion above is correct, then messing with the window size should have no 
impact at all, and messing with the block size should actually hurt the 
compression ratio a little bit with increasing block size.

I was quite surprised to find that the results of that test did not line up 
with my expectations.  Actually running the data through the zlib compression 
algorithm (adjusted to use the simple checksum calculation and allow for 
varying window size and block size independently), gives the following results:

1.  As expected, the compression ratio is independent of the window size, as 
long as the window size is above the minimum critical size for the the checksum 
algorithm.

2.  However, the compression ratio IS dependent on block mask size.  If we 
choose a window size of 450 (anything above 350 will do), we see that the 
compression ratio varies from 4.5 to 4.9 as the mask size increases from 1000 
to 4000.  Masks above 4000 do not appear to have much of an impact on effective 
compression ratio.  I suspect that this test result is what led Rusty to choose 
the window/block size of 4096 in his patch.

The reason in the difference of #2 with my original expectations can probably 
be attributed to my lack of understanding of exactly how the zlib algorithms 
work.




RSync testing
========

Given the above results, I thought it would be interesting to test my 
hypothesis about the effect of window size on the efficiency of the rsync 
alogorithm.

I ran a suite of tests as follows:

1.  Original data used is the same dBase database file as for the prior tests
2.  Modified data is the original data with some miscellaneous changes 
throughout.  To give a feel for the extent of the changes, running the rsync 
algorithm against the uncompressed data in these two files with an RSBlock size 
of 1000 gives a speedup of approximately 4000:1 - so not a huge amount of 
changes.

I then ran a series of tests where I compute the speed up for a set of 
different rsync block sizes and window sizes.  In my initial testing, I 
adjusted the window size and block mask size in lock step (i.e. using Rusty's 
original algorithm).


Results
=====
I won't spend a lot of time looking at the effect of rsync block size on the 
speed up - that has been studied pretty well already, but just so we have the 
data points, here are the results assuming a fixed compression window size of 
4000 (which is pretty close to Rusty's rsyncable patch implementation):

[RS Block Size] [Speed Up]
500 194.49103
1000 179.82353
1500 164.90042
2000 169.59175
2500 154.23415


For the rest of this analysis, I'm going to set the rsync block size to 1000 
and leave it there.


If we look at RSync performance as a function of window size, we find, as 
expected, that the rsync performance decreases.  The curve isn't smooth (the 
compression algorithm causes all sorts of havoc here), but the general trend is 
clear:  Increasing the compressed window size and block mask size = decreased 
rsync performance.  What isn't clear (because I varied the window size and 
block mask size together), is which of the two actually is the driving force 
(window size or block mask size).  My initial results with fiddling with the 
compression performance indicate that it should be the block mask size that 
matters, and the window size shouldn't matter.

Here are the test results:

[Window AND block mask Size] [Speed Up]
100 335.6063
600 335.56348
1100 264.708
1600 273.0129
2100 220.42537
2600 226.33165
3000 219.47977
3500 225.90854
4000 179.82353
4500 195.95692
5000 230.86983



My initial thoughts here are:
1.  If the block mask size is mostly responsible for determining the 
performance of the zlib compression algorithm and if the window size is mostly 
responsible for determining the performance of the rsync algorithm, then we may 
have an opportunity to optimize the performance of the current --rsyncable 
patch.  The test results above imply that we could be looking at a 25% 
improvement in rsync performance without impacting compression or significantly 
adjusting Rusty's algorithm.  I think that the test results so far tend to 
support this as a possibility.

2.  Further, if we can get really small window sizes by switching to the 
original rsync rolling checksum (instead of using the simple checksum in the 
current patch), then we *might* be able to achieve even better optimization, 
without adding a lot of computation overhead (the overhead of the rsync 
algorithm compared to the simple computation used now should be pretty low 
compared to the rest of what's going on in the zlib computation).  

Anyway - that's where I'm headed with this.  I'll have results to indicate 
whether #1 is worth pursuing (i.e. testing with many other files) tomorrow.  #2 
will require a bit more coding to the --rsyncable patch, but it should be 
relatively simple to do.



That's all I have for now - I'm running tests this evening to independently 
adjust the window and block mask size, and to test a different data file just 
to make sure I'm not way off-base (i.e. make sure the trends indicated by my 
results are not dependent on the input file).

I'd love to hear any thoughts on the matter!

- Kevin
--
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Reply via email to