On Sat, Aug 02, 2003 at 08:57:35PM -0400, David Walser wrote:
> It's true that things could better. But some packages are more the
> same than just the header. I've seen large parts of packages saved
> from being downloaded. Believe me, though it's not ideal for rsync,
> rpmsync still saves a significant amount of bandwidth over straight
> rsync.
Even if the contents are the same, they are compressed. Because they
are compressed, even a small difference in the input changes the
compressed stream from that point on, which makes it difficult for
rsync to find any matching blocks. The document I already provided
explains that.
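To illustrate the point, here is a minimal sketch in Python. It is not
rsync itself: matched_fraction() is a naive stand-in for rsync's block
matching, and the 700 byte block size, the generated sample data and
the use of zlib instead of an actual rpm payload are all assumptions
made for the demonstration. It inserts one short string near the start
of some text-like data and compares how much of the new data can be
found in the old data, before and after compression.

```python
import random
import zlib

BLOCK = 700  # block size in bytes; an arbitrary choice for this sketch


def matched_fraction(old: bytes, new: bytes, block: int = BLOCK) -> float:
    """Fraction of `new` covered by fixed-size blocks that occur in `old`.

    `old` is cut into blocks at multiples of `block`; `new` is scanned at
    every byte offset for any of those blocks (a naive imitation of what
    rsync does with its rolling checksum).
    """
    old_blocks = {old[i:i + block]
                  for i in range(0, len(old) - block + 1, block)}
    matched = 0
    pos = 0
    while pos + block <= len(new):
        if new[pos:pos + block] in old_blocks:
            matched += block
            pos += block  # jump past the matched block
        else:
            pos += 1      # slide one byte and try again
    return matched / len(new)


# Build roughly 100 KB of repetitive, text-like data from a fixed seed.
rng = random.Random(0)
words = [b"header", b"archive", b"rsync", b"package", b"mirror",
         b"update", b"changelog", b"release", b"binary", b"source"]
old_plain = b" ".join(rng.choice(words) + str(rng.randrange(1000)).encode()
                      for _ in range(10000))

# One small insertion near the start, like a new changelog entry.
new_plain = (old_plain[:1000]
             + b" NEW CHANGELOG ENTRY: FIXED A BUG "
             + old_plain[1000:])

old_packed = zlib.compress(old_plain)
new_packed = zlib.compress(new_plain)

plain_ratio = matched_fraction(old_plain, new_plain)
packed_ratio = matched_fraction(old_packed, new_packed)

print(f"uncompressed: {plain_ratio:.1%} of the new data matches the old")
print(f"compressed:   {packed_ratio:.1%} of the new data matches the old")
```

The uncompressed data should match almost completely, while the
compressed data should match far less, which is the situation rsync is
in with the archive part of an rpm.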
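If you transferred every binary package in the main tree, the most you
could save is the size of the headers (measured here as the size of the
uncompressed hdlist), which is 60MB:
echo $(( `packdrake --cat hdlist.cz | wc -c` / 1024 / 1024))
For contrib it is 43MB:
echo $(( `packdrake --cat hdlist2.cz | wc -c` / 1024 / 1024))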
Of course this assumes you're only carrying i586 and aren't carrying
the SRPMS. It sounds like a lot until you realize you're not moving
most of those files, and even then some of that header information
changes every time.
Let's look at a real world example. Updating
gzip-1.2.4a-11.2mdk.i586.rpm to gzip-1.2.4a-12mdk.i586.rpm. I chose
this package because everyone will have access to the source files.
11.2mdk is in updates for 9.1, 12mdk is in cooker.
First we download the original:
>>>>
[EMAIL PROTECTED] root]# rsync -P --stats mirror::Mandrake/updates/9.1/RPMS/gzip-1.2.4a-11.2mdk.i586.rpm .
67560 100% 32.22MB/s 0:00:00
rsync[329] (receiver) heap statistics:
arena: 41624 (bytes from sbrk)
ordblks: 3 (chunks not in use)
smblks: 0
hblks: 0 (chunks from mmap)
hblkhd: 0 (bytes from mmap)
usmblks: 0
fsmblks: 0
uordblks: 40120 (bytes used)
fordblks: 1504 (bytes free)
keepcost: 1464 (bytes in releasable chunk)
Number of files: 1
Number of files transferred: 1
Total file size: 67560 bytes
Total transferred file size: 67560 bytes
Literal data: 67560 bytes
Matched data: 0 bytes
File list size: 47
Total bytes written: 135
Total bytes read: 67691
wrote 135 bytes read 67691 bytes 135652.00 bytes/sec
total size is 67560 speedup is 1.00
<<<<
The speedup is 1.00, meaning the archive was fully transferred.
Now let's move the file to the 12mdk file name and fetch it:
>>>>
[EMAIL PROTECTED] root]# mv gzip-1.2.4a-11.2mdk.i586.rpm gzip-1.2.4a-12mdk.i586.rpm
[EMAIL PROTECTED] root]# rsync -P --stats mirror::Mandrake-devel/cooker/i586/Mandrake/RPMS/gzip-1.2.4a-12mdk.i586.rpm .
67668 100% 21.51MB/s 0:00:00
rsync[1265] (receiver) heap statistics:
arena: 111256 (bytes from sbrk)
ordblks: 3 (chunks not in use)
smblks: 1
hblks: 0 (chunks from mmap)
hblkhd: 0 (bytes from mmap)
usmblks: 0
fsmblks: 48
uordblks: 40120 (bytes used)
fordblks: 71136 (bytes free)
keepcost: 71040 (bytes in releasable chunk)
Number of files: 1
Number of files transferred: 1
Total file size: 67668 bytes
Total transferred file size: 67668 bytes
Literal data: 66268 bytes
Matched data: 1400 bytes
File list size: 45
Total bytes written: 736
Total bytes read: 66409
wrote 736 bytes read 66409 bytes 134290.00 bytes/sec
total size is 67668 speedup is 1.01
<<<<
The speedup is 1.01, with 1400 bytes matched that were not transferred.
So all we saved on that transfer was 1.4k, about 2% of the file.
However, it should be noted that gzip is a relatively small package, so
its header represents a relatively high percentage of the file compared
to the archive data. Even so, once rsync's own overhead is counted we
only read 1259 bytes fewer than the full file size (67668 - 66409),
which comes out to about 1.9% savings.
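The arithmetic can be redone from the --stats output. The sketch below
is only a calculator for the numbers printed above; the function name
rsync_savings and its field names are made up for illustration, and the
speedup formula (total size divided by bytes written plus bytes read)
is my understanding of how rsync derives that figure.

```python
def rsync_savings(total_size, matched, bytes_read, bytes_written):
    """Summarize one rsync --stats run (all arguments are byte counts)."""
    on_the_wire = bytes_read + bytes_written
    return {
        # share of the file rsync found locally
        "matched_pct": 100 * matched / total_size,
        # download direction only, as in the figure quoted above
        "download_saved": total_size - bytes_read,
        "download_saved_pct": 100 * (total_size - bytes_read) / total_size,
        # counting traffic in both directions
        "net_saved": total_size - on_the_wire,
        "speedup": total_size / on_the_wire,
    }


# Numbers from the gzip-1.2.4a-12mdk transfer above.
gzip_stats = rsync_savings(total_size=67668, matched=1400,
                           bytes_read=66409, bytes_written=736)

for name, value in gzip_stats.items():
    print(f"{name}: {value:.2f}")
```

Note that once the 736 bytes written (mostly block checksums sent to
the server) are counted as well, the net saving drops to 523 bytes.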
Using a perl script I wrote, I extract the archive from the rpm file:
[EMAIL PROTECTED] root]# perl getarchive.pl gzip-1.2.4a-12mdk.i586.rpm
[EMAIL PROTECTED] root]# ls -l plain
-rw-r--r-- 1 root root 60018 Aug 2 19:38 plain
[EMAIL PROTECTED] root]# file plain
plain: gzip compressed data, from Unix
The file plain is the archive. Subtracting the plain file size from the
rpm size (67668 - 60018) we see that the header was 7650 bytes long.
But we were only able to avoid transferring 1400 bytes. Why is there a
discrepancy? A couple of reasons. The header stores the GPG, MD5 and
SIZE signatures; these will almost always be unique. However, the
biggest reason is the nature of the RPM header format. There is an
index which specifies the type, size and position of various pieces of
data within the header. However, the ordering of this index and the
ordering of the data in the storage area of the header are not
guaranteed to be the same. Moving these data pieces around makes rsync
unlikely to be able to locate the matching data.
As noted, gzip is a relatively small package, so the header takes up a
larger percentage of its content than it would in a larger package.
So let's try a larger package, apache2.
>>>>
[EMAIL PROTECTED] root]# rsync -P --stats mirror::Mandrake/updates/9.1/RPMS/apache2-2.0.47-1.1mdk.i586.rpm .
179550 100% 28.54MB/s 0:00:00
rsync[4862] (receiver) heap statistics:
arena: 41624 (bytes from sbrk)
ordblks: 3 (chunks not in use)
smblks: 0
hblks: 0 (chunks from mmap)
hblkhd: 0 (bytes from mmap)
usmblks: 0
fsmblks: 0
uordblks: 40120 (bytes used)
fordblks: 1504 (bytes free)
keepcost: 1464 (bytes in releasable chunk)
Number of files: 1
Number of files transferred: 1
Total file size: 179550 bytes
Total transferred file size: 179550 bytes
Literal data: 179550 bytes
Matched data: 0 bytes
File list size: 49
Total bytes written: 137
Total bytes read: 179695
wrote 137 bytes read 179695 bytes 359664.00 bytes/sec
total size is 179550 speedup is 1.00
[EMAIL PROTECTED] root]# mv apache2-2.0.47-1.1mdk.i586.rpm apache2-2.0.47-4mdk.i586.rpm
[EMAIL PROTECTED] root]# rsync -P --stats mirror::Mandrake-devel/cooker/i586/Mandrake/RPMS/apache2-2.0.47-4mdk.i586.rpm .
181140 100% 7.85MB/s 0:00:00
rsync[31168] (receiver) heap statistics:
arena: 41624 (bytes from sbrk)
ordblks: 3 (chunks not in use)
smblks: 1
hblks: 0 (chunks from mmap)
hblkhd: 0 (bytes from mmap)
usmblks: 0
fsmblks: 48
uordblks: 40136 (bytes used)
fordblks: 1488 (bytes free)
keepcost: 1400 (bytes in releasable chunk)
Number of files: 1
Number of files transferred: 1
Total file size: 181140 bytes
Total transferred file size: 181140 bytes
Literal data: 166440 bytes
Matched data: 14700 bytes
File list size: 47
Total bytes written: 1698
Total bytes read: 166671
wrote 1698 bytes read 166671 bytes 336738.00 bytes/sec
total size is 181140 speedup is 1.08
[EMAIL PROTECTED] root]# perl getarchive.pl apache2-2.0.47-4mdk.i586.rpm
[EMAIL PROTECTED] root]# file plain
plain: gzip compressed data, from Unix
[EMAIL PROTECTED] root]# ls -l plain
-rw-r--r-- 1 root root 158157 Aug 2 19:53 plain
<<<<
This time we got slightly better results, but we still only saved 14700
bytes of matched data from being sent, about 8% of the file. The
header was 22983 bytes in this case (181140 - 158157). The reason it
was slightly better is probably the size of the changelog for apache2,
which is fairly large.
So let's adjust our estimates down from above, where I estimated based
on the hdlist size. Let's guess we can save transferring about 50% of
the headers. That means that if every package in main got changed we'd
save about 30MB, and for contrib about 22MB.
But not everything in the tree gets updated on a daily basis. A rather
generous estimate is that 20% of the packages get updated daily, which
means 6MB for main and 4.4MB for contrib per day.
All the RPMS from the main tree come to 2.3GB; contrib, 2.4GB. Again
assuming roughly 20% is updated on a daily basis, that comes out to a
total transfer of 471MB and 491MB respectively. 6MB/471MB = 1.3%
savings, 4.4MB/491MB = 0.9%. I don't call that a whole lot. Assuming a
500kbit/s connection you'd save roughly 100 seconds for main and 70
seconds for contrib.
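The estimate can be written out as a short Python calculation. All the
inputs are the rough figures from this message; the constant and
function names are made up, and reading the link speed as 500 kilobits
per second is an assumption. The results land close to the figures
quoted above; small differences come from rounding.

```python
HEADER_MB = {"main": 60, "contrib": 43}   # uncompressed hdlist sizes
TREE_GB = {"main": 2.3, "contrib": 2.4}   # total size of the RPMS
HEADER_REUSE = 0.5    # guess: share of header bytes rsync can match
DAILY_CHURN = 0.2     # guess: share of packages replaced per day
LINK_KBIT_S = 500     # assumption: link speed in kilobits per second


def daily_estimate(tree: str):
    """Return (MB saved, MB transferred, percent saved, seconds saved)."""
    saved_mb = HEADER_MB[tree] * HEADER_REUSE * DAILY_CHURN
    transfer_mb = TREE_GB[tree] * 1024 * DAILY_CHURN
    percent = 100 * saved_mb / transfer_mb
    seconds = saved_mb * 1024 * 8 / LINK_KBIT_S   # MB -> kilobits
    return saved_mb, transfer_mb, percent, seconds


for tree in ("main", "contrib"):
    saved, transfer, percent, seconds = daily_estimate(tree)
    print(f"{tree}: save {saved:.1f}MB of {transfer:.0f}MB "
          f"({percent:.1f}%), about {seconds:.0f} seconds")
```

Changing HEADER_REUSE or DAILY_CHURN shows how sensitive the result is
to the two guesses; even doubling both leaves the saving at a few
percent of the daily transfer.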
And I'm still making the assumption that every replaced package is
getting renamed and that 50% of headers is being saved. Which is
probably really generous.
> Another way it saves bandwidth is when packages are moved between
> sources, like contrib and main. rpmsync will mv them. rsync would
> delete them from one location, and download them cold to the other.
This would save some bandwidth. But honestly, how often does this
really happen? What I see is a whole lot of packages being deleted and
then eventually new ones being uploaded (e.g. library major changes).
If you synced in between, you'd end up pulling the whole new file later
anyway.
I might think it was worth it if I was on dialup. But all in all I
wouldn't call this a lot of bandwidth.
--
Ben Reser <[EMAIL PROTECTED]>
http://ben.reser.org
"What upsets me is not that you lied to me, but that from now on I can
no longer believe you." -- Nietzsche