I've just committed changes to pkg_create that will help mirrors
sync using much less bandwidth.

I just ran a final test. Rsyncing a full amd64 snapshot now reports
something like:
sent 7,315,796,510 bytes  received 40,292,721 bytes  4,517,095.01 bytes/sec
total size is 28,752,806,019  speedup is 3.91
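
For readers who want to check the numbers: rsync's "speedup" is the total
size divided by the bytes actually sent and received. A minimal sketch of
that arithmetic, using only the figures quoted above (the variable names
are mine):

```python
# Figures copied from the rsync summary quoted above.
sent = 7_315_796_510
received = 40_292_721
total_size = 28_752_806_019

# rsync reports speedup as total size over bytes that crossed the wire.
transferred = sent + received
speedup = total_size / transferred
fraction = transferred / total_size

print(f"speedup is {speedup:.2f}")        # speedup is 3.91
print(f"transferred {fraction:.1%} of the snapshot")
```

In other words, roughly a quarter of the snapshot went over the wire.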

A few months ago, after the "reorder files in packages" change, Stuart
Henderson commented that it would help only the end user, not the
mirrors, which got me thinking...

(Reminder: archives are compressed files. rsync does not peek inside the
compressed data, so its delta algorithm does not work well on them: the
first differing byte changes everything in the rest of the archive, so
there is no speed-up for compressed files.)
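
A small sketch of that effect, assuming Python's zlib module as a stand-in
for the gzip compression used on packages. The sample data and the helper
name are made up for illustration; the point is only that a one-byte
change near the start leaves very little of the compressed stream
byte-identical.

```python
import random
import zlib


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Count how many leading bytes two byte strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Made-up, compressible sample data (not a real package).
rng = random.Random(0)
words = [b"bin", b"lib", b"share", b"man", b"include", b"etc"]
old = b" ".join(rng.choice(words) for _ in range(20000))

# Same data with a single byte changed near the beginning.
changed = bytearray(old)
changed[10] ^= 0x01
new = bytes(changed)

old_z = zlib.compress(old)
new_z = zlib.compress(new)

print("compressed sizes:", len(old_z), len(new_z))
print("identical leading bytes:", common_prefix_len(old_z, new_z))
```

Everything after the point where the two streams diverge is data rsync
typically cannot match against the old file.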


I looked at the --rsyncable patch for zlib/gzip, and talked it over
with sthen@ and millert@, but pretty soon we discarded that idea.
That patch is brittle (every zlib version has got its own flavor of it,
with wild differences) and a nightmare to maintain. Plus it won't work
at all with other compression formats.

The solution was low-tech: simply cut the archive into more gzip chunks
(signatures already split the package into two parts, so we know the
tools work).  I chose 16 files as a simple guideline to experiment with.
There were still some discrepancies, such as tar timestamp metadata, which
is why those moved to the plist a few weeks ago (side-effect: the tarball
effectively says everything dates back to the epoch... not so bad).
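
To illustrate the idea (this is not the pkg_create code), here is a sketch
with Python's gzip module. It assumes an even byte split into 16 parts,
which is my simplification; the function name and sample data are
hypothetical. Each part is compressed as an independent gzip member with
its timestamp zeroed, so a change in one part leaves the others
byte-identical, and the concatenation is still a valid gzip stream.

```python
import gzip
from typing import List


def gzip_in_chunks(data: bytes, parts: int = 16) -> List[bytes]:
    """Compress data as independent gzip members (hypothetical helper)."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [
        # mtime=0 keeps the gzip header free of a changing timestamp.
        gzip.compress(data[i:i + size], mtime=0)
        for i in range(0, len(data), size)
    ]


# Made-up payload standing in for a package tarball.
old = bytes(range(256)) * 4096
changed = bytearray(old)
changed[5] ^= 0xFF  # one byte differs, inside the first part
new = bytes(changed)

old_parts = gzip_in_chunks(old)
new_parts = gzip_in_chunks(new)

# Concatenated gzip members decompress to the original data.
assert gzip.decompress(b"".join(new_parts)) == new

same = sum(a == b for a, b in zip(old_parts, new_parts))
print(f"{same} of {len(old_parts)} compressed parts are byte-identical")
```

Those unchanged parts are exactly what rsync's block matching can reuse
from the previous snapshot.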

I was pleasantly surprised: the size increase is minimal (well under 1%).

I also whacked the gzip timestamps, which don't serve any useful purpose
either, especially since the plist signature also contains a timestamp (and
that one is signed, so it's way more trustworthy).

Obviously, the first snapshot out will still copy everything. But from the
second one on, mirror owners should see a difference.

To benefit:
- mirror owners must now use rsync's delta-transfer algorithm. Turn off
-W / --whole-file if you were using it.
- turn on -y / --fuzzy, as this will "track" minor package version changes.
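
As a rough sketch of what a mirror script could run, here is a small
Python helper that builds such an rsync command line. The host name and
paths are placeholders, not real mirror addresses, and --delete-after is
my own addition: the rsync manual notes that plain --delete may remove
the old files --fuzzy would otherwise match against.

```python
from typing import List


def build_rsync_command(source: str, dest: str) -> List[str]:
    """Build an rsync argv for a package mirror (hypothetical helper)."""
    return [
        "rsync",
        "-a",              # archive mode
        "--fuzzy",         # same as -y: look for similar old files as a basis
        "--delete-after",  # assumption: delete only after the transfer
        # No -W / --whole-file here, so the delta algorithm stays enabled.
        source,
        dest,
    ]


# Placeholder source and destination; adjust to your own mirror setup.
cmd = build_rsync_command(
    "rsync://mirror.example.org/OpenBSD/snapshots/packages/amd64/",
    "/var/www/pub/OpenBSD/snapshots/packages/amd64/",
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) or
simply typed out in a shell script.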

Note that this only applies to the "package snapshots" part of OpenBSD.

My test was a bit extreme: I built two snaps with the exact same ports
tree, so the similarities are maximal. Nevertheless, there are lots of
*huge* packages in the ports tree.  So I expect the bandwidth gain to
be very significant anyway, especially for fast architectures which turn
out one snapshot a week or more. I expect bandwidth use to be more than
halved.
