Robin H. Johnson ([email protected]) wrote on Fri, Jan 29, 2016 at 03:45:13AM BRST:
> On Thu, Jan 28, 2016 at 08:42:33PM -0200, Carlos Carvalho wrote:
> > The issue is not calculating checksums, it's I/O. The gentoo repository is
> > now 335GB. It's out of the question to read it all at every update.
> > And you should know it!
> The primary module I care about is gentoo-portage (historically
> gentoo-x86), which is under 400MiB raw (~207k inodes however, so watch
> your inode space waste).  If your mirror DOESN'T have that entire rsync
> module sitting in cache already, you probably have fairly low traffic.

No, a big mirror like ours has tens of repositories and more than 10 million
inodes. Gentoo is a small part of it.

> Hashing that much _was_ a CPU hit a decade ago, and most people did NOT
> have the memory to fit it all into cache either.

For mirroring many repositories the main problem has always been disk I/O,
particularly inodes. We know this because we have been mirroring for about a
decade already.

> The other issues of mtimes of Manifests esp on ebuild removal have been
> resolved by the various series of patches, but part of those were
> artificially bumping the mtime by a single second in certain cases.
> 
> Those cases are STILL going to have a bumpy time in some situations
> because Git's commit time resolution is only 1 second (ditto many
> filesystems). Those situations either need sub-second resolution in the
> entire ecosystem or checksums of some form.
> 
> We've had ~27k commits in the last 6 months, of which:
> - 4811 colliding (commit timestamp only)
> - 45 colliding (author timestamp only)
> - 12 colliding (commit timestamp, author timestamp)
> 
> We're using author timestamps on the outgoing rsync files, and
> eventually we ARE going to have a real collision that hits users.
> 
> This didn't happen with CVS because even if you were really fast (read:
> local to the CVS server, and did not go via SSH), you could only get it
> down to about 2 seconds per commit, and never in the same package. Two
> different devs could never make a commit to the same package in the same
> second.

You'll have to deal with it at the repository building stage.

> > Also, we block --checksum from clients. Most big mirrors do it.
> The rsync cached checksums patch needs to get popular again, because
> then the mirrors won't have any huge burden at all:
> - update the checksums when syncing from the parent repo
> - compare against the checksums when queried by the client

It'd be really nice yes, but unfortunately it's much harder. One has to make
sure that the checksums match what's on disk, no matter how the update process
is interrupted and what happens upstream. The bulk of it is not difficult but
the corner cases are :-( The patch is surely a godsend to the origin of content
but not to the destination.
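
To make the corner cases concrete, here is a minimal sketch of a mirror-side
checksum cache. It is not the rsync cached-checksums patch; the function
names, the JSON cache file and the choice of sha512 are all assumptions made
for illustration. A cached digest is trusted only while the file's size and
mtime still match what was recorded, and the cache is written with a rename so
that an interrupted run does not leave a half-written cache behind.

```python
import hashlib
import json
import os


def file_digest(path, algo="sha512", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_digest(cache, path, algo="sha512"):
    """Return the digest of path, re-reading the file only when its size or
    mtime differ from the values stored next to the cached digest."""
    st = os.stat(path)
    key = os.path.abspath(path)
    entry = cache.get(key)
    if (entry is not None
            and entry["size"] == st.st_size
            and entry["mtime_ns"] == st.st_mtime_ns):
        return entry["digest"]
    digest = file_digest(path, algo)
    cache[key] = {"size": st.st_size,
                  "mtime_ns": st.st_mtime_ns,
                  "digest": digest}
    return digest


def save_cache(cache, cache_path):
    """Write to a temporary file, then rename it into place, so that an
    interrupted run leaves the previous cache file intact."""
    tmp = cache_path + ".tmp"  # hypothetical naming scheme
    with open(tmp, "w") as f:
        json.dump(cache, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, cache_path)
```

Note the weakness: a file rewritten with the same size and the same mtime is
served with a stale digest. That is exactly the kind of case the mirror cannot
rule out on its own.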

Anyway, I agree checksums are better, so much so that I DO USE checksums when
they're available, like Debian does. I won't use the --checksum option
for Gentoo or any other repository but if you provide a file with a
list of them at your repository the C3SL mirror will use them. The format of
the file should be like the md5/sha* one. These utilities include only regular
files, so you also have to provide another file with a list containing all
objects in the repository. Please use

   cd /root-of-repository && TZ=UTC rsync --no-h --list-only -r . > /path/to/filelist

to create it, because it's easy to parse.
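
As a rough illustration of why that listing is easy to parse, here is a short
sketch. It assumes each line has the shape "permissions size YYYY/MM/DD
HH:MM:SS name", with symlinks shown as "name -> target"; the function name and
the returned dictionary are made up for the example, and this is not the
script we run.

```python
import re
from datetime import datetime, timezone

# Assumed line shape: permissions, size in plain digits (--no-h),
# date, time, then the name, which may itself contain spaces.
LINE_RE = re.compile(
    r"^(\S+) +(\d+) (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_list_line(line):
    """Parse one line of the listing; return None if it does not match."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    perms, size, stamp, name = m.groups()
    # The list was generated with TZ=UTC, so the timestamps are UTC.
    mtime = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S").replace(
        tzinfo=timezone.utc)
    target = None
    if perms.startswith("l") and " -> " in name:
        name, target = name.split(" -> ", 1)
    return {"perms": perms, "size": int(size), "mtime": mtime,
            "name": name, "target": target}
```

The permission string tells regular files, directories and symlinks apart,
which is what the checksum file alone cannot do.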

md5 is becoming increasingly vulnerable, so the Debian repository maintainers
are thinking about using other hashes. It seems sha512 is faster than sha256 on
64-bit machines, making it a good option. If you use md5sum the mirror job
here is simpler because rsync already does the check; for other hashes it's
harder on the mirror side because we have to calculate them after download, but
the cost is small nowadays. I'm willing to do it and modify our script
accordingly.
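
For what it's worth, the check after download could look roughly like the
sketch below. It assumes a list in the usual md5sum/sha512sum layout (digest,
a space, then a space or "*", then the name) and ignores the backslash-escaped
form used for unusual file names; the function names are hypothetical and this
is not our actual script.

```python
import hashlib
import os


def hash_file(path, algo="sha512", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_sum_line(line):
    """Split one 'digest  name' (or 'digest *name') line into its parts."""
    line = line.rstrip("\n")
    digest, sep, rest = line.partition(" ")
    if not sep or len(rest) < 2 or rest[0] not in " *":
        raise ValueError("unrecognized checksum line: %r" % line)
    return digest.lower(), rest[1:]


def verify_tree(root, sum_lines, algo="sha512"):
    """Return the names that are missing or whose content does not match."""
    bad = []
    for line in sum_lines:
        if not line.strip():
            continue  # tolerate blank lines
        expected, name = parse_sum_line(line)
        try:
            actual = hash_file(os.path.join(root, name), algo)
        except OSError:
            bad.append(name)  # missing or unreadable file
            continue
        if actual != expected:
            bad.append(name)
    return bad
```

Anything returned by verify_tree would be fetched again or reported, depending
on how the mirror script handles failures.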
