Robin H. Johnson ([email protected]) wrote on Fri, Jan 29, 2016 at 03:45:13AM BRST:
> On Thu, Jan 28, 2016 at 08:42:33PM -0200, Carlos Carvalho wrote:
> > The issue is not calculating checksums, it's I/O. The gentoo repository is
> > now 335GB. It's out of the question to read it all at every update.
> > And you should know it!
>
> The primary module I care about, gentoo-portage (historically
> gentoo-x86), is under 400MiB raw (~207k inodes however, so watch your
> inode space waste). If your mirror DOESN'T have that entire rsync
> module sitting in cache already, you probably have fairly low traffic.

No, a big mirror like ours has tens of repositories and more than 10 million
inodes. Gentoo is a small part of it.

> Hashing that much _was_ a CPU hit a decade ago, and most people did NOT
> have the memory to fit it all into cache either.

For mirroring many repositories the main problem has always been disk I/O,
particularly inodes. We know this because we have been mirroring for about a
decade.

> The other issues of mtimes of Manifests esp on ebuild removal have been
> resolved by the various series of patches, but part of those were
> artificially bumping the mtime by a single second in certain cases.
>
> Those cases are STILL going to have a bumpy time in some situations
> because Git's commit time resolution is only 1 second (ditto many
> filesystems). Those situations either need sub-second resolution in the
> entire ecosystem or checksums of some form.
>
> We've had ~27k commits in the last 6 months, of which:
> - 4811 colliding (commit timestamp only)
> - 45 colliding (author timestamp only)
> - 12 colliding (commit timestamp, author timestamp)
>
> We're using author timestamps on the outgoing rsync files, and
> eventually we ARE going to have a real collision that hits users.
>
> This didn't happen with CVS because even if you were really fast (read:
> local to the CVS server, and did not go via SSH), you could only get it
> down to about 2 seconds per commit, and never in the same package. Two
> different devs could never make a commit to the same package in the same
> second.

You'll have to deal with it at the repository building stage.

> > Also, we block --checksum from clients. Most big mirrors do it.
> The rsync cached checksums patch needs to get popular again, because
> then the mirrors won't have any huge burden at all:
> - update the checksums when syncing from the parent repo
> - compare against the checksums when queried by the client

It'd be really nice yes, but unfortunately it's much harder.
One has to make sure that the checksums match what's on disk, no matter how the
update process is interrupted or what happens upstream. The bulk of it is not
difficult but the corner cases are :-( The patch is surely a godsend to the
origin of content but not to the destination.

Anyway, I agree checksums are better, so much so that I DO USE checksums when
they're available, as they are for Debian. I won't use the --checksum option
for Gentoo or any other repository, but if you provide a file with a list of
checksums in your repository the C3SL mirror will use them. The format of the
file should be like the output of the md5sum/sha*sum utilities. These utilities
include only regular files, so you also have to provide another file listing
all objects in the repository. Please use

cd /root-of-repository && TZ=UTC rsync --no-h --list-only -r . > /path/to/filelist

to create it, because it's easy to parse.

md5 is becoming increasingly vulnerable, so the Debian repository maintainers
are thinking about using other hashes. It seems sha512 is faster than sha256 on
64-bit machines, making it a good option. If you use md5sum the mirror job here
is simpler because rsync already does the check; for other hashes it's harder
on the mirror side because we have to calculate them after download, but the
cost is small nowadays. I'm willing to do it and modify our script accordingly.
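
To give an idea of what the check after download involves for the non-md5
hashes, here is a rough Python sketch. It is not our actual mirror script. It
assumes the checksum list is in sha512sum output format with paths relative to
the repository root, and the function names and the SHA512SUMS file name in the
usage note are made up for illustration. Lines that sha512sum escapes with a
leading backslash (odd filenames) are not handled.

```python
import hashlib
import os


def parse_sum_line(line):
    """Split one md5sum/sha512sum-style line into (digest, path)."""
    digest, sep, rest = line.rstrip("\n").partition(" ")
    # After the digest comes one space, then a mode marker:
    # " " for text mode or "*" for binary mode, then the path.
    if not sep or len(rest) < 2 or rest[0] not in " *":
        raise ValueError(f"unrecognised checksum line: {line!r}")
    return digest.lower(), rest[1:]


def file_digest(path, algo="sha512", chunk_size=1 << 20):
    """Hash a file in chunks so large files are never read in one piece."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_tree(root, sums_path, algo="sha512"):
    """Return the sorted relative paths that are missing or do not match."""
    bad = []
    with open(sums_path, encoding="utf-8") as sums:
        for line in sums:
            if not line.strip():
                continue  # tolerate blank lines in the list
            expected, rel = parse_sum_line(line)
            full = os.path.join(root, rel)
            if not os.path.isfile(full) or file_digest(full, algo) != expected:
                bad.append(rel)
    return sorted(bad)
```

Something like verify_tree("/path/to/mirror/gentoo", "/path/to/SHA512SUMS")
would then return the files to fetch again. It only reads the files named in
the list, so the I/O cost is paid for what was just downloaded if the list is
restricted to changed files, which is an assumption about how the update job
would call it.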
