Re: Reducing rsync cost
On 11/24/10 1:23, Nicholas Clark wrote:
> The symlink tree is built by scripts, isn't it? Are they available?

Actually, since earlier this year it's not being built anymore. I promised to write a note announcing it but haven't gotten around to it.

So inadvertently we got to test if anyone cared. The answer is no. :-)

Nobody is navigating the CPAN via FTP anymore, so the symlinks serve ~no useful purpose.

The longer term plan is to first get rid of all the symlink files (although they should get pruned as the files the links point to are removed) and eventually the directories (in some number of years).

I imagine it might be useful to put index.html and README files in each directory with a relevant pointer to http://search.cpan.org/search?q=$foo - but I don't have time to rig that up.

- ask
Re: Reducing rsync cost
Ask Bjørn Hansen a...@develooper.com writes:
> On 11/24/10 1:23, Nicholas Clark wrote:
>> The symlink tree is built by scripts, isn't it? Are they available?
>
> So inadvertently we got to test if anyone cared. The answer is no. :-)

I can't say I'm not anyone, but apparently one of very few :( .

-- Johan
Re: Reducing rsync cost
On Wednesday 24 November 2010 10:51:25 David Golden wrote:
> The new fast CPAN mirrors use File::Rsync::Mirror::Recent, which uses the new RECENT.* files to manage the synchronization process. Those files record recent changes (adds/deletes) to the frequently changing authors/ and modules/ directories. The fast mirrors use those files to sync with PAUSE every minute or so with very low overhead.
> [...]
> See http://tinyurl.com/35t9u3k for instructions on using F::R::M::Recent.

Thanks for the heads-up; I'd not seen that approach. That certainly makes a lot of sense!

Since at a cursory glance it seems none of the current fast mirrors are in the UK, I'll drop a mail to c...@cpan.org offering a UK mirror to take part :)

Cheers
Dave P
Re: Reducing rsync cost
On Tue, Nov 23, 2010 at 10:24:18PM +0100, David Landgren wrote:
> On 22/11/2010 15:18, David Nicol wrote:
>> On Mon, Nov 22, 2010 at 4:37 AM, David Landgren da...@landgren.net wrote:
>>> Yeah, this is the killer. In an ideal world, we would kill the symlinks such as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These could be recreated via shell scripts locally on mirrors for people who wish to maintain these legacies. Cutting that out would diminish the rsync burden considerably.
>>>
>>> David
>>
>> or re-engineer CPAN as a sqlite+FTSE database, and re-engineer the mirroring process as a database mirror via a TBD compact database diff protocol
>>
>> (I have no intention of doing any of this myself; good morning)
>
> Well... I guess that's not going to happen then, is it?
>
> I shouldn't even bother replying, but I wouldn't want the archives to think that silence indicates tacit agreement.

The symlink tree is built by scripts, isn't it? Are they available?

Because the nice thing about your suggestion is that it doesn't involve changing any of the server infrastructure, and it's an incremental change which can be done by each mirror in turn. Instead of running rsync over the whole tree, it can change to run a top level script that runs rsync over the parts that have to be copied, and then run the symlink generation on the parts that can be recreated locally.

Nicholas Clark
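For illustration only, the top-level script described above could be sketched roughly as follows in Python. This is a sketch under assumptions, not any mirror's actual setup: the choice of excluded trees, the function names, and the `rebuild_links` callback (standing in for the symlink-generation scripts asked about above) are all hypothetical. It relies only on the standard rsync options `-a`, `--delete` and `--exclude`.

```python
import subprocess

# Assumption: these are the trees a mirror would recreate locally
# instead of copying. The list is illustrative, not an agreed set.
LOCALLY_REBUILT = ["modules/by-category/", "modules/by-module/"]


def build_rsync_command(source, dest, excluded=LOCALLY_REBUILT):
    """Return an rsync argument list that skips the locally rebuilt trees."""
    cmd = ["rsync", "-a", "--delete"]
    for path in excluded:
        # A leading "/" anchors the pattern at the root of the transfer.
        # Excluded paths are also protected from --delete on the receiver,
        # so locally generated symlinks are not removed by the sync.
        cmd.append("--exclude=/" + path)
    cmd += [source, dest]
    return cmd


def sync_then_rebuild(source, dest, rebuild_links):
    """Copy what must be copied, then regenerate the rest locally.

    rebuild_links is a hypothetical callable taking the mirror root;
    it stands in for whatever script generates the symlink trees.
    """
    subprocess.run(build_rsync_command(source, dest), check=True)
    rebuild_links(dest)
```

A mirror admin would call `sync_then_rebuild` with their upstream rsync URL, their local mirror root, and their own link-generation function. Each mirror could adopt this on its own schedule, since nothing changes on the server side.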
Re: Reducing rsync cost
On Wed, Nov 24, 2010 at 4:23 AM, Nicholas Clark n...@ccl4.org wrote:
> Instead of running rsync over the whole tree, it can change to run a top level script that runs rsync over the parts that have to be copied, and then run the symlink generation on the parts that can be recreated locally.

The new fast CPAN mirrors use File::Rsync::Mirror::Recent, which uses the new RECENT.* files to manage the synchronization process. Those files record recent changes (adds/deletes) to the frequently changing authors/ and modules/ directories. The fast mirrors use those files to sync with PAUSE every minute or so with very low overhead.

The fast mirror admins are authorized by Andreas to hit PAUSE directly, but I believe that anyone can use it with open CPAN mirrors offering rsync service.

While it needs to run as a daemon, as of version 0.0.8-TRIAL, all the memory intensive work happens in child processes and the main daemon is pretty lightweight. (Mine is holding at about 8.8 MB of memory).

The current list of fast mirrors is cpan.shadowcatprojects.net, cpan.dagolden.com, cpan.hexten.net and cpan.cpantesters.org. While you probably shouldn't hit those every minute without checking with the admins (e.g. me for cpan.dagolden.com), you can probably use F::R::M::Recent to hit them several times an hour with no problem.

See http://tinyurl.com/35t9u3k for instructions on using F::R::M::Recent.

Regards,
David
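To make the idea of syncing from a change log concrete, here is a minimal Python sketch. It is not the File::Rsync::Mirror::Recent implementation or its file format. It assumes a simplified, hypothetical event record with an `epoch`, a `path`, and a `type` of either `new` or `delete`, and shows how a mirror could work out what to fetch and what to remove without stat-ing the whole tree.

```python
def plan_sync(events, last_seen_epoch):
    """Work out which paths to fetch and which to delete.

    events: iterable of dicts with 'epoch', 'path' and 'type' keys
            (a simplified, hypothetical record layout).
    last_seen_epoch: timestamp of the newest event already applied.
    Returns (to_fetch, to_delete, newest_epoch).
    """
    latest = {}
    # Process in time order so the most recent event per path wins.
    for event in sorted(events, key=lambda e: e["epoch"]):
        if event["epoch"] > last_seen_epoch:
            latest[event["path"]] = event

    to_fetch = sorted(p for p, e in latest.items() if e["type"] == "new")
    to_delete = sorted(p for p, e in latest.items() if e["type"] == "delete")
    newest = max((e["epoch"] for e in latest.values()), default=last_seen_epoch)
    return to_fetch, to_delete, newest
```

The cost of each pass depends on the number of recent changes rather than on the number of files in the archive, which is why frequent polling stays cheap.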
Re: Reducing rsync cost
On Tue, 23 Nov 2010 22:24:18 +0100, David Landgren da...@landgren.net said:
> I shouldn't even bother replying, but I wouldn't want the archives to think that silence indicates tacit agreement.

To give you an update: five tier-1 CPAN sites are pulling a sync every 20 seconds and every sync takes virtually no time (we could sync every second if we had to). So this part of the problem is solved and has been working for 20 months now.

We still have quite a long way to go until the whole CPAN infrastructure works as well, but the priorities have shifted since the tier-1 problem was the most pressing one.

--
andreas
Re: Reducing rsync cost
On 22/11/2010 15:18, David Nicol wrote:
> On Mon, Nov 22, 2010 at 4:37 AM, David Landgren da...@landgren.net wrote:
>> Yeah, this is the killer. In an ideal world, we would kill the symlinks such as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These could be recreated via shell scripts locally on mirrors for people who wish to maintain these legacies. Cutting that out would diminish the rsync burden considerably.
>>
>> David
>
> or re-engineer CPAN as a sqlite+FTSE database, and re-engineer the mirroring process as a database mirror via a TBD compact database diff protocol
>
> (I have no intention of doing any of this myself; good morning)

Well... I guess that's not going to happen then, is it?

I shouldn't even bother replying, but I wouldn't want the archives to think that silence indicates tacit agreement.

David
Reducing rsync cost (was: Re: Using a better compression than .gz for one's CPAN modules)
On 19/11/2010 20:57, dhu...@hudes.org wrote:
>>> source code, even 100KLOC? Once you go to .gz you're already at better than 2:1. What are you going to save by going to even 3:1, 10Kbytes? compared to the nuisance inflicted, it's nothing.
>>
>> Over the entire CPAN archive, it'd be significant...
>>
>> I agree on the individual case it's probably not worth worrying about too much. But if it's easy to use .bz2 or something better it wouldn't hurt to get that word out. (And it may be worth making it easy, though I'm not sure about that.)
>>
>> Daniel T. Staal
>
> Disk space is cheap. Bandwidth is cheap. What's rough is the rsync between mirrors. Compressing to .bz2 won't help that: the stress is doing a stat on every single file in CPAN not the transfer. Work toward optimizing the mirror distribution instead of worrying about bz2 vs gz. Remember not

Yeah, this is the killer. In an ideal world, we would kill the symlinks such as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These could be recreated via shell scripts locally on mirrors for people who wish to maintain these legacies. Cutting that out would diminish the rsync burden considerably.

David
--
There's bum trash in my hall and my place is ripped
I've totaled another amp, I'm calling in sick
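As a rough way to see how much of the per-file cost comes from symlinks, a mirror admin could count the entries a full-tree scan has to examine. The Python sketch below is only an illustration: the function name is made up, and it measures entry counts on a local copy, not rsync's actual work.

```python
import os


def count_entries(root):
    """Count regular entries, directories and symlinks under root.

    Each counted entry is something a full-tree sync would have to
    examine, so the totals give a rough feel for the scan cost.
    """
    counts = {"files": 0, "dirs": 0, "symlinks": 0}
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them in dirnames.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            full = os.path.join(dirpath, name)
            counts["symlinks" if os.path.islink(full) else "dirs"] += 1
        for name in filenames:
            full = os.path.join(dirpath, name)
            counts["symlinks" if os.path.islink(full) else "files"] += 1
    return counts
```

Comparing the `symlinks` total with the `files` total on a local mirror would show how much of the scan could be avoided by regenerating links locally.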
Re: Reducing rsync cost (was: Re: Using a better compression than .gz for one's CPAN modules)
On Mon, Nov 22, 2010 at 4:37 AM, David Landgren da...@landgren.net wrote:
> Yeah, this is the killer. In an ideal world, we would kill the symlinks such as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These could be recreated via shell scripts locally on mirrors for people who wish to maintain these legacies. Cutting that out would diminish the rsync burden considerably.
>
> David

or re-engineer CPAN as a sqlite+FTSE database, and re-engineer the mirroring process as a database mirror via a TBD compact database diff protocol

(I have no intention of doing any of this myself; good morning)

--
It is merely a matter of persistence.
-- Albert Camus