Re: Reducing rsync cost

2010-11-27 Thread Ask Bjørn Hansen

On 11/24/10 1:23, Nicholas Clark wrote:

The symlink tree is built by scripts, isn't it? Are they available?


Actually, it hasn't been built since earlier this year.  I promised to
write a note announcing that but haven't gotten around to it.


So inadvertently we got to test if anyone cared.  The answer is no.  :-)

Nobody is navigating the CPAN via FTP anymore, so the symlinks serve
almost no useful purpose.


The longer-term plan is to get rid of all the symlink files first
(although they should get pruned anyway as the files the links point to
are removed) and eventually the directories (in some number of years).


I imagine it might be useful to put in index.html and README files in 
each directory with a relevant pointer to 
http://search.cpan.org/search?q=$foo - but I don't have time to rig that up.




 - ask


Re: Reducing rsync cost

2010-11-27 Thread Johan Vromans
Ask Bjørn Hansen a...@develooper.com writes:

 On 11/24/10 1:23, Nicholas Clark wrote:
 The symlink tree is built by scripts, isn't it? Are they available?
 So inadvertently we got to test if anyone cared.  The answer is no.  :-)

I can't say I'm not anyone, but apparently one of very few :( .

-- Johan


Re: Reducing rsync cost

2010-11-25 Thread David Precious
On Wednesday 24 November 2010 10:51:25 David Golden wrote:
 The new fast CPAN mirrors use File::Rsync::Mirror::Recent, which
 uses the new RECENT.* files to manage the synchronization process.
 Those files record recent changes (adds/deletes) to the frequently
 changing authors/ and modules/ directories.  The fast mirrors use
 those files to sync with PAUSE every minute or so with very low
 overhead.
[...]
 See http://tinyurl.com/35t9u3k for instructions on using F::R::M::Recent.

Thanks for the heads-up; I'd not seen that approach.  That certainly makes a 
lot of sense!

Since at a cursory glance it seems none of the current fast mirrors are in 
the UK, I'll drop a mail to c...@cpan.org offering a UK mirror to take part :)

Cheers

Dave P



Re: Reducing rsync cost

2010-11-24 Thread Nicholas Clark
On Tue, Nov 23, 2010 at 10:24:18PM +0100, David Landgren wrote:
 On 22/11/2010 15:18, David Nicol wrote:
 On Mon, Nov 22, 2010 at 4:37 AM, David Landgren da...@landgren.net wrote:
 Yeah, this is the killer. In an ideal world, we would kill the symlinks such
 as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These
 could be recreated via shell scripts locally on mirrors for people who wish
 to maintain these legacies. Cutting that out would diminish the rsync burden
 considerably.
 
 David
 
 or re-engineer CPAN as a sqlite+FTSE database, and re-engineer the
 mirroring process as a database mirror via a TBD compact database diff
 protocol (I have no intention of doing any of this myself; good
 morning)
 
 Well... I guess that's not going to happen then, is it?
 
 I shouldn't even bother replying, but I wouldn't want the archives to 
 think that silence indicates tacit agreement.

The symlink tree is built by scripts, isn't it? Are they available?

Because the nice thing about your suggestion is that it doesn't involve
changing any of the server infrastructure, and it's an incremental change
which can be done by each mirror in turn.

Instead of running rsync over the whole tree, it can change to run a top
level script that runs rsync over the parts that have to be copied, and then
run the symlink generation on the parts that can be recreated locally.
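
A minimal sketch of what such a top-level script might look like, in
Python. Nothing here comes from the actual CPAN scripts: the list of
locally rebuilt trees, the mirror URL used in the usage note, and the
link-to-target mapping passed to regenerate_links() are all assumptions
for illustration. The real mapping would have to come from whatever the
original symlink scripts use, which this thread does not describe.

```python
import os
from pathlib import Path

# Assumption: the trees a mirror would rebuild locally instead of copying.
LOCAL_ONLY = ("modules/by-category/", "modules/by-module/")


def build_rsync_command(source, dest, excludes=LOCAL_ONLY):
    """Return an rsync argv that skips the locally rebuilt trees."""
    cmd = ["rsync", "-a", "--delete"]
    # A leading slash anchors each pattern at the root of the transfer.
    cmd += [f"--exclude=/{path}" for path in excludes]
    return cmd + [source, dest]


def regenerate_links(root, links):
    """Create relative symlinks under root.

    links maps a link path to its target path, both relative to root
    (a hypothetical input format).
    """
    root = Path(root)
    for link, target in links.items():
        link_path = root / link
        link_path.parent.mkdir(parents=True, exist_ok=True)
        rel_target = os.path.relpath(root / target, link_path.parent)
        if link_path.is_symlink():
            link_path.unlink()  # replace a stale link
        link_path.symlink_to(rel_target)
```

A mirror would pass the result of
build_rsync_command("rsync://mirror.example.org/CPAN/", "/srv/cpan/")
to subprocess.run() and then call regenerate_links() on the local copy.
Because the excluded trees are not part of the transfer, --delete leaves
the locally generated links alone.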

Nicholas Clark


Re: Reducing rsync cost

2010-11-24 Thread David Golden
On Wed, Nov 24, 2010 at 4:23 AM, Nicholas Clark n...@ccl4.org wrote:
 Instead of running rsync over the whole tree, it can change to run a top
 level script that runs rsync over the parts that have to be copied, and then
 run the symlink generation on the parts that can be recreated locally.

The new fast CPAN mirrors use File::Rsync::Mirror::Recent, which
uses the new RECENT.* files to manage the synchronization process.
Those files record recent changes (adds/deletes) to the frequently
changing authors/ and modules/ directories.  The fast mirrors use
those files to sync with PAUSE every minute or so with very low
overhead.
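
To illustrate why syncing from a change log is cheap, here is a rough
Python sketch of the idea. It is not the File::Rsync::Mirror::Recent
implementation, and the event fields ("epoch", "path", "type") are an
assumption about what a parsed RECENT file might contain. The point is
that the mirror only looks at events newer than the last one it has
processed, instead of stat-ing the whole tree.

```python
def plan_sync(events, last_seen):
    """Work out what to fetch and delete from a list of change events.

    events: dicts with "epoch" (number), "path" (str) and "type"
            ("new" or "delete"); this shape is an assumption.
    last_seen: epoch of the newest event already applied locally.
    Returns (paths_to_fetch, paths_to_delete, newest_epoch).
    """
    fetch, delete = set(), set()
    newest = last_seen
    for event in sorted(events, key=lambda e: e["epoch"]):
        if event["epoch"] <= last_seen:
            continue  # already applied on an earlier run
        if event["type"] == "delete":
            fetch.discard(event["path"])
            delete.add(event["path"])
        else:
            delete.discard(event["path"])
            fetch.add(event["path"])
        newest = event["epoch"]
    return sorted(fetch), sorted(delete), newest
```

The later event for a path wins, so a file that was added and then
deleted between two runs is never fetched. The returned epoch is what
the mirror would store and pass in as last_seen on its next run.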

The fast-mirror admins are authorized by Andreas to hit PAUSE
directly, but I believe that anyone can use F::R::M::Recent with open
CPAN mirrors offering rsync service.  While it needs to run as a daemon,
as of version 0.0.8-TRIAL, all the memory-intensive work happens in
child processes and the main daemon is pretty lightweight. (Mine is
holding at about 8.8 MB of memory.)

The current fast mirrors are cpan.shadowcatprojects.net,
cpan.dagolden.com, cpan.hexten.net and cpan.cpantesters.org.  While
you probably shouldn't hit those every minute without checking with
the admins (e.g. me for cpan.dagolden.com), you can probably use
F::R::M::Recent to hit them several times an hour with no problem.

See http://tinyurl.com/35t9u3k for instructions on using F::R::M::Recent.

Regards,
David


Re: Reducing rsync cost

2010-11-24 Thread Andreas J. Koenig
 On Tue, 23 Nov 2010 22:24:18 +0100, David Landgren da...@landgren.net said:

   I shouldn't even bother replying, but I wouldn't want the archives to
   think that silence indicates tacit agreement.

To give you an update: five tier-1 CPAN sites are pulling a sync every
20 seconds, and every sync takes virtually no time (we could sync every
second if we had to). So this part of the problem is solved and has been
working for 20 months now. We still have quite a long way to go until the
whole CPAN infrastructure works as well, but the priorities have shifted
since the tier-1 problem was the most pressing one.

-- 
andreas


Re: Reducing rsync cost

2010-11-23 Thread David Landgren

On 22/11/2010 15:18, David Nicol wrote:

On Mon, Nov 22, 2010 at 4:37 AM, David Landgren da...@landgren.net wrote:

Yeah, this is the killer. In an ideal world, we would kill the symlinks such
as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These
could be recreated via shell scripts locally on mirrors for people who wish
to maintain these legacies. Cutting that out would diminish the rsync burden
considerably.

David


or re-engineer CPAN as a sqlite+FTSE database, and re-engineer the
mirroring process as a database mirror via a TBD compact database diff
protocol (I have no intention of doing any of this myself; good
morning)


Well... I guess that's not going to happen then, is it?

I shouldn't even bother replying, but I wouldn't want the archives to 
think that silence indicates tacit agreement.


David



Reducing rsync cost (was: Re: Using a better compression than .gz for one's CPAN modules)

2010-11-22 Thread David Landgren

On 19/11/2010 20:57, dhu...@hudes.org wrote:

source code, even 100KLOC? Once you go to .gz you're already at better
than 2:1. What are you going to save by going to even 3:1, 10Kbytes?
compared to the nuisance inflicted, it's nothing.


Over the entire CPAN archive, it'd be significant...

I agree on the individual case it's probably not worth worrying about too
much.  But if it's easy to use .bz2 or something better it wouldn't hurt
to get that word out.  (And it may be worth making it easy, though I'm not
sure about that.)

Daniel T. Staal


Disk space is cheap. Bandwidth is cheap. What's rough is the rsync between
mirrors. Compressing to .bz2 won't help that: the cost is in doing a stat
on every single file in CPAN, not in the transfer. Work toward optimizing the
mirror distribution instead of worrying about bz2 vs gz.  Remember not


Yeah, this is the killer. In an ideal world, we would kill the symlinks 
such as authors/id/*, modules/by-category/*, modules/by-module/* and so 
on. These could be recreated via shell scripts locally on mirrors for 
people who wish to maintain these legacies. Cutting that out would 
diminish the rsync burden considerably.
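
For anyone who wants to put a number on that for their own mirror, a
small Python sketch along these lines would count how many of the
entries a full tree walk touches are symlinks. The function name and the
path in the usage note are made up for illustration, and the result says
nothing about how rsync itself walks the tree.

```python
import os


def count_entries(root):
    """Count symlinks, regular files and directories below root."""
    counts = {"symlinks": 0, "files": 0, "dirs": 0}
    # os.walk does not descend into symlinked directories by default,
    # so each link is counted once and its target is not walked twice.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            kind = "symlinks" if os.path.islink(os.path.join(dirpath, name)) else "dirs"
            counts[kind] += 1
        for name in filenames:
            kind = "symlinks" if os.path.islink(os.path.join(dirpath, name)) else "files"
            counts[kind] += 1
    return counts
```

Running count_entries("/srv/cpan") on a local mirror gives the share of
entries that could in principle be left out of the rsync run and rebuilt
locally.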


David

--
There's bum trash in my hall and my place is ripped
I've totaled another amp, I'm calling in sick


Re: Reducing rsync cost (was: Re: Using a better compression than .gz for one's CPAN modules)

2010-11-22 Thread David Nicol
On Mon, Nov 22, 2010 at 4:37 AM, David Landgren da...@landgren.net wrote:
 Yeah, this is the killer. In an ideal world, we would kill the symlinks such
 as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These
 could be recreated via shell scripts locally on mirrors for people who wish
 to maintain these legacies. Cutting that out would diminish the rsync burden
 considerably.

 David

or re-engineer CPAN as a sqlite+FTSE database, and re-engineer the
mirroring process as a database mirror via a TBD compact database diff
protocol (I have no intention of doing any of this myself; good
morning)

-- 
It is merely a matter of persistence. -- Albert Camus