Re: Trimming the CPAN - Automatic Purging

2010-04-04 Thread David Nicol
 It hasn't been done because it's outside of the scope of design for rsync.
 It's meant to sync arbitrary filesets in which many, if not all, changes are
 made out of band.  It's decidedly non-trivial to implement in that mode
 unless you're willing to accept a certain window in which your database may
 be out of date.

 But, in a situation like PAUSE, where the avenues by which files can be
 introduced into the file sets are controlled, it does become trivial.  It's
 the gatekeeper, it knows who's been in or out.

So the requirements for the Solution To The Problem Which Solves A
More General Problem Than The Immediate Problem And Will Therefore
Make Whoever Sets It Up A Hero include a replacement for the current
mirroring technology stack that is tailored to mirroring distributions,
possibly including on-demand caching and expiration, and that is
trivial to install -- something like

  perl -MCPAN -e 'install STTPWSAMGPTTIPAWTMWSIUAH::Mirrorsuite'
  nohup nice nice perl -MSTTPWSAMGPTTIPAWTMWSIUAH::Mirrorsuite -e 'mirror cpan.org .'


Re: Trimming the CPAN - Automatic Purging

2010-04-03 Thread Ask Bjørn Hansen

On Apr 2, 2010, at 1:50, Arthur Corliss wrote:

 And my assertion has been that the excessive stats by the server are a bigger
 impediment to synchronization than the inode count.

Well, then one of us doesn't understand how file systems etc work.  :-)


  - ask

Re: Trimming the CPAN - Automatic Purging

2010-04-02 Thread Ask Bjørn Hansen

On Apr 1, 2010, at 19:49, Arthur Corliss wrote:

I can't believe I'm doing this, but ...

 The main point here is that we can't use 20 inodes per distribution.  It's 
 Just Nuts.   Sure, it's only something like 400k files/inodes now - but at 
 the rate it's going it'll be a lot more soon enough.
 
 That's a problem, but not likely the biggest drag on server I/O you're
 suffering.  Might that be, ahem, rsync?

That reply doesn't even make sense.

 HOWEVER: Right now more of those are wasted on other things (.readme files, 
 symlinks, ...) -- some of which have solutions in progress already.
 
 I don't think anyone is arguing that we NEED to delete the old 
 distributions; only that they do indeed have a cost to keep around in the 
 main CPAN.
 
 You're right, I'm not arguing the need for the cruft.  I've only pointed out
 the obvious reality that trimming files only postpones the I/O management
 issues that at some time are likely going to have to be addressed, anyway.
 And that you'll get less bang for the buck (or man hour) by treating the
 symptoms, not the disease.
 
 For the record:  if that's what you want to do, have at it.  Let's just not
 be disingenuous about the fact that we're abrogating our responsibilities as
 technologists by refusing to address the real problems and weaknesses of the
 platform.

You are confusing we, I and you again.



Yes, I (and I'm guessing everyone else who has thought about it for more than 
say 5 seconds) agree that having rsync remember the file tree to save the disk 
IO for each sync sounds like an obvious solution.  

But reality is more complicated.  If it was such an obviously good solution 
someone would have done it by now.  (For starters, try this question: What is 
the kernel cache?).

Andreas' solution is much more sensible -- and as has been pointed out before 
we DO USE THAT; but the problem here is not with clients who are interested 
enough to do something special and dedicate resources to their CPAN mirroring.


 - ask



Re: Trimming the CPAN - Automatic Purging

2010-04-02 Thread Eakin, Lee
Much of this discussion is beyond my depth but in terms of keeping it
simple, and trying to limit the stat calls on the upstream servers,
what about DNS as a replication model?  You could break up the tree at
logical divisions similar to zones and assign them serial numbers
(say a .serial file) and then still use rsync, but broken up into modules to
avoid recursion into sub-trees where the serial number is up to date?
The rsyncd.conf could be published also so replicas use the same
include/exclude logic.
-lee
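
A minimal sketch of the serial-number idea above, assuming each subtree
publishes a single integer in a .serial file and that a replica has already
read its own and the upstream's serials into dictionaries.  The subtree names
and the serial format are hypothetical; nothing here is an existing CPAN or
rsync convention.

```python
def stale_subtrees(upstream_serials, local_serials):
    """Return the subtrees whose upstream serial is newer than the local one.

    A subtree missing from the local side is treated as stale, so it gets
    synced on the first run.
    """
    return sorted(
        subtree
        for subtree, serial in upstream_serials.items()
        if local_serials.get(subtree, -1) < serial
    )


# Hypothetical zones and serials, DNS-style (date plus a counter).
upstream = {
    "authors/id/A": 2010040201,
    "authors/id/B": 2010033101,
    "modules": 2010040105,
}
local = {
    "authors/id/A": 2010040101,
    "authors/id/B": 2010033101,
}

# Only these subtrees would need an rsync run; the rest are skipped
# without the server having to stat anything beneath them.
print(stale_subtrees(upstream, local))
```

Fetching the serial files is one small transfer per zone, so the expensive
recursion would only happen where a serial has actually moved.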


Re: Trimming the CPAN - Automatic Purging

2010-04-02 Thread Arthur Corliss

On Fri, 2 Apr 2010, Ask Bjørn Hansen wrote:



On Apr 2, 2010, at 1:50, Arthur Corliss wrote:


And my assertion has been that the excessive stats by the server are a bigger
impediment to synchronization than the inode count.


Well, then one of us doesn't understand how file systems etc work.  :-)


Indeed.  If you're running UFS perhaps you might have a gripe.  But with
many filesystems in use supporting dynamic allocation groups with the inode
data stored near the actual data blocks, along with b-tree indexing, this
isn't as much of an issue for many of us.

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-04-01 Thread Arthur Corliss

On Wed, 31 Mar 2010, Ask Bjørn Hansen wrote:

snip


Everyone who doesn't run mirrors says "oh, who cares - it doesn't bother me".

Some of us who do run mirrors say "actually, that sort of thing is important and 
an actual issue".

Others reply "then you're doing it wrong".  But nobody came up with something 
reality-based that'd be right.


Some revisionist history here.  I run mirrors (not CPAN) and know full well
the limitations and inefficiencies of rsync.  To date, not one of you has
been able to refute that for this scale rsync is hurting you.  But most of
you have been obstinately against finding a more efficient way of doing things.

I've made a viable suggestion, and offered some time to work on it.  But
you've made it abundantly clear that it's not welcome.


The main point here is that we can't use 20 inodes per distribution.  It's Just 
Nuts.   Sure, it's only something like 400k files/inodes now - but at the rate 
it's going it'll be a lot more soon enough.


That's a problem, but not likely the biggest drag on server I/O you're
suffering.  Might that be, ahem, rsync?


HOWEVER: Right now more of those are wasted on other things (.readme files, 
symlinks, ...) -- some of which have solutions in progress already.

I don't think anyone is arguing that we NEED to delete the old distributions; 
only that they do indeed have a cost to keep around in the main CPAN.


You're right, I'm not arguing the need for the cruft.  I've only pointed out
the obvious reality that trimming files only postpones the I/O management
issues that at some time are likely going to have to be addressed, anyway.
And that you'll get less bang for the buck (or man hour) by treating the
symptoms, not the disease.

For the record:  if that's what you want to do, have at it.  Let's just not
be disingenuous about the fact that we're abrogating our responsibilities as
technologists by refusing to address the real problems and weaknesses of the
platform.

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-04-01 Thread Arthur Corliss

On Fri, 2 Apr 2010, Ask Bjørn Hansen wrote:


I can't believe I'm doing this, but ...


:-) All for entertainment's sake...


The main point here is that we can't use 20 inodes per distribution.  It's Just 
Nuts.   Sure, it's only something like 400k files/inodes now - but at the rate 
it's going it'll be a lot more soon enough.


That's a problem, but not likely the biggest drag on server I/O you're
suffering.  Might that be, ahem, rsync?


That reply doesn't even make sense.


Then you've ignored most of this thread.  Inode counts themselves aren't
indicative of anything.  It's the I/O access patterns that are.  And my
assertion has been that the excessive stats by the server are a bigger
impediment to synchronization than the inode count.


You're right, I'm not arguing the need for the cruft.  I've only pointed out
the obvious reality that trimming files only postpones the I/O management
issues that at some time are likely going to have to be addressed, anyway.
And that you'll get less bang for the buck (or man hour) by treating the
symptoms, not the disease.

For the record:  if that's what you want to do, have at it.  Let's just not
be disingenuous about the fact that we're abrogating our responsibilities as
technologists by refusing to address the real problems and weaknesses of the
platform.


You are confusing we, I and you again.


Perhaps.




Yes, I (and I'm guessing everyone else who has thought about it for more than say 5 
seconds) agree that having rsync remember the file tree to save the disk IO for each sync 
sounds like an obvious solution.

But reality is more complicated.  If it was such an obviously good solution someone would 
have done it by now.  (For starters, try this question: What is the kernel 
cache?).


It hasn't been done because it's outside of the scope of design for rsync.
It's meant to sync arbitrary filesets in which many, if not all, changes are
made out of band.  It's decidedly non-trivial to implement in that mode
unless you're willing to accept a certain window in which your database may
be out of date.

But, in a situation like PAUSE, where the avenues by which files can be
introduced into the file sets are controlled, it does become trivial.  It's
the gatekeeper, it knows who's been in or out.


Andreas' solution is much more sensible -- and as has been pointed out before 
we DO USE THAT; but the problem here is not with clients who are interested 
enough to do something special and dedicate resources to their CPAN mirroring.


By all means, I'm not opposed to any solution that actually addresses the
problem.  I don't agree that it would be the fastest route to implementation,
but there's no question that File::Rsync::Mirror::Recent would help things.
I'd support (and help) that goal.

My objections are more properly directed to those stuck on just deleting
files from the tree.

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-04-01 Thread Arthur Corliss

On Fri, 2 Apr 2010, Ask Bjørn Hansen wrote:


Talk = ZzZz.
Code = Interesting.
Deployment = Useful.


Please.  The talk serves to gauge interest before I waste any time
implementing a solution that's already been rejected out of hand.  As I've
mentioned repeatedly I already use rsync, albeit on much smaller filesets
which don't kill my servers.

So far I haven't seen much openness by those actually affected by the problem
in considering an alternative to rsync.

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-03-31 Thread Nicholas Clark
On Tue, Mar 30, 2010 at 10:08:57PM +0200, Rene Schickbauer wrote:

 Now, if we were to put all files into mercurial, git or the like, 
 renaming the files so they don't have version numbers in their names but 
 storing them sequentially as commits so new versions update old ones.

Sort of like Schwern already did?

http://github.com/gitpan

Nicholas Clark


Re: Trimming the CPAN - Automatic Purging

2010-03-31 Thread Rene Schickbauer

Nicholas Clark wrote:

On Tue, Mar 30, 2010 at 10:08:57PM +0200, Rene Schickbauer wrote:

Now, if we were to put all files into mercurial, git or the like, 
renaming the files so they don't have version numbers in their names but 
storing them sequentially as commits so new versions update old ones.


Sort of like Schwern already did?

http://github.com/gitpan


Yeah, looks about right at first glance. Didn't know that one, 
definitely have to look into this a bit more ;-)


LG
Rene


Re: Trimming the CPAN - Automatic Purging

2010-03-31 Thread Rene Schickbauer

David Nicol wrote:

On Sun, Mar 28, 2010 at 2:32 PM, Elaine Ashton eash...@mac.com wrote:

On Mar 28, 2010, at 12:48 PM, Randy Kobes wrote:


Has some sort of disk quota system for CPAN author accounts ever been 
considered?

Not specifically, no, at least not that I'm aware of. That would have to be 
implemented on PAUSE and quotas frequently end up not solving the real problem 
and create a headache both for the sysadmin and the users.


new proposal: Make modules pay rent in order to remain on a mirror.
Rent could be in the form of actual user interest, or good reviews.


Hmm, this can *only* work as long as that model is not applied to the 
main server: Just because a module is seldom used doesn't 
automatically mean it is not vital to *someone*.


Modules that might fit into this category are many Acme modules. For 
example, I use Acme::Don't sometimes, because it's better for 
temporarily commenting out code sections than if(0).


LG
Rene


Re: Trimming the CPAN - Automatic Purging

2010-03-31 Thread Dana Hudes
Arthur, your ignorance is appalling.
Go look at what ORCA does.
SAR doesn't give you the info.
With ORCA I have anything from kstat or iostat. It goes into a round-robin 
database with rrdtool.

Procallator does for Linux what 
orcallator does for Solaris, where it is the standard performance tool.
--Original Message--
From: Arthur Corliss
To: Dana Hudes
Cc: module-authors@perl.org
Sent: Mar 29, 2010 1:12 PM
Subject: Re: Trimming the CPAN - Automatic Purging

On Mon, 29 Mar 2010, Dana Hudes wrote:

 Orcallator, procallator and friends aren't shiny new toys
 Adrian Cockcroft wrote the initial version of orcallator in the early 90s for his 
 book Solaris Performance Tuning. The 2nd edition is I think 1998.
 The current version of ORCA (processes the collected data) is from I believe 
 2007 or so
 www.orcaware.org i think it was

I was being facetious.  Your immediate dismissal of SAR is ill-advised.  I'm
wearing my asbestos-lined boxers, so I'll lob this little inflammatory gem
out there:  if you're running a server (especially in production) and you're
*not* running SAR, you're a freaking idiot.

Profiling individual programs is all well and good for occasional or
developer use, but the point of SAR is to give you a global view into the
health of your system and to identify architectural bottlenecks.  I think it
would be greatly entertaining for Elaine or any of the other mirror
operators to post their SAR reports so you guys can see the huge amount of
abuse being heaped on their servers.

SAR is debatably one of the lowest overhead methods of gaining that
macroscopic view, and it still has profiling value on development systems
when you're testing a specific workload.

To ignore SAR is to show zero competence as a sys-admin.

--Arthur Corliss
  Live Free or Die



Re: Trimming the CPAN - Automatic Purging

2010-03-31 Thread Adam Kennedy
I've said nothing till now, because I figured more noise wouldn't help much.

But I quite like the rsync daemon/proxy idea, and as it so happens I'm
attending the OzLabs Unconference in 3 weeks time to hang out with
Tridge, Rusty and the other Australian C/Kernel/Samba/RSync elites.

So I'd be happy to raise any issues or ideas in this area with them in
person over beers.

Adam K

On Sun, Mar 28, 2010 at 7:08 PM, Eric Wilhelm enoba...@gmail.com wrote:
 Or even write an rsync daemon (or proxy perhaps) in Perl.  So, when the
 client asks for a file, you can answer without checking the disk.  Can
 something like that work with an unmodified client, or does the amount
 of data needed to answer a naive client overwhelm any potential gain?

 Unfortunately the protocol is not formally documented and the perl code
 I've seen (File::RsyncP) seems to be lagging:


Re: Trimming the CPAN - Automatic Purging

2010-03-31 Thread Nicholas Clark
On Wed, Mar 31, 2010 at 01:03:51PM +1100, Adam Kennedy wrote:
 I've said nothing till now, because I figured more noise wouldn't help much.
 
 But I quite like the rsync daemon/proxy idea, and as it so happens I'm
 attending the OzLabs Unconference in 3 weeks time to hang out with
 Tridge, Rusty and the other Australian C/Kernel/Samba/RSync elites.
 
 So I'd be happy to raise any issues or ideas in this area with them in
 person over beers.

I can see two possibly useful things (and I have no idea if either is yet
possible, nor do I have a great understanding of how the protocol works)

1: stateful rsync daemon which doesn't scan all the time, either by
   a: Actually having a means to update
   b: Simply telling fibs, and pretending that the file system it scanned
      $n minutes ago is still current. (Which I think would work, at least for
      a mirror where files aren't edited (much) - if the server discovers that
      the client's view of that file *is* out of date, then scan that file for
      real, and give the up to date truth)

2: federated (or federate-able) server (or proxy) - so that you can say
   "hand this subtree off to that other server"
   This would allow the (fast, existing, C) rsync server to serve most of
   (say) funet.fi, handing off to a stateful server for the CPAN subtree.
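
Option 1b above can be sketched as follows.  This is only an illustration of
the "remember the last scan, re-stat one file on demand" idea, assuming a
plain directory tree on local disk; the class name, its methods and the
ten-minute default are hypothetical and say nothing about how the real rsync
daemon or its protocol works.

```python
import os
import time


class CachedListing:
    """Serve a remembered tree scan; stat individual files only when asked."""

    def __init__(self, root, max_age_seconds=600):
        self.root = root
        self.max_age = max_age_seconds
        self.scanned_at = None
        self.entries = {}  # relative path -> (size, mtime)

    def _scan(self):
        # The expensive part: one stat per file in the whole tree.
        entries = {}
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                st = os.stat(full)
                rel = os.path.relpath(full, self.root)
                entries[rel] = (st.st_size, int(st.st_mtime))
        self.entries = entries
        self.scanned_at = time.time()

    def listing(self):
        # "Telling fibs": reuse the old scan until it is older than max_age.
        too_old = (
            self.scanned_at is None
            or time.time() - self.scanned_at > self.max_age
        )
        if too_old:
            self._scan()
        return self.entries

    def verify(self, relpath):
        # Called when a client's view disagrees: stat just this one file.
        full = os.path.join(self.root, relpath)
        try:
            st = os.stat(full)
        except FileNotFoundError:
            self.entries.pop(relpath, None)
            return None
        self.entries[relpath] = (st.st_size, int(st.st_mtime))
        return self.entries[relpath]
```

With this shape, a burst of clients arriving within the same window costs one
tree walk in total rather than one per client, at the price of a listing that
may be up to max_age_seconds out of date.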

Nicholas Clark


Re: Trimming the CPAN - Automatic Purging

2010-03-31 Thread David Nicol
On Wed, Mar 31, 2010 at 10:45 AM, David Landgren da...@landgren.net wrote:
 On 31/03/2010 06:52, David Nicol wrote:

 new proposal: Make modules pay rent in order to remain on a mirror.
 Rent could be in the form of actual user interest, or good reviews.

 Use as a dependency could count as rent.

 Put a value tag on things and people will game the system to ensure their
 files are up on top. Doomed to failure.

I'm not suggesting that there be any kind of who-is-on-top game; the
game is who falls out the bottom. If someone cares enough to want to
game the system to ensure their files don't fall out, those files will
surely stay.  "Pay rent" here is intended to mean something like
tracking usage over a long period in order to authoritatively identify
"old and useless" based on metrics and a policy.  Especially combined
with a Dnews-like trick file server that's really a cache and only
stores things people actually ask it for, which responds to the OP's
pain as I understand it, which is a frustration that their CPAN mirror
contains a lot of cruft. Although it still isn't clear why that is a
problem.

Purpose-based partitioning could be performed like deferred sidewalks:
put the pavement where the students make the trails in the grass.
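
As a rough sketch of what "pay rent" could mean as a policy, assuming a mirror
keeps a last-request timestamp per distribution and knows which distributions
are used as dependencies.  The function name, the data shapes and the idle
threshold are all hypothetical.

```python
SECONDS_PER_DAY = 86400


def eviction_candidates(last_requested, depended_on, now, max_idle_days):
    """List distributions that have fallen out the bottom.

    A distribution stays if it was requested within max_idle_days or if
    something depends on it (use as a dependency counts as rent).
    """
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    return sorted(
        dist
        for dist, stamp in last_requested.items()
        if stamp < cutoff and dist not in depended_on
    )


# Hypothetical usage data: timestamps in seconds.
last_requested = {
    "Foo-1.0": 0,
    "Bar-2.0": 0,
    "Baz-0.1": 900 * SECONDS_PER_DAY,
}
depended_on = {"Bar-2.0"}

print(eviction_candidates(
    last_requested, depended_on,
    now=1000 * SECONDS_PER_DAY, max_idle_days=365,
))
```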


Re: Trimming the CPAN - Automatic Purging

2010-03-30 Thread David Cantrell
On Sun, Mar 28, 2010 at 07:28:48AM -0700, dhu...@hudes.org wrote:

 The danger in a CPAN::Mini and in removing old versions is that one is
 assuming that the latest and greatest is the one to use. This is false.

And this is why I run cp5.6.2an.barnyard.co.uk etc.  

It wouldn't be difficult for someone to take my code and customise it
further to, eg, also pin a few modules that rely on the particular
versions of third-party libraries that you use.

-- 
David Cantrell | Bourgeois reactionary pig

Eye have a spelling chequer / It came with my pea sea
It planely marques four my revue / Miss Steaks eye kin knot sea.
Eye strike a quay and type a word / And weight for it to say
Weather eye am wrong oar write / It shows me strait a weigh.


Re: Trimming the CPAN - Automatic Purging

2010-03-30 Thread David Cantrell
On Sun, Mar 28, 2010 at 06:04:03PM -0400, David Golden wrote:

 As always with perl, it depends.  They are laid out just as a normal
 CPAN repository, so if you have one in your urllist, something
 specified as author/distribution.tar.gz might well resolve.

Not just might well resolve.  It *will* work.  If you use one of my
cpXXXan mirrors, you're hitting a BackPAN mirror with a custom index.

  *However*,
 they don't necessarily have up-to-date index files.  Compare
 timestamps on 02packages.details.txt

Indeed.  I don't imagine that that would be hard for Andreas to keep in
sync!

-- 
David Cantrell | even more awesome than a panda-fur coat

IMO, the primary historical significance of Unix is that it marks the
time in computer history where CPUs became so cheap that it was possible
to build an operating system without adult supervision.
 -- Russ Holsclaw in a.f.c


Re: Trimming the CPAN - Automatic Purging

2010-03-30 Thread Arthur Corliss

On Tue, 30 Mar 2010, Matija Grabnar wrote:


Er, not exactly. Read
http://www.cvsup.org/howsofast.html


I had read  http://www.cvsup.org/faq.html#features  item #3.

From what I can see, cvsup uses the rsync algorithm on a file-by-file basis 
(it uses just the differential send part of the rsync algorithm). It doesn't 
rsync the whole tree, which was what I understood to be the original problem 
(wasn't the complaint about the flood of stats?).


Sounds like I may have interpreted the FAQ incorrectly, then.  Thanks for
pointing that out.  I have a few questions, though: the explanation says:

   At the same time, the Tree Differ generates a list of the server's
   files.

That seems to imply that it's doing the exact same thing as rsync, so all 
the stats are still present on the server, right?


Nowhere do I see it mentioning that the daemon is maintaining state between
requests.  The primary speed-up (beyond special file update handling) is
better use of bidirectional bandwidth.

Do you have access to a cvsup server so you can verify its behavior?

So if you want to make a tool that works fine for large mirrors, your 
priority apparently should be to reduce the lots of stats part which is 
used to determine exactly what files need to be considered for checking. 
(Rsync already makes sure all the *other* I/O operations are minimized).


Agreed.

Now the key, as I see it, is that unlike all the other use cases where rsync 
is used, large mirrors are likely to have their directories directly 
transferred from another mirror. So, the client that pulled the tree update 
down could store a list of changed files, and the server could then just use 
that list to determine which files need to be synced to the downstream mirror. 
(Sure, the original site has to generate the list, but if they use a tool like 
PAUSE to upload the files, that shouldn't be hard to do).


Agreed, but I'm not sure we've gotten past the stat storm on the server,
though.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-30 Thread Arthur Corliss

On Tue, 30 Mar 2010, Rene Schickbauer wrote:

snip

This could work like any modern, distributed version control system. That 
way, the user would also be able to apply local patches and/or decide which 
changesets to pull in from the main server. Or have a complete, local mirror 
and one for the production systems where he/she pulls in changes after they 
have been reviewed.



NOW it's time to kick my butt, if you want to.


:-) No one can accuse you of not being ambitious.  It's a neat idea, but
definitely an involved solution.  While it could solve a lot of problems I
think the human component is going to be your biggest obstacle.  As we've
seen from the reaction to the heretical notion of ditching rsync I have to
imagine getting everyone to ditch their favorite RCS tool would be even
worse.

Basically, we should just all get onboard with git (disclaimer:  I don't use
git myself, so my understanding may be deficient), a decentralized
distributed RCS.  And have developers periodically merge their branches.

Tough sell.  It probably would solve a bunch of issues, but you're treading
into vi versus emacs territory.  ;-)

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-30 Thread David Nicol
On Sun, Mar 28, 2010 at 2:32 PM, Elaine Ashton eash...@mac.com wrote:

 On Mar 28, 2010, at 12:48 PM, Randy Kobes wrote:


 Has some sort of disk quota system for CPAN author accounts ever been 
 considered?

 Not specifically, no, at least not that I'm aware of. That would have to be 
 implemented on PAUSE and quotas frequently end up not solving the real 
 problem and create a headache both for the sysadmin and the users.

new proposal: Make modules pay rent in order to remain on a mirror.
Rent could be in the form of actual user interest, or good reviews.

Use as a dependency could count as rent.

Or simple downloading.  A mirror server that functioned more as a
cache than a mirror would also work: only the files that are actually
requested need be stored, as long as the mirror server knows how to
get something else if requested.  If the root cause of The Pain turns
out to be full mirroring then do partial mirroring, and automate the
partition with a policy instead of trying to plan the partition.
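
A minimal sketch of the cache-instead-of-mirror idea, assuming the upstream
fetch is handed in as a plain function and that content fits in memory.  The
class, its parameters and the injected clock are hypothetical; a real server
would store files on disk and speak HTTP or FTP.

```python
import time


class CachingMirror:
    """Store only what has been requested; drop what nobody asks for."""

    def __init__(self, fetch_upstream, max_idle_seconds, clock=time.time):
        self.fetch_upstream = fetch_upstream  # callable: path -> bytes
        self.max_idle = max_idle_seconds
        self.clock = clock
        self.store = {}      # path -> content
        self.last_used = {}  # path -> time of the most recent request

    def get(self, path):
        # Fetch from upstream only on a miss; every request renews the entry.
        if path not in self.store:
            self.store[path] = self.fetch_upstream(path)
        self.last_used[path] = self.clock()
        return self.store[path]

    def expire(self):
        # The partition is automated by policy: idle entries simply fall out.
        cutoff = self.clock() - self.max_idle
        stale = [p for p, used in self.last_used.items() if used < cutoff]
        for path in stale:
            del self.store[path]
            del self.last_used[path]
        return sorted(stale)
```

Anything expired this way is still reachable, since the next request for it
just goes back upstream.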




-- 
question doubt


Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Arthur Corliss

On Sun, 28 Mar 2010, dhu...@hudes.org wrote:


The entire point of rsync is to send only changes.
Therefore, once your mirror initially syncs, the old versions of modules are
not the issue. Indeed, removing the old versions would present additional
burden on synchronization! The ongoing burden is the ever-growing CPAN.


That's not entirely true, particularly when you're talking about rsync.
Remember, old synced data doesn't have to be transferred, but it still needs
to be checked for potential changes, something rsync does for every request.
That generates a crap load of I/O in the form of stats on the server.


The danger in a CPAN::Mini and in removing old versions is that one is
assuming that the latest and greatest is the one to use. This is false.
Take the case of someone running old software. I personally support
systems still running Informix Dynamic Server 7.31 as well as systems
running the latest IDS 11.5 build. We have Perl code that talks to IDS. If
DBD::Informix withdrew support for IDS 7.31 I would need both the last
version that supported it as well as the current.  I can get away with
upgrading Perl, maybe, but to upgrade the dbms is much more problematic
(license, for one thing; SQL changes another).


This is a good example of the potentials of pruning, to be certain.  Even if
all the authors dutifully documented all the necessary scenarios that would
require pinning specific versions on CPAN it's almost guaranteed that
there's still going to be collateral damage.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Arthur Corliss

On Sun, 28 Mar 2010, Nicholas Clark wrote:


Are you running a large public mirror site, where you don't even have
knowledge of who is mirroring from you?

(Not even knowledge, let alone channels of communication with, let alone
control over)

Because (as I see it, not having done any of this) the logistics of that is
going to have as much bearing on trying to change protocols as the actual
technical merits of the protocol itself.


I do run mirrors and am mirrored from.  Not on the scale of CPAN (in terms
of file count), but having been long aware of the effect of rsync servers I
have explored the scalability aspects of it.

It should have been obvious that trying to facilitate a cut-over to a new
syncing tool can't be done on this scale in one fell swoop.  Obviously,
there'd have to be a gradual migration where protocols are supported
concurrently, much like FTP and rsync are currently both supported.  We add a
new option and encourage people to move over.  Since we already have a list
of the public mirrors we should have some idea of where to start that
conversation.


Most of the cost of rsync is an externality to the clients. If one has an
existing mirror, one is using rsync to keep it up to date, what's the
incentive to change?


Common sense and professional courtesy.  Especially because it's likely that
some clients running public mirrors may be a sync source for some private
mirrors.  They may not feel the pain of the master repositories, but they
certainly share a portion.  And it's not likely that many mirrors have a 
capital budget to support scaling a free service, so it would be best to 
make efficient use of those resources.



I'm missing something here, I suspect. How can HTTP be more efficient than
rsync? The only obvious method to me of mirroring a CPAN site by HTTP is to
instruct a client (such as wget) to get it all. In which case, in the course
of doing this the client is going to recurse over the entire directory tree
of the server, which, I thought, was functionally equivalent to the behaviour
of the rsync server.


You are missing something, but I may have not been explicit enough.  HTTP or
FTP can easily be the payload transport, once you know the precise files
that need to be transferred.  That is tremendously more efficient than what
rsync does on the server.  So, use rsync (or FTP mgets, etc.) to transfer
your transaction logs, compile a list of new files to retrieve, and use the
very common and low-overhead protocols to transfer the files...
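
To make the transaction-log approach concrete, here is a small sketch,
assuming a chronological log with one "epoch operation path" entry per line.
That format, the operation names and the paths are invented for illustration;
they are not the format used by PAUSE or File::Rsync::Mirror::Recent.

```python
def plan_sync(log_lines, last_synced):
    """Turn a transaction log into the exact work a client has to do.

    Only entries newer than last_synced are considered, and for each path
    the latest entry wins, so a file uploaded and then deleted is never
    fetched.
    """
    latest = {}
    for line in log_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        stamp, operation, path = line.split(None, 2)
        if int(stamp) > last_synced:
            latest[path] = operation
    fetch = sorted(p for p, op in latest.items() if op == "new")
    delete = sorted(p for p, op in latest.items() if op == "delete")
    return fetch, delete


# Hypothetical log contents.
log = [
    "1269800000 new authors/id/A/AA/Foo-1.0.tar.gz",
    "1269900000 new authors/id/B/BB/Bar-0.2.tar.gz",
    "1269950000 delete authors/id/A/AA/Foo-1.0.tar.gz",
    "1269960000 new authors/id/C/CC/Baz-3.1.tar.gz",
]

fetch, delete = plan_sync(log, last_synced=1269850000)
print(fetch)
print(delete)
```

The server's cost is serving one log file; each path in the fetch list can
then be retrieved with an ordinary HTTP or FTP request, with no tree walk on
either side.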

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Arthur Corliss

On Sun, 28 Mar 2010, Elaine Ashton wrote:


I do very much like Tim's proposal for giving old modules a push to BackPAN 
since, with proper communication of the changes to the authors along with a way 
to mark exceptions, this would rid CPAN of a lot of cruft that should be on 
BackPAN anyway.


I'm not trying to be a dick (not intentionally, anyway), but isn't that
basically making your problem BackPAN's problem?

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Arthur Corliss

On Sun, 28 Mar 2010, Andreas J. Koenig wrote:


Says the author of a module named Paranoid. A lovely coincidence.


:-) As they say, just because you may be paranoid, it doesn't mean that no
one's out to get you.


If you want to study the CPAN checkpointed logs solution running on
the very CPAN for exactly one year now: File::Rsync::Mirror::Recent

What needs to be done is really extremely trivial: rewrite it in C and
convince the rsync people to include it in the rsync code base. Just that.

So are you a taker, Arthur?


Heh, nice.  That sounds much more involved than my proposal, plus it leaves
us entirely at the mercy of an outside organization (the rsync folks) who
may or may not care about our needs.

I think it would be a worthy cause ultimately, but certainly a much longer
time to implementation, and considerably more effort.  Kind of sounds like
the normal stonewalling I've been getting these last few days by our
resident rsync fetishists.

Very ironic.  I use the hell out of rsync, just more discriminately than you
guys, and yet I'm public enemy number one.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Arthur Corliss

On Sun, 28 Mar 2010, Dana Hudes wrote:


Use of wget and http to download an entire site means numerous TCP opens and 
HTTP GET requests. The entire point of rsync is that it knows there are 
numerous downloads. It does ONE open. This allows TCP slow start to ramp up


That wasn't exactly what I was suggesting.  And we'll ignore HTTP's
Keep-Alive support for the time being which negates your TCP open issue.  If
you're fetching transaction logs by which you can determine beforehand
precisely what files to retrieve HTTP or FTP will beat the pants off of
allowing rsync to tell you what you need to retrieve and delivering it.


A multi-download session with ftp is also efficient. Clients like ncftp have 
batch transfer built in. If setting up an initial mirror you might do better 
with ftp but maintaining it is where rsync rules.

I haven't looked closely but I have the impression from watching wget work that 
wget using HTTP::Date opens two TCP connections per file: it opens a socket and 
issues a request for the timestamp, then closes it, then opens a socket to issue an 
HTTP GET if it wants the file. Then it closes that socket and the process 
repeats for the next file. It keeps hoping for the timestamp even if the server 
doesn't support HTTP::Date.

Rsync and ftp are stateful; http is not. For just getting one file, http is 
better, since you skip the whole login thing and setting up data and control 
sockets.
So a CPAN client session will do better with an http mirror: it gets a tar.gz, 
opens it up, processes it, and then goes back many seconds after the original 
request for the first dependency. Repeat until the entire dependency tree is completed.


Dude, you definitely don't understand what we're discussing.  And neither
rsync, ftp, nor http is stateful -- that's the problem.  Rsync has to
build a picture of the repository's state *per* request, even the old files
that haven't been touched in years.  It then uses that information to select
and deliver the new files you need.  Maintaining state means that you
maintain knowledge of state over time, across multiple requests.  And rsync
doesn't do that, it simulates that.  Quite cleverly, but in a very
expensive way which is borne by the server.
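
To make the distinction concrete, here is a small Python sketch of what "maintaining state across requests" could look like on the server side. The class and method names are hypothetical; this is an illustration of the concept, not a description of how rsync or PAUSE works.

```python
class ChangeJournal:
    """Server-side record of changes, updated as files arrive or leave.

    Hypothetical sketch: whatever process accepts uploads calls record(),
    so answering a client never requires scanning the file tree.
    """

    def __init__(self):
        self._events = []  # list of (sequence, action, path)

    def record(self, action, path):
        """Append one change and return its sequence number (1-based)."""
        seq = len(self._events) + 1
        self._events.append((seq, action, path))
        return seq

    def since(self, cursor):
        """Return (events with sequence > cursor, new cursor value)."""
        events = self._events[cursor:]
        new_cursor = events[-1][0] if events else cursor
        return events, new_cursor
```

A client remembers only the last cursor it was given and hands it back on the next visit, so the cost of a request is proportional to what changed rather than to the size of the archive.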

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Dana Hudes
Orcallator, procallator and friends aren't shiny new toys.
Adrian Cockcroft wrote the initial version of orcallator in the early 90s for his 
book Solaris Performance Tuning. The 2nd edition is, I think, from 1998. 
The current version of ORCA (which processes the collected data) is from, I believe, 
2007 or so.
www.orcaware.org, I think it was.
Sent from my BlackBerry® smartphone with Nextel Direct Connect

-Original Message-
From: Arthur Corliss acorl...@nevaeh-linux.org
Date: Mon, 29 Mar 2010 00:31:50 
To: Dana Hudes dhu...@hudes.org
Cc: module-authors@perl.org
Subject: Re: Trimming the CPAN - Automatic Purging

On Sun, 28 Mar 2010, Dana Hudes wrote:

 Why is rsync a problem? Where is the bottleneck in the protocol or the code 
 implementing it?
 Specifics!
 SAR is antiquated and doesn't give the info you really need. Using a linux 
 system? Use procallator and feed resulting collected data to ORCA. Better 
 yet, use DTrace or at least truss.  Compile rsync with profiling code -- use 
 Sun Studio 12; it runs on Linux as well as Solaris and it's a free download.

Wow.  You kids and your new shiny toys...  Look, here's a nice little
specific example for you.  I run an rsync server that contains 8,700+ files
and directories.  Now, say I want to sync a mere thirty-two new files.
Making that request on my server causes the rsync daemon to stat the entire
hierarchy to the tune of 18,000+ fstat and lstat calls.  Per request.  Freaking ouch.
And that's a tolerable use-case in my mind for rsync.  That's a hell of a lot
of I/O generated which would take but a couple of stats to retrieve via HTTP or
FTP.  Assuming you knew what you needed already.

Now, when you add in a file set of sufficient size to exhaust filesystem
caching, plus a crap load of concurrent requests, my archaic SAR reports
written on stone tablets tend to say your I/O wait states start pushing the
load levels unacceptably high, not to mention the pages being thrashed from
memory's cache pool, high interrupts and excessive seeks on the drives, and
so on and so forth.  *sniff*  Cavemen are people, too.

Now, look at the size of CPAN with *hundreds* of thousands of files.  Can
you imagine that amount of I/O *per* request?!

 From a network protocol perspective rsync is quite good. If your network 
 capacity is so large that it exceeds bandwidth or IOPs of your disks you 
 probably can afford better disks or a more efficient disk storage layout.
 Are mirrors like nic.funet.fi running multiple gigabit WAN connections?  If 
 so they could sure demand stream more than a bunch of SATA2 disks can provide.

 Without performance data it's a waste of time to argue against rsync

And without having examined how rsync works on both ends it should have 
been a waste of time to argue the merits of rsync.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Steffen Mueller

Hi Elaine,

Elaine Ashton wrote:

On Mar 28, 2010, at 12:48 PM, Randy Kobes wrote:
Jarkko and I were talking about it this morning - as he's not in
favour of pruning - while trying to think of a way around the size
problem and he reminded me of the idea that, if I recall correctly,
was Andreas' suggestion a while back, there be an A, B and C 'PAN' of
sorts where you could pull varying degrees of content - sort of
CPAN::Mini writ large. I don't think that idea ever got any traction
because it wouldn't really solve some of the issues for the major
upstream mirrors and the mechanics of deciding where to draw the
lines between them. I still think it's a good idea though.


This sounds a bit like the CPAN - backpan scheme but with some 
additional levels?



I do very much like Tim's proposal for giving old modules a push to
BackPAN since, with proper communication of the changes to the
authors along with a way to mark exceptions, this would rid CPAN of a
lot of cruft that should be on BackPAN anyway.


I'm not even going to throw in my considerable weight on this whole 
debate of pruning*. But if backpan became the official way to access 
old versions starting from yesterday's, wouldn't that mean:


a) That the toolchain would have to be adapted to a tiered 
infrastructure (think of the indexes...)

and more importantly:
b) The backpan would have to be mirrored all over the place as well, 
thus pushing the problem to the next level?


Best regards,
Steffen

* If you must know, I don't like the means but sympathize with the goals.

PS: This isn't targeted at Elaine specifically, but can everybody please 
take a step back and relax? Please be civil.


Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Arthur Corliss

On Sun, 28 Mar 2010, Dana Hudes wrote:


Why is rsync a problem? Where is the bottleneck in the protocol or the code 
implementing it?
Specifics!
SAR is antiquated and doesn't give the info you really need. Using a linux system? 
Use procallator and feed resulting collected data to ORCA. Better yet, use 
DTrace or at least truss.  Compile rsync with profiling code -- use Sun Studio 
12; it runs on Linux as well as Solaris and it's a free download.


Wow.  You kids and your new shiny toys...  Look, here's a nice little
specific example for you.  I run an rsync server that contains 8,700+ files
and directories.  Now, say I want to sync a mere thirty-two new files.
Making that request on my server causes the rsync daemon to stat the entire
hierarchy to the tune of 18,000+ fstat and lstat calls.  Per request.  Freaking ouch.
And that's a tolerable use-case in my mind for rsync.  That's a hell of a lot
of I/O generated which would take but a couple of stats to retrieve via HTTP or
FTP.  Assuming you knew what you needed already.

Now, when you add in a file set of sufficient size to exhaust filesystem
caching, plus a crap load of concurrent requests, my archaic SAR reports
written on stone tablets tend to say your I/O wait states start pushing the
load levels unacceptably high, not to mention the pages being thrashed from
memory's cache pool, high interrupts and excessive seeks on the drives, and
so on and so forth.  *sniff*  Cavemen are people, too.

Now, look at the size of CPAN with *hundreds* of thousands of files.  Can
you imagine that amount of I/O *per* request?!
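
The gap between "stat everything" and "stat only what you already know you need" can be reproduced on a throwaway directory tree. The Python sketch below is only an illustration under stated assumptions: it counts one lstat per directory entry, which is a simplification and not a model of what the rsync daemon actually does, and the tree sizes are made up.

```python
import os
import stat
import tempfile


def lstats_for_full_walk(root):
    """Count lstat calls needed to look at every entry under root."""
    calls = 0
    pending = [root]
    while pending:
        directory = pending.pop()
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            mode = os.lstat(path).st_mode
            calls += 1
            if stat.S_ISDIR(mode):
                pending.append(path)  # descend into real directories only
    return calls


def lstats_for_known_files(paths):
    """Count lstat calls needed when the wanted paths are already known."""
    calls = 0
    for path in paths:
        os.lstat(path)
        calls += 1
    return calls


def demo(dirs=10, files_per_dir=20, new_files=3):
    """Build a temporary tree and compare the two approaches."""
    with tempfile.TemporaryDirectory() as root:
        for d in range(dirs):
            sub = os.path.join(root, f"dir{d}")
            os.mkdir(sub)
            for f in range(files_per_dir):
                open(os.path.join(sub, f"old{f}.txt"), "w").close()
        fresh = []
        for n in range(new_files):
            path = os.path.join(root, "dir0", f"new{n}.txt")
            open(path, "w").close()
            fresh.append(path)
        return lstats_for_full_walk(root), lstats_for_known_files(fresh)


if __name__ == "__main__":
    full, known = demo()
    print(f"full walk: {full} lstats, known files: {known} lstats")
```

The full walk grows with the size of the tree; the known-files count grows only with the number of new files.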


From a network protocol perspective rsync is quite good. If your network 
capacity is so large that it exceeds bandwidth or IOPs of your disks you 
probably can afford better disks or a more efficient disk storage layout.
Are mirrors like nic.funet.fi running multiple gigabit WAN connections?  If so 
they could sure demand stream more than a bunch of SATA2 disks can provide.

Without performance data it's a waste of time to argue against rsync


And without having examined how rsync works on both ends it should have 
been a waste of time to argue the merits of rsync.


--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Dana Hudes
I think that Andreas's concept of treating these mirrors as a database is good. 
Checkpoint logical log replay is better than a simple rsync for large numbers 
of files.  

The replication problem for databases is well understood and open-source code 
for it is available from at least PostgreSQL. 

Grab the current log and any logs you're missing since last update and off you 
go.
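
A rough Python sketch of that catch-up decision, assuming numbered log segments and a periodic full checkpoint. The function name, the numbering scheme, and the fallback rule are assumptions for illustration only; they are not taken from PostgreSQL or from any existing CPAN tool.

```python
def plan_catch_up(last_applied, available_logs, checkpoint_seq):
    """Decide what a mirror should download to catch up.

    last_applied   -- highest log sequence number the mirror has replayed
    available_logs -- sequence numbers of log segments still on the master
    checkpoint_seq -- sequence number that the latest full checkpoint covers
    """
    needed = sorted(seq for seq in available_logs if seq > last_applied)
    expected = list(range(last_applied + 1, last_applied + 1 + len(needed)))
    if needed == expected:
        # No gap: replaying the missing segments in order is enough.
        return ("logs", needed)
    # A gap means an older segment was purged: start from the checkpoint,
    # then replay whatever was logged after it.
    after_checkpoint = sorted(s for s in available_logs if s > checkpoint_seq)
    return ("checkpoint", after_checkpoint)
```

For example, a mirror at segment 5 with segments 4 to 8 available only needs 6, 7 and 8, while a mirror at segment 2 whose next segments were already purged has to restart from the checkpoint.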
Another approach, which is a non-starter practically speaking but which I will 
mention anyway:
Use zfs. Make one filesystem for each mirrored project (CPAN, freshmeat, etc). 
Daily or at some other regular interval make a zfs snapshot. Purge old ones after 
some reasonable time such as 2 days. Mirror sites request a zfs incremental 
stream, giving the name of their last received snapshot and that of the current one. 
While zfs is available for Solaris 10, OpenSolaris and I believe FreeBSD (the 
Mac OSX port halted IIRC), it isn't widely available enough for major mirrors to use.
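
For reference, the snapshot and incremental-stream steps map onto the `zfs snapshot` and `zfs send -i` commands. The Python helpers below only build the command lines and do not run them; the dataset name `tank/cpan` and the snapshot names are placeholders, and how the stream would reach the mirror (for example piped into `zfs receive` on the other end) is left open.

```python
def snapshot_command(dataset, name):
    """Command line that creates a snapshot called dataset@name."""
    return ["zfs", "snapshot", f"{dataset}@{name}"]


def incremental_send_command(dataset, last_received, current):
    """Command line that streams the changes between two snapshots."""
    return [
        "zfs", "send", "-i",
        f"{dataset}@{last_received}",
        f"{dataset}@{current}",
    ]


if __name__ == "__main__":
    # Placeholder dataset and snapshot names.
    print(" ".join(snapshot_command("tank/cpan", "2010-03-29")))
    print(" ".join(incremental_send_command("tank/cpan", "2010-03-28", "2010-03-29")))
```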

Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Eric Wilhelm
# from Andreas J. Koenig
# on Saturday 27 March 2010 21:02:

If you want to study the CPAN checkpointed logs solution running on
the very CPAN for exactly one year now: File::Rsync::Mirror::Recent

What needs to be done is really extremely trivial: rewrite it in C and
convince the rsync people to include it in the rsync code base. Just that.

Or even write an rsync daemon (or proxy perhaps) in Perl.  So, when the 
client asks for a file, you can answer without checking the disk.  Can 
something like that work with an unmodified client, or does the amount 
of data needed to answer a naive client overwhelm any potential gain?

Unfortunately the protocol is not formally documented and the perl code 
I've seen (File::RsyncP) seems to be lagging:

  http://lists.samba.org/archive/rsync/2008-October/021912.html

If it's possible for a mirror operator to install something that will 
immediately save them a ton of disk I/O without any changes upstream or 
downstream, then the person who makes the decision (and does the work) 
gets the benefit.  Scenarios where authors or downstream mirrors must 
do something special are a tougher sell.

--Eric
-- 
Turns out the optimal technique is to put it in reverse and gun it.
--Steven Squyres (on challenges in interplanetary robot navigation)
---
http://scratchcomputing.com
---


Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Elaine Ashton

On Mar 28, 2010, at 12:52 AM, Arthur Corliss wrote:
 
 :-) You'll have to pardon my indiscriminate epithets.  The barbs are coming
 from multiple directions.  My point still stands, however.  Your experience,
 however worthy, has zero bearing on whether or not my experience is
 just as worthy.  Even moreso when you guys have zero clue who you're talking
 to.  And you shouldn't have to know.  I would have thought simple communal 
 and professional courtesy would be extended and all points considered in 
 earnest.  Which does not appear to be the case.

I'm not sending any barbs, only my reasonable opinion borne from years on the 
reality-based operations side of this equation. As for who you are, it doesn't 
matter as I work daily with those who wrote, and continue to write, large 
chunks of operating systems, X, etc., and though their legend may precede them 
when it comes to my having to implement what works fabulously in their 
imagination, I do my best to bring them back to the grim reality that is 
operations. It's a frequent problem of engineers and those of us stuck having 
to live with and fix their grand ideas. Lofty goals usually die somewhere 
between dreams and production. 

 Ah, you're one of them.  All objects look like nails when all you have is a
 hammer, eh?  Rsync is a good tool, but like Perl, it isn't the perfect tool
 for all tasks.  You've obviously exceeded what the tool was designed for,
 it's only logical to look for (or write) another tool.  Ironically, what I'm 
 suggesting is so basic that rsync can be replaced by a script which will 
 likely run on every mirror out there with no more fuss than rsync.

Well, you'll have to forgive those who mock your naïveté, as if it were so basic 
and trivial to replace rsync, it would have been done several times over by now, 
as its limitations are well known to all who use it on any large scale. 
However, it is a well-known, well-used, multi-platform and time-tested tool 
that will not be unseated very easily without good reason, and a reason that 
reads something along the lines of "improving performance on an archive that 
should have been trimmed back a bit" is not a compelling reason for adoption. 

 What you're overlooking is that CPAN has, and will, continue to grow.  Even 
 if you remove the cruft now at some point it might grow to the same size just 
 with fresh files.  When that happens, you're right back where you are now.  
 Rsync can't cut it, it wasn't designed for this.

And this is a good point to make, yes, it will continue to grow and I know that 
the current manager(s) of nic.funet.fi have commented on the burden it presents 
to the system which is also home to a number of other mirrors. You cannot 
assume that the generosity and the resources of the mirror ops are limitless 
and finding out where that limit lies will come too late to make amends. 

Pruning back the archive is a good compromise until and unless another solution 
can be done that will not bother the mirror ops terribly much in terms of real 
work.

e.

Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread dhudes
The entire point of rsync is to send only changes.
Therefore once your mirror initially syncs, the old versions of modules are
not the issue. Indeed, removing the old versions would present additional
burden on synchronization! The ongoing burden is the ever-growing CPAN.

The danger in a CPAN::Mini and in removing old versions is that one is
assuming that the latest and greatest is the one to use. This is false.
Take the case of someone running old software. I personally support
systems still running Informix Dynamic Server 7.31 as well as systems
running the latest IDS 11.5 build. We have Perl code that talks to IDS. If
DBD::Informix withdrew support for IDS 7.31 I would need both the last
version that supported it as well as the current.  I can get away with
upgrading Perl, maybe, but to upgrade the dbms is much more problematic
(license, for one thing; SQL changes another).





Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Nicholas Clark
On Sat, Mar 27, 2010 at 08:52:22PM -0800, Arthur Corliss wrote:
 On Sat, 27 Mar 2010, Elaine Ashton wrote:
 
 Actually, I thought I was merely offering my opinion both as the sysadmin 
 for the canonical CPAN mothership and as an end-user. If that makes me a 
 prick, well, I suppose I should go out and buy one :)
 
 :-) You'll have to pardon my indiscriminate epithets.  The barbs are coming
 from multiple directions.  My point still stands, however.  Your experience,
 however worthy, has zero bearing on whether or not my experience is
 just as worthy.  Even moreso when you guys have zero clue who you're talking

Are you running a large public mirror site, where you don't even have
knowledge of who is mirroring from you?

(Not even knowledge, let alone channels of communication with, let alone
control over)

Because (as I see it, not having done any of this) the logistics of that are
going to have as much bearing on trying to change protocols as the actual
technical merits of the protocol itself.

Most of the cost of rsync is an externality to the clients. If one has an
existing mirror, one is using rsync to keep it up to date, what's the
incentive to change?

 Sounds like you may be hamstrung by your own bureaucracy, but that's rarely
 the case in most of the places I've worked.  Not to mention that between
 passive mode FTP or even using an HTTP proxy (most of which support FTP
 requests) what I'm proposing is relatively painless, simple, and easy to
 secure.  This concern I suspect is a non-issue for most mirror operators.
 Even if it was, allow them to pull it via HTTP for all I care.  Either one
 is significantly more efficient than rsync.

I'm missing something here, I suspect. How can HTTP be more efficient than
rsync? The only obvious method to me of mirroring a CPAN site by HTTP is to
instruct a client (such as wget) to get it all. In which case, in the course
of doing this the client is going to recurse over the entire directory tree
of the server, which, I thought, was functionally equivalent to the behaviour
of the rsync server.

Nicholas Clark


Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Jonathan Yu
On Sun, Mar 28, 2010 at 12:55 PM, Dana Hudes dhu...@hudes.org wrote:
 But you can't use CPAN.pm on the Backpan.
Can't you? It's just a mirror, so if you point CPAN.pm to the backpan,
you should be able to install packages from there (though to get the
version you want you'll need to specify the author/package name
manually I think).

Of course, I've never done this myself, so I could be mistaken.

 --Original Message--
 From: Shlomi Fish
 To: module-authors@perl.org
 Cc: dhu...@hudes.org
 Sent: Mar 28, 2010 11:31 AM
 Subject: Re: Trimming the CPAN - Automatic Purging

 On Sunday 28 Mar 2010 17:28:48 dhu...@hudes.org wrote:
 The entire point of rsync is to send only changes.
 Therefore once your mirror initially syncs, the old versions of modules are
 not the issue. Indeed, removing the old versions would present additional
 burden on synchronization! The ongoing burden is the ever-growing CPAN.

 The danger in a CPAN::Mini and in removing old versions is that one is
 assuming that the latest and greatest is the one to use. This is false.
 Take the case of someone running old software. I personally support
 systems still running Informix Dynamic Server 7.31 as well as systems
 running the latest IDS 11.5 build. We have Perl code that talks to IDS. If
 DBD::Informix withdrew support for IDS 7.31 I would need both the last
 version that supported it as well as the current.  I can get away with
 upgrading Perl, maybe, but to upgrade the dbms is much more problematic
 (license, for one thing; SQL changes another).

 You can always get the old versions from the Backpan, which keeps all
 historical versions - so it's a non-issue.

 Regards,

        Shlomi Fish

 --
 -
 Shlomi Fish       http://www.shlomifish.org/
 Best Introductory Programming Language - http://shlom.in/intro-lang

 Deletionists delete Wikipedia articles that they consider lame.
 Chuck Norris deletes deletionists whom he considers lame.

 Please reply to list if it's a mailing list post - http://shlom.in/reply .




Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Dana Hudes
Why is rsync a problem? Where is the bottleneck in the protocol or the code 
implementing it?
Specifics!
SAR is antiquated and doesn't give the info you really need. Using a linux system? 
Use procallator and feed resulting collected data to ORCA. Better yet, use 
DTrace or at least truss.  Compile rsync with profiling code -- use Sun Studio 
12; it runs on Linux as well as Solaris and it's a free download. 

From a network protocol perspective rsync is quite good. If your network 
capacity is so large that it exceeds bandwidth or IOPs of your disks you 
probably can afford better disks or a more efficient disk storage layout. 
Are mirrors like nic.funet.fi running multiple gigabit WAN connections?  If so 
they could sure demand stream more than a bunch of SATA2 disks can provide. 

Without performance data it's a waste of time to argue against rsync.


Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Aristotle Pagaltzis
* Graham Barr gb...@pobox.com [2010-03-26 10:20]:
 On Mar 25, 2010, at 8:42 AM, Barbie wrote:
 Lastly I would also personally be annoyed if only the latest
 versions were available, as I often make great use of the diff
 tool on search.cpan.org. Having only the latest version
 renders that great tool redundant :(

 I use that too :-) and it is very annoying that some authors
 automatically delete previous releases when they upload a new
 one.

Why does that have to be constrained by the current availability
of modules? Couldn’t search.cpan.org simply not honour deletions?
Would there be any serious reason against this?

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Aristotle Pagaltzis
* Nicholas Clark n...@ccl4.org [2010-03-28 18:20]:
 I'm missing something here, I suspect.

Yes, you are.

 How can HTTP be more efficient than rsync? The only obvious
 method to me of mirroring a CPAN site by HTTP is to instruct
 a client (such as wget) to get it all.

As Arthur has repeatedly pointed this out: by first fetching
a transaction log from the remote end, then playing it forward
from the last synch point.

(This is essentially what CPAN::Mini already does.)

It’s not very efficient protocol-wise, but it sure is rather
cheap in terms of server I/O.
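
A minimal Python sketch of "playing the log forward", assuming the log has already been fetched and parsed into `(action, path)` pairs. The function name and event shape are hypothetical, and this is not a description of how CPAN::Mini is implemented.

```python
def replay(local_files, events):
    """Work out the net effect of a run of log events on a local mirror.

    local_files -- set of paths the mirror currently holds
    events      -- iterable of ("add" | "delete", path) in log order

    Returns (to_download, to_remove) as sorted lists.
    """
    to_download, to_remove = set(), set()
    for action, path in events:
        if action == "add":
            to_download.add(path)
            to_remove.discard(path)
        elif action == "delete":
            to_download.discard(path)  # added then deleted: never fetch it
            if path in local_files:
                to_remove.add(path)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return sorted(to_download), sorted(to_remove)
```

Files that were uploaded and deleted again between two syncs cancel out, so the client transfers only what it still needs.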

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Aristotle Pagaltzis
* Dana Hudes dhu...@hudes.org [2010-03-29 04:30]:
 Using http for this is inefficient. It makes for slower file
 transfer because you keep rerunning path mtu probes and tcp
 slow start. It makes extra socket handles opening and closing.

Errm, you missed the last decade. (HTTP/1.1 has keep-alive and
pipelining and it’s 10 years old now.)
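
As an illustration of the keep-alive point, the Python sketch below fetches several paths over a single `http.client.HTTPConnection`, which reuses the underlying socket when the server keeps the connection open. The function name is made up, and the host, port and paths in the usage comment are placeholders.

```python
import http.client


def fetch_many(host, port, paths):
    """GET each path over one persistent HTTP/1.1 connection."""
    conn = http.client.HTTPConnection(host, port, timeout=30)
    bodies = {}
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            body = response.read()  # drain fully before the next request
            if response.status == 200:
                bodies[path] = body
    finally:
        conn.close()
    return bodies


# Usage (placeholder host and paths):
#     fetch_many("mirror.example.org", 80, ["/a.tar.gz", "/b.tar.gz"])
```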

 In the case of CPAN you don't have to go the log route. If the
 mirror knows it last synch time it can use rsync to get the
 modlist et al and import to SQLITE then query by date to come
 up with the list of files to fetch -- via ftp.

Say what? Stat via rsync to feed an SQLite database that drives
an FTP transfer? Could you even possibly come up with a more
Rube-Goldbergian construction?

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Andy Armstrong
On 26 Mar 2010, at 23:32, Arthur Corliss wrote:
 But it's the weakest and simplest link to replace.


Quite a bit of the discussion here on this topic has revolved around an 
explanation of why that isn't the case. Setting up rsync is trivial for mirror 
operators. Any alternative would likely be less so.

-- 
Andy Armstrong, Hexten





Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Andy Armstrong
On 27 Mar 2010, at 00:59, Elaine Ashton wrote:
 The only snag I can foresee in trimming back on the abundance of modules is 
 the case where some modules have version requirements for other modules where 
 it will barf with a mismatch/newer version of the required module (I bumped 
 into this recently but can't remember exactly which module it was) but I 
 think it's rare and the practise should be discouraged.


Maybe that could be solved by having the clients (and maybe search.cpan.org) 
automagically fall back to a backpan mirror?

And, yes, if it's considered a good idea I /am/ prepared to do something about 
it.

-- 
Andy Armstrong, Hexten





RE: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Jan Dubois
On Fri, 26 Mar 2010, Arthur Corliss wrote:
 But what the hell do I know.  I don't run a *CPAN* mirror, so I must be
 freaking clueless...

It's not about what you know, but about what you are willing to
do yourself.

At some point you have to accept that the people who *do* the work
decide *how* they do it.

There is not much point in just telling volunteers that they should
not be doing something but instead be doing something else if you are
not willing to take on the burden of doing this other thing yourself.

Volunteers are not free labor that the talking masses can direct with
majority votes. :)

Cheers,
-Jan




Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Jarkko Hietaniemi

On Friday-201003-26 13:20, Arthur Corliss wrote:

On Fri, 26 Mar 2010, Andy Lester wrote:


Absolutely.  This factual info would ideally look like this:

Of the 17,000 distros on CPAN, there are 8,000 that have versions more than a year 
older than the most recent one.  If those distros with versions more than a year out of 
date were purged, the number of files would decrease from 200,000 to 120,000.  This would 
save 7GB out of the 12GB that a full CPAN mirror takes now.  Removing that 7GB would mean 
Benefit X to mirror owners.

Without that, how can module authors be bothered to care?


If you don't mind me interjecting, I still can't be bothered to care.  We
have basically a 12GB data set, and we're worried about that?  I see that as a
small barrier to bringing on new mirrors on constrained pipes, but
ultimately that's not that big a deal.  Hell, there are single versions of
some Linux distros that are bigger than that.


The total size is not the problem.  The number of files is.  Vanilla
rsync is horribly inefficient (not the protocol, which is genius, mind)
because a client coming by and asking for updates basically ends up
requiring the moral equivalent of
find . -type f -print.  Let me repeat that: each client.  Not fun.
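
A back-of-the-envelope Python sketch of why "each client" matters. The file count of roughly 200,000 is the approximate figure used elsewhere in this thread; the client count and sync frequency are invented purely for illustration, and one stat per file is a simplifying assumption.

```python
def daily_stat_calls(files, clients, syncs_per_client_per_day, stats_per_file=1):
    """Rough number of stat calls a server performs per day when every
    sync walks the whole tree."""
    return files * stats_per_file * clients * syncs_per_client_per_day


if __name__ == "__main__":
    # 200,000 files; 100 clients and 4 syncs a day are made-up numbers.
    print(daily_stat_calls(files=200_000, clients=100, syncs_per_client_per_day=4))
```

The load scales with files times clients, which is why the number of files, not the total size, is the figure that hurts.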



Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Jarkko Hietaniemi

On Friday-201003-26 19:02, Arthur Corliss wrote:

On Fri, 26 Mar 2010, Jarkko Hietaniemi wrote:


The total size is not the problem.  The number of files is.  Vanilla
rsync is horribly inefficient (not the protocol, which is genius, mind)
because a client coming by and asking for updates basically ends up
requiring the moral equivalent of
find . -type f -print.  Let me repeat that: each client.  Not fun.


Why use rsync, then?  Why not have checkpointed logs on cpan with
additions/removals logged by date so you can roll forward on the client,
processing only those files?  It would be trivial to set up and a lot more
efficient.


We await your implementation breathlessly.  By the time all the CPAN 
mirrors have started using that, we probably will be rather blue in
the face.


--Arthur Corliss
  Live Free or Die





Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Arthur Corliss

On Fri, 26 Mar 2010, Elaine Ashton wrote:


Oh, don't be such a drama queen. I rebuilt and helped run nic.funet.fi for 2 
years which is the canonical mirror for a large number of mirrors and the 
perspective of having a few terabytes spinning in storage changes quite 
dramatically when you are actually serving a few terabytes to thousands of 
clients. CPAN grew to be quite a burden on the site not only because of the 
high demand, but also because of the multitude of small files and I'm sure 
other mirrors feel similarly burdened.


Don't be such an arrogant prick.  You guys made baseless assumptions about
people's experience with storage management in an attempt to disregard their
opinions.  That's being a dick by any metric.


The sort of pruning Tim brought up has long been an idea, but with the current 
and growing size of the archive, something does need to be done to alleviate 
the burden not only on the canonical mirrors, but also on the random folks who 
want to grab a local mirror for themselves. In my present work environment, 
12gb isn't a lot of disk space, but it's a lot considering I don't need to 
install perl modules daily and the vast majority of it I'll likely never use. 
It would be a kindness to both the mirror operators and to the end-users to 
trim it down to a manageable size.


I think I was quite explicit in saying that efficiencies should be pursued
in multiple areas, but the predominant bitch I took away from your thread
dealt with the burden of synchronizing mirrors.  What's the easiest way to
address that pain?  I don't believe it's your method.  I'd look into the
size issue *after* you address the incredible inefficiencies of a simple
rsync.


As for efficiency, rsync remains a good tool for the job that works on nearly 
every platform which is a rather tall order to match with any other solution. 
Relegating the cruft to BackPAN to make the current CPAN slimmer and less 
demanding on all fronts is an idea that would be welcomed by more than just 
mirror ops.


Rsync is an excellent tool for smaller file sets.  I use it to sync my own
mirrors, those mirrors are typically ~10k files.  Am I surprised that it
doesn't scale when you're stat'ing every single file?  No.  Which is why
alternatives should be considered.  A simple FTP client playing a
transaction log forward is trivial.
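
A minimal sketch of the download half of such a client, using Python's standard `ftplib`. It assumes the list of paths has already been derived from a transaction log; the function name, host name and paths are placeholders, and error handling and deletions are left out.

```python
import os
from ftplib import FTP


def fetch_paths(ftp, paths, dest_root):
    """Download each path over an already-connected ftplib.FTP session,
    recreating the remote directory layout under dest_root."""
    fetched = []
    for path in paths:
        local = os.path.join(dest_root, *path.split("/"))
        os.makedirs(os.path.dirname(local), exist_ok=True)
        with open(local, "wb") as handle:
            ftp.retrbinary(f"RETR {path}", handle.write)
        fetched.append(local)
    return fetched


# Usage against a real server (host name and path are placeholders):
#     with FTP("ftp.example.org") as ftp:
#         ftp.login()  # anonymous login
#         fetch_paths(ftp, ["authors/id/X/XX/XXX/Foo-1.0.tar.gz"], "/srv/mirror")
```

The server is asked for specific files only, so it does no tree walk on the client's behalf.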

I maintain several mirrors, most with rsync.  But that's with a clear
understanding of the size of the file set.  Use the right tool for the job.
And it seems apparent to me that rsync isn't the right tool for ~200k files.


The only snag I can foresee in trimming back on the abundance of modules is the 
case where some modules have version requirements for other modules where it 
will barf with a mismatch/newer version of the required module (I bumped into 
this recently but can't remember exactly which module it was) but I think it's 
rare and the practise should be discouraged.


Try doing a simple cost-benefit analysis.  What you guys are proposing will
help.  But not as much as simpler alternatives.  Like replacing rsync with a
perl script and modifying PAUSE to log the transactions.
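
For illustration, the logging side could be as small as the Python sketch below, which appends one JSON record per upload or deletion. The function name, record fields and file name are hypothetical and are not taken from PAUSE.

```python
import json
import time


def log_transaction(log_path, action, path, now=None):
    """Append one upload or deletion record to a line-oriented log.

    Hypothetical hook: it would be called by whatever code accepts an
    upload ("add") or processes a deletion ("delete").
    """
    record = {
        "time": int(time.time() if now is None else now),
        "action": action,
        "path": path,
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Because the log is append-only and ordered by time, a mirror that remembers where it stopped reading can resume from that point on its next run.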

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Nicholas Clark
On Sat, Mar 27, 2010 at 10:52:05AM -0800, Arthur Corliss wrote:

 I think I was quite explicit in saying that efficiencies should be pursued
 in multiple areas, but the predominant bitch I took away from your thread
 dealt with the burden of synchronizing mirrors.  What's the easiest way to
 address that pain?  I don't believe it's your method.  I'd look into the
 size issue *after* you address the incredible inefficiencies of a simple
 rsync.

I

You?

Or someone else?


I am quite happy to agree that your understanding and experience of storage
management is better than mine. But that's not the key question, in a
volunteer organisation. The questions I ask, repeating Jan's comments in
another message, are.

Nicholas Clark


Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Arthur Corliss

On Sat, 27 Mar 2010, Nicholas Clark wrote:


I

You?

Or someone else?


I am quite happy to agree that your understanding and experience of storage
management is better than mine. But that's not the key question, in a
volunteer organisation. The questions I ask, repeating Jan's comments in
another message, are.


Oh, I understand that fully.  And I'd be happy to lend some of my time.  But
you don't make people inclined to help when people are lobbing snarky
comments like "we'll wait breathlessly for you to do it."  The impression
I'm getting from most of you right now is that you're hell bent on solving
the problem your way, and no one is interested in exploring the technical
merits of other approaches.

Hell, I would even help with work towards your desired method *if* I thought
that was the consensus after a genuine exchange and consideration of ideas.
I definitely won't should it appear that we have some kind of elitist cabal
that will make their decision in isolation.  If that's going to be the case
then this should never have been raised on an open forum like the module
authors' list.

Quite frankly, at times some discussions on this list fail the concept of a
technical meritocracy, and tend towards an established aristocracy.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Ask Bjørn Hansen

On Mar 26, 2010, at 16:02, Arthur Corliss wrote:

 Why use rsync, then?  Why not have checkpointed logs on cpan with
 additions/removals logged by date so you can roll forward on the client,
 processing only those files?  It would be trivial to set up and a lot more
 efficient.


I find it curious that everyone who's actually involved in syncing the files or 
running mirror servers seems to think it generally sounds like a good idea and 
everyone who isn't says it's not worth the effort.

Anyway -- we have some other ideas for cutting down the number of files that we 
already agreed on but that just need an announcement (which I promised to write up, 
oops).  No, I'm not going to make Tim's mistake and suggest it here first.

Tim: Next time just get the paint in your preferred color.  :-)


 - ask



Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Arthur Corliss

On Sat, 27 Mar 2010, Jarkko Hietaniemi wrote:

 The time-honored tradition of many open source communities is to talk. And
 talk.  And talk.  The problem is that this solves nothing.  To do, does.

 You are free to decide to take this as a personal insult.


I didn't take it as an insult, I took it as what it was -- a dodge.  You
already have your minds made up and are not willing to evaluate options
on their merits.

Let's just be honest about what's going on here.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Jarkko Hietaniemi
 Oh, I understand that fully.  And I'd be happy to lend some of my time.  But
 you don't make people inclined to help when people are lobbing snarky
 comments like "we'll wait breathlessly for you to do it."


The time-honored tradition of many open source communities is to talk. 
And talk.  And talk.  The problem is that this solves nothing.  To do, does.


You are free to decide to take this as a personal insult.



Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Elaine Ashton

On Mar 27, 2010, at 2:52 PM, Arthur Corliss wrote:
 
 Don't be such an arrogant prick.  You guys made baseless assumptions about
 people's experience with storage management in an attempt to disregard their
 opinions.  That's being a dick by any metric.

Actually, I thought I was merely offering my opinion both as the sysadmin for 
the canonical CPAN mothership and as an end-user. If that makes me a prick, 
well, I suppose I should go out and buy one :) 

 I think I was quite explicit in saying that efficiencies should be pursued
 in multiple areas, but the predominant bitch I took away from your thread
 dealt with the burden of synchronizing mirrors.  What's the easiest way to
 address that pain?  I don't believe it's your method.  I'd look into the
 size issue *after* you address the incredible inefficiencies of a simple
 rsync.

And you're disregarding a considerable problem that rsync is a well-established 
tool for mirroring that is easy to use and works on a very wide range of 
platforms. Asking mirror ops to adopt a new tool for mirroring one mirror, when 
they often have several or more, likely won't be met with much enthusiasm and 
would create two tiers of CPAN mirrors, those using rsync and those not, which 
would not only complicate something which should remain simple but, again, 
doesn't address the size of the archive and the multitude of small files that 
are always a consideration no matter what you're serving them up with.

 Rsync is an excellent tool for smaller file sets.  I use it to sync my own
 mirrors, those mirrors are typically ~10k files.  Am I surprised that it
 doesn't scale when you're stat'ing every single file?  No.  Which is why
 alternatives should be considered.  A simple FTP client playing a
 transaction log forward is trivial.

FTP? It's 2010 and very few corp firewalls allow ftp in or out. I can't 
remember the last time I even used ftp come to think of it. I had to go through 
2 layers of network red tape just to get rsync for a particular system I wanted 
to mirror CPAN to at work. Asking for FTP would have been met with a big no or 
a cackle, depending on which of the nyetwork masters got the request first.

 Try doing a simple cost-benefit analysis.  What you guys are proposing will
 help.  But not as much as simpler alternatives.  Like replacing rsync with a
 perl script and modifying PAUSE to log the transactions.

How is replacing rsync, a standard and widely used tool, simpler for mirror 
ops? I suppose I don't understand the opposition to trimming off the obvious 
cruft on CPAN to lighten the load when BackPAN exists to archive them. There is 
already CPAN::Mini (which was created back when CPAN was an ever-so-tiny 1.2GB) 
so it's not as though lightening the load is a new idea or an unwelcome one.

e.


Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Arthur Corliss

On Sat, 27 Mar 2010, Elaine Ashton wrote:


 Actually, I thought I was merely offering my opinion both as the sysadmin for
 the canonical CPAN mothership and as an end-user. If that makes me a prick,
 well, I suppose I should go out and buy one :)


:-) You'll have to pardon my indiscriminate epithets.  The barbs are coming
from multiple directions.  My point still stands, however.  Your experience,
however worthy, has zero bearing on whether or not my experience is
just as worthy.  Even more so when you guys have zero clue who you're talking
to.  And you shouldn't have to know.  I would have thought simple communal 
and professional courtesy would be extended and all points considered in 
earnest.  Which does not appear to be the case.



 And you're disregarding a considerable problem that rsync is a well-established
 tool for mirroring that is easy to use and works on a very wide range of
 platforms. Asking mirror ops to adopt a new tool for mirroring one mirror, when
 they often have several or more, likely won't be met with much enthusiasm and
 would create two tiers of CPAN mirrors, those using rsync and those not, which
 would not only complicate something which should remain simple but, again,
 doesn't address the size of the archive and the multitude of small files that
 are always a consideration no matter what you're serving them up with.


Ah, you're one of them.  All objects look like nails when all you have is a
hammer, eh?  Rsync is a good tool, but like Perl, it isn't the perfect tool
for all tasks.  You've obviously exceeded what the tool was designed for,
it's only logical to look for (or write) another tool.  Ironically, what I'm 
suggesting is so basic that rsync can be replaced by a script which will 
likely run on every mirror out there with no more fuss than rsync.



 FTP? It's 2010 and very few corp firewalls allow ftp in or out. I can't
 remember the last time I even used ftp come to think of it. I had to go through
 2 layers of network red tape just to get rsync for a particular system I wanted
 to mirror CPAN to at work. Asking for FTP would have been met with a big no or
 a cackle, depending on which of the nyetwork masters got the request first.


Sounds like you may be hamstrung by your own bureaucracy, but that's rarely
the case in most of the places I've worked.  Not to mention that between
passive mode FTP or even using an HTTP proxy (most of which support FTP
requests) what I'm proposing is relatively painless, simple, and easy to
secure.  This concern I suspect is a non-issue for most mirror operators.
Even if it was, allow them to pull it via HTTP for all I care.  Either one
is significantly more efficient than rsync.


 How is replacing rsync, a standard and widely used tool, simpler for mirror
 ops? I suppose I don't understand the opposition to trimming off the obvious
 cruft on CPAN to lighten the load when BackPAN exists to archive them. There is
 already CPAN::Mini (which was created back when CPAN was an ever-so-tiny 1.2GB)
 so it's not as though lightening the load is a new idea or an unwelcome one.


I'm not opposed to trimming the cruft, but I am opposed to ignorant
knee-jerk reactions bereft of any empirical data (or at least you haven't
shared).  The cruft, while being cruft, isn't inherently evil.  You have a
basic I/O and state problem.  And the I/O generated is predominantly caused
by rsync trying to (re)assemble state on the file set, *per* request.  More
appallingly, most of that state image being generated is state that hasn't
changed in quite a while.  Literally years in many cases.  So why are we
wasting cycles and I/O performing massively redundant work?

That's why having PAUSE implement a transaction log, and perhaps a cron job
on the master server doing daily checkpointed file manifests is so much more
efficient.  An in-sync mirror only needs to download the latest transaction
logs and play them forward (delete certain files, download others, etc).
And, gee, just about every author on the list could write *that* sync agent
in an evening.  Out-of-sync mirrors can start by working off the checkpoint
manifest, get what's missing, and roll forward.

What you're overlooking is that CPAN has grown, and will continue to grow.  Even
if you remove the cruft now at some point it might grow to the same size 
just with fresh files.  When that happens, you're right back where you are 
now.  Rsync can't cut it, it wasn't designed for this.


Whether you like it or not, even on a pared down CPAN rsync is easily your
most inefficient process on the server.  If you're not willing to optimize
that, then you really don't care about optimization at all.

--Arthur Corliss
  Live Free or Die
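
The transaction-log idea in the message above can be sketched roughly as follows, in Python for brevity. This is only an illustration under stated assumptions: the entry format `(sequence number, operation, path)`, the operation names `add`/`del`, and the function names `replay` and `catch_up` are all hypothetical, and no such log is published by PAUSE as far as this thread says. The functions only compute which actions a mirror would need to take; fetching and deleting files is left to the caller.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical log entry: (sequence number, operation, path).
# The thread only says additions/removals would be logged by date.
Entry = Tuple[int, str, str]
Action = Tuple[str, str]  # ("add" | "del", path)


def replay(last_seq: int, log: Iterable[Entry]) -> Tuple[int, List[Action]]:
    """Return the new sequence number and the actions logged after last_seq."""
    actions: List[Action] = []
    seq = last_seq
    for n, op, path in sorted(log, key=lambda entry: entry[0]):
        if n <= last_seq:
            continue  # already applied on this mirror
        if op not in ("add", "del"):
            raise ValueError(f"unknown operation: {op!r}")
        actions.append((op, path))
        seq = n
    return seq, actions


def catch_up(local: Set[str], manifest: Set[str], checkpoint_seq: int,
             log: Iterable[Entry]) -> Tuple[int, List[Action]]:
    """For an out-of-sync mirror: align with the checkpoint manifest first,
    then roll forward through the entries logged after the checkpoint."""
    actions: List[Action] = [("del", p) for p in sorted(local - manifest)]
    actions += [("add", p) for p in sorted(manifest - local)]
    seq, later = replay(checkpoint_seq, log)
    return seq, actions + later
```

An in-sync mirror would call `replay` with the last sequence number it applied; a stale mirror would call `catch_up` with the most recent checkpoint. Neither needs to stat the whole file set, which is the point being argued above.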


Re: Trimming the CPAN - Automatic Purging

2010-03-26 Thread Graham Barr

On Mar 25, 2010, at 8:42 AM, Barbie wrote:
 
 Lastly I would also personally be annoyed if only the latest versions
 were available, as I often make great use of the diff tool on
 search.cpan.org. Having only the latest version renders that great tool
 redundant :(

I use that too :-) and it is very annoying that some authors automatically
delete previous releases when they upload a new one.

Graham.



Trimming the CPAN - Automatic Purging

2010-03-26 Thread Tim Bunce
Currently on PAUSE you have to explicitly delete old uploads.

How about changing it so you have to explicitly KEEP old uploads
that appear to have been superseded?

PAUSE already has a mechanism to delete files at some future point in
time. That's currently only used as part of a safety/sanity check to
delay deletions that were manually invoked.

I envisage PAUSE having a set of rules it would apply monthly, say,
to automatically select files for purging.

The rules might look something like this:

File does not have deletion date set, and
File is older than 3 months, and
File has a later upload
- in the same directory
- with the same major version
- with a higher minor version
- which is also more than 3 months old

(Naturally these are just suggestions. Let's not bikeshed the fine
details yet. It's the approach we need to discuss first.)

Files selected in this way would be scheduled to be deleted in a month
and an email would be sent to the authors, just as if they'd selected
the files for deletion via PAUSE.

All that's needed, in addition to the above script, is a way for authors
to indicate that a particular file shouldn't be purged. The database
could use a far-future date for that which the UI could present as
a "do not purge" checkbox against the file.

Tim.
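
A minimal sketch of how the selection rules above might be expressed, written in Python for brevity. The `Upload` record and its field names, the `dist` field (used here so that only uploads of the same distribution are compared), the 90-day approximation of "3 months" and the `DO_NOT_PURGE` sentinel are all assumptions made for illustration; none of them reflect PAUSE's actual schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

THREE_MONTHS = timedelta(days=90)   # assumption: "3 months" taken as 90 days
DO_NOT_PURGE = date(9999, 12, 31)   # assumption: far-future date used as a pin


@dataclass
class Upload:
    directory: str                  # author directory (hypothetical field)
    dist: str                       # distribution name (hypothetical field)
    major: int
    minor: int
    uploaded: date
    delete_on: Optional[date] = None  # None means no deletion date is set


def purge_candidates(uploads: List[Upload], today: date) -> List[Upload]:
    """Select uploads that match the suggested rules."""
    cutoff = today - THREE_MONTHS
    selected = []
    for old in uploads:
        if old.delete_on is not None:
            continue  # already scheduled for deletion, or pinned
        if old.uploaded >= cutoff:
            continue  # not older than 3 months
        superseded = any(
            new.directory == old.directory
            and new.dist == old.dist
            and new.major == old.major
            and new.minor > old.minor
            and new.uploaded < cutoff   # the newer upload is itself > 3 months old
            for new in uploads
        )
        if superseded:
            selected.append(old)
    return selected
```

Because a pinned file simply carries a far-future `delete_on`, the first check skips it without any extra flag, which is the "do not purge" mechanism suggested above.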


Re: Trimming the CPAN - Automatic Purging

2010-03-26 Thread Ask Bjørn Hansen

On Mar 25, 2010, at 4:12, Tim Bunce wrote:

 Currently on PAUSE you have to explicitly delete old uploads.
 
 How about changing it so you have to explicitly KEEP old uploads
 that appear to have been superseded?

I like it.

I agree with Jarkko that there should be a way to pin some versions and the 
configuration should be more than N newer releases or some such.

I think it should be on by default though.  Older than 3 (or 6?) months and at 
least 2 or 3 (or more?) newer releases or some such.

For most authors this won't change anything -- but it'll help those who 
unhelpfully _never_ delete anything.

On Search CPAN maybe BackPAN could be used to pull in older versions for diffs 
etc...


  - ask
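
The variant suggested in this message (old enough, and superseded by at least N newer releases, with a way to pin versions) could be sketched like this. It is only an illustration: the function name `reapable`, the `(name, upload date)` tuples and the default thresholds are assumptions, chosen from the ranges floated above.

```python
from datetime import date, timedelta
from typing import FrozenSet, List, Tuple


def reapable(releases: List[Tuple[str, date]], today: date,
             min_age_days: int = 90, min_newer: int = 2,
             pinned: FrozenSet[str] = frozenset()) -> List[str]:
    """Names of releases of one distribution that are older than
    min_age_days and have at least min_newer later uploads."""
    ordered = sorted(releases, key=lambda r: r[1])  # oldest first
    cutoff = today - timedelta(days=min_age_days)
    result = []
    for i, (name, uploaded) in enumerate(ordered):
        newer = len(ordered) - i - 1  # number of later uploads
        if name in pinned:
            continue  # the author asked to keep this one
        if uploaded < cutoff and newer >= min_newer:
            result.append(name)
    return result
```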

Re: Trimming the CPAN - Automatic Purging

2010-03-26 Thread Ask Bjørn Hansen

On Mar 25, 2010, at 8:38, Andy Armstrong wrote:

 I like that solution better
 
 
 [snip]
 
 But solution to what? Are we convinced there's actually a problem here?

CPAN has almost 200k files.  www.cpan.org says there are 17627 modules.  
rsyncing a gazillion files doesn't work that well (on the server).  Helping 
authors remember to delete things that are now irrelevant from the main CPAN 
system will make it easier to run mirrors and keep them fresh.


 - ask

Re: Trimming the CPAN - Automatic Purging

2010-03-26 Thread Chris Nandor
What Jarkko said.

On Mar 25, 2010, at 08:00, Jarkko Hietaniemi wrote:

 I have one case where the v1 and v2 of a module are simply
 incompatible, but v1 still works, and unless the users have a
 compelling reason, they won't migrate.  Pulling the rug from under
 them would be quite unsportsmanlike.
 
 Deletion should be opt-in, and there should be a way to pin some
 releases as unreapable.  And warning emails (yes, some email addresses
 are blackholes) to the author well in advance: "your module X version
 Y will be deleted as you requested in Z weeks because there are P
 newer releases ..."
 
 -- 
 There is this special biologist word we use for 'stable'. It is
 'dead'. -- Jack Cohen


-- 
Chris Nandor pu...@pobox.com http://pudge.net/
Slashdot / Geeknet   pu...@slashdot.org   http://slashdot.org/



Re: Trimming the CPAN - Automatic Purging

2010-03-26 Thread Arthur Corliss

On Fri, 26 Mar 2010, Ask Bjørn Hansen wrote:


 I find it curious that everyone who's actually involved in syncing the files or running
 mirror servers seems to think it generally sounds like a good idea, and everyone who
 isn't says it's not worth the effort.


Sure, I don't run a CPAN mirror, but I do manage many, many terabytes of
storage as part of my day job.  I think it's a tad presumptuous to disregard
input just because we're not in your inner sanctum.  As I mentioned in a
follow up e-mail:  this is simply a matter of selecting the correct problem
domain.  I believe that streamlining the mirroring process will provide
greater gains for less effort.

That's not to say that pursuing other efficiencies isn't worthwhile, just
that you need to prioritize.

But what the hell do I know.  I don't run a *CPAN* mirror, so I must be
freaking clueless...

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-03-26 Thread Elaine Ashton

On Mar 26, 2010, at 8:23 PM, Arthur Corliss wrote:
 
 Sure, I don't run a CPAN mirror, but I do manage many, many terabytes of
 storage as part of my day job.  I think it's a tad presumptuous to disregard
 input just because we're not in your inner sanctum.  As I mentioned in a
 follow up e-mail:  this is simply a matter of selecting the correct problem
 domain.  I believe that streamlining the mirroring process will provide
 greater gains for less effort.
 
 That's not to say that pursuing other efficiencies isn't worthwhile, just
 that you need to prioritize.
 
 But what the hell do I know.  I don't run a *CPAN* mirror, so I must be
 freaking clueless...

Oh, don't be such a drama queen. I rebuilt and helped run nic.funet.fi for 2 
years which is the canonical mirror for a large number of mirrors and the 
perspective of having a few terabytes spinning in storage changes quite 
dramatically when you are actually serving a few terabytes to thousands of 
clients. CPAN grew to be quite a burden on the site not only because of the 
high demand, but also because of the multitude of small files and I'm sure 
other mirrors feel similarly burdened. 

The sort of pruning Tim brought up has long been an idea, but with the current 
and growing size of the archive, something does need to be done to alleviate 
the burden not only on the canonical mirrors, but also on the random folks who 
want to grab a local mirror for themselves. In my present work environment, 
12GB isn't a lot of disk space, but it's a lot considering I don't need to
install perl modules daily and the vast majority of it I'll likely never use. 
It would be a kindness to both the mirror operators and to the end-users to 
trim it down to a manageable size. 

As for efficiency, rsync remains a good tool for the job that works on nearly 
every platform which is a rather tall order to match with any other solution. 
Relegating the cruft to BackPAN to make the current CPAN slimmer and less 
demanding on all fronts is an idea that would be welcomed by more than just 
mirror ops.

The only snag I can foresee in trimming back on the abundance of modules is the
case where some modules have version requirements for other modules where it 
will barf with a mismatch/newer version of the required module (I bumped into 
this recently but can't remember exactly which module it was) but I think it's 
rare and the practise should be discouraged.

e.


Re: Trimming the CPAN - Automatic Purging

2010-03-25 Thread Barbie
On Thu, Mar 25, 2010 at 11:12:32AM +, Tim Bunce wrote:
 Currently on PAUSE you have to explicitly delete old uploads.

Which often is a good thing. While BACKPAN exists, it isn't somewhere
that many go to look for old distributions. For me and probably others,
BACKPAN-only distributions are ones that have been specifically marked
by the maintainers as obsolete, badly broken or similar.

Automatic deletes from CPAN would change that.

There are many distributions on CPAN whose older versions work on a
particular perl/OS but whose more recent ones don't. Latest isn't necessarily
the greatest.

If you are going to perform this then it should really feed off the CPAN
Testers to know if a specific release has been marked as being the
latest working release for a particular perl/os.

I would also suggest extending the timeframe considerably to perhaps 3
or maybe 5 years.

Lastly I would also personally be annoyed if only the latest versions
were available, as I often make great use of the diff tool on
search.cpan.org. Having only the latest version renders that great tool
redundant :(

 Files selected in this way would be scheduled to be deleted in a month
 and an email would be sent to the authors, just as if they'd selected
 the files for deletion via PAUSE.

There are already many authors who have non-responding email addresses
(I will get around to publicising that list at some point), so some
will likely disappear down a blackhole. What if you're about to delete a
set of distributions that should really be kept available? No one would
be listening to know that it should still be kept.

I would prefer a suggestion email to authors to delete, rather than an
email telling them that their distributions will be deleted unless they
do something.

Cheers,
Barbie.
-- 
Birmingham Perl Mongers http://birmingham.pm.org
Memoirs Of A Roadie http://barbie.missbarbell.co.uk
CPAN Testers Blog http://blog.cpantesters.org
YAPC Conference Surveys http://yapc-surveys.org




Re: Trimming the CPAN - Automatic Purging

2010-03-25 Thread Andy Armstrong
On 25 Mar 2010, at 15:36, Chris Nandor wrote:
 I like that solution better


[snip]

But solution to what? Are we convinced there's actually a problem here?

-- 
Andy Armstrong, Hexten





Re: Trimming the CPAN - Automatic Purging

2010-03-25 Thread Andy Lester

On Mar 25, 2010, at 10:38 AM, Andy Armstrong wrote:

 But solution to what? Are we convinced there's actually a problem here?

The first two rules of optimization club:

1) You do not optimize.
2) You do not optimize without measuring.

As soon as someone can explain specifics of the problem, including magnitude, I 
can begin to be concerned.

xoxo,
Andy

--
Andy Lester = a...@petdance.com = www.theworkinggeek.com = AIM:petdance