Re: Trimming the CPAN - Automatic Purging

2010-04-02 Thread Arthur Corliss

On Fri, 2 Apr 2010, Ask Bjørn Hansen wrote:



On Apr 2, 2010, at 1:50, Arthur Corliss wrote:


And my assertion has been that the excessive stats by the server are a bigger
impediment to synchronization than the inode count.


Well, then one of us doesn't understand how file systems etc. work.  :-)


Indeed.  If you're running UFS perhaps you might have a gripe.  But with
many filesystems in use supporting dynamic allocation groups with the inode
data stored near the actual data blocks, along with b-tree indexing, this
isn't as much of an issue for many of us.

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-04-01 Thread Arthur Corliss

On Fri, 2 Apr 2010, Ask Bjørn Hansen wrote:


I can't believe I'm doing this, but ...


:-) All for entertainment's sake...


The main point here is that we can't use 20 inodes per distribution.  It's Just 
Nuts.   Sure, it's only something like 400k files/inodes now - but at the rate 
it's going it'll be a lot more soon enough.


That's a problem, but not likely the biggest drag on server I/O you're
suffering.  Might that be, ahem, rsync?


That reply doesn't even make sense.


Then you've ignored most of this thread.  Inode counts themselves aren't
indicative of anything.  It's the I/O access patterns that are.  And my
assertion has been that the excessive stats by the server are a bigger
impediment to synchronization than the inode count.
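
To make the claim about stats concrete, here is a minimal sketch of a stateless tree scan, assuming a sync tool that keeps no memory between runs.  It is not rsync's actual implementation; the function names and the throwaway test tree are invented for illustration.  It only shows that the number of stat calls scales with files times syncs, not with the number of changes (how much of that reaches the disk depends on the kernel cache).

```python
import os
import tempfile


def scan_tree(root):
    """Walk root and stat every file, as a stateless sync must on each run.

    Returns (snapshot, stat_calls): snapshot maps each relative path to
    (size, mtime); stat_calls counts the explicit per-file stat operations.
    """
    snapshot = {}
    stat_calls = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            stat_calls += 1
            snapshot[os.path.relpath(full, root)] = (st.st_size, st.st_mtime)
    return snapshot, stat_calls


def demo_scan(num_files=50, num_clients=3):
    """Build a throwaway tree (hypothetical file names), scan it once per client."""
    with tempfile.TemporaryDirectory() as root:
        for i in range(num_files):
            with open(os.path.join(root, f"dist-{i}.tar.gz"), "w") as fh:
                fh.write("x")
        total = 0
        for _ in range(num_clients):
            _snapshot, calls = scan_tree(root)
            total += calls
        return total
```

Nothing in the tree changes between the simulated client runs, yet every run pays the full per-file cost again.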


You're right, I'm not arguing the need for the cruft.  I've only pointed out
the obvious reality that trimming files only postpones the I/O management
issues that at some time are likely going to have to be addressed, anyway.
And that you'll get less bang for the buck (or man hour) by treating the
symptoms, not the disease.

For the record:  if that's what you want to do, have at it.  Let's just not
be disingenuous about the fact that we're abrogating our responsibilities as
technologists by refusing to address the real problems and weaknesses of the
platform.


You are confusing we, I and you again.


Perhaps.




Yes, I (and I'm guessing everyone else who has thought about it for more than, say, 5 
seconds) agree that having rsync remember the file tree to save the disk IO for each sync 
sounds like an obvious solution.

But reality is more complicated.  If it were such an obviously good solution someone would 
have done it by now.  (For starters, consider this question: What is the kernel 
cache?).


It hasn't been done because it's outside the scope of design for rsync.
It's meant to sync arbitrary filesets in which many, if not all, changes are
made out of band.  It's decidedly non-trivial to implement in that mode
unless you're willing to accept a certain window in which your database may
be out of date.

But, in a situation like PAUSE, where the avenues by which files can be
introduced into the file sets are controlled, it does become trivial.  It's
the gatekeeper, it knows who's been in or out.
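
As a rough illustration of the gatekeeper idea, the sketch below assumes the upload service appends one record to a journal every time it adds or deletes a file, so nothing changes out of band.  The class name, record fields and sequence numbering are all hypothetical; this is not PAUSE's actual mechanism.

```python
import json
import time


class UploadJournal:
    """Hypothetical append-only journal kept by the upload gatekeeper."""

    def __init__(self):
        self.entries = []

    def record(self, action, path, when=None):
        """Append one change; action is 'add' or 'delete'."""
        if action not in ("add", "delete"):
            raise ValueError(f"unknown action: {action}")
        entry = {
            "seq": len(self.entries) + 1,  # monotonically increasing
            "time": when if when is not None else time.time(),
            "action": action,
            "path": path,
        }
        self.entries.append(entry)
        return entry

    def since(self, seq):
        """Return every entry recorded after sequence number seq."""
        return [e for e in self.entries if e["seq"] > seq]

    def dump(self):
        """Serialize the journal as one JSON object per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.entries)
```

Because every change passes through `record`, a mirror that remembers the last sequence number it saw can ask for `since(seq)` without the server rescanning the tree.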


Andreas' solution is much more sensible -- and as has been pointed out before 
we DO USE THAT; but the problem here is not with clients who are interested 
enough to do something special and dedicate resources to their CPAN mirroring.


By all means, I'm not opposed to any solution that actually addresses the
problem.  I don't agree that it would be the fastest path to implementation,
but there's no question that File::Rsync::Mirror::Recent would help things.
I'd support (and help) that goal.

My objections are more properly directed to those stuck on just deleting
files from the tree.

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-03-29 Thread Arthur Corliss

On Sun, 28 Mar 2010, Andy Armstrong wrote:


We're nearly there if A == a CPAN::Mini style mirror, B == the current mirror 
pruned and C == backpan.

So the actions to make that happen are:

* give the current clients specific support for this
* generate a master mini mirror that other mini mirrors can pull from.
* prune

If we agree that this is a good solution I'm happy to do some work on it - I 
could host the mini master and I'd be happy to send Andreas a patch for CPAN.pm 
to support this scheme.


It should be pointed out that this is only viable under the assumption that
you have a separate pool of servers for each tier.  Again, this is just
load balancing, not load optimization.

That said, if you have the volunteers, then why not.  Perhaps I can offer a
system to support mirroring up here in Alaska.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Arthur Corliss

On Sun, 28 Mar 2010, Ask Bjørn Hansen wrote:


You are misunderstanding the problem of changing the mirroring mechanism.


I am not misunderstanding, I'm just willing to accept the reality for what
it is.  Rsync does not scale.  Period.


Making new software is nice and good -- Andreas already has something that's 
better for the PAUSE data.


<G>  That makes my point all the more compelling, then.  Some of the work
has already been done.


Getting 1000s of mirrors to use your software (rather than rsync, which they use 
for ALL OTHER mirrors) -- not so easy.


Perhaps, but it's also possible that it might not be as bad as you think,
either.  You have a strong case to be made that the entire ecosystem
benefits from making this change (particularly in a tiered mirroring
environment), and I'd be surprised if the majority of the mirror operators 
aren't sympathetic and cooperative.  As a sys-admin I watch my SAR reports
like a hawk; I'm sure they're no different.

And that's not to say you have to eliminate rsync.  If you can get half of
them to stop, you'll still have some significant long term gains.

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-03-28 Thread Arthur Corliss

On Sun, 28 Mar 2010, Elaine Ashton wrote:


I'm not sending any barbs, only my reasonable opinion borne from years on the 
reality-based operations side of this equation. As for who you are, it doesn't 
matter as I work daily with those who wrote, and continue to write, large 
chunks of operating systems, X, etc., and though their legend may precede them 
when it comes to my having to implement what works fabulously in their 
imagination, I do my best to bring them back to the grim reality that is 
operations. It's a frequent problem of engineers and those of us stuck having 
to live with and fix their grand ideas. Lofty goals usually die somewhere 
between dreams and production.


Ah, let the chest thumping begin.  My point is that regardless of where the 
idea comes from if it comes from a solid rationale it should be given 
consideration.  And to date I have yet to see any one of you refute my 
technical understanding of the problem, only my political understanding of 
the problem.  I/O is the issue, and it is driven predominantly by rsync.



Well, you'll have to forgive those who mock your naïveté. If it were so basic 
and trivial to replace rsync, it would have been done several times over by now, 
as its limitations are well known to all who use it on any large scale. 
However, it is a well-known, well-used, multi-platform and time-tested tool 
that will not be unseated very easily without good reason, and a reason that 
reads something along the lines of improving performance on an archive that 
should have been trimmed back a bit is not a compelling reason for adoption.


Naivete?  Again:  show me where my assertions about the primary root of your
problem are incorrect.  Show me how pruning CPAN isn't a temporary band-aid
that fails to address a fundamental weakness in the syncing process.  You
haven't.  You can try to dress it up any way you like in an effort to
discredit me, but until you do so based on the facts, you have nothing.

Rsync is a good tool, but for different use case scenarios.


And this is a good point to make, yes, it will continue to grow and I know that 
the current manager(s) of nic.funet.fi have commented on the burden it presents 
to the system which is also home to a number of other mirrors. You cannot 
assume that the generosity and the resources of the mirror ops are limitless 
and finding out where that limit lies will come too late to make amends.


<G> And you make my point for me.  I'm sure he would love to find a more
efficient use of his I/O.  I assume nothing, I only allow that you'll find
more interest than you assume in managing I/O.  Nor does what I'm proposing
preclude the intractable from continuing to use rsync.  Given that rsync is
your driver of the I/O problem, taking away any significant percentage of the
problem will have the largest dividends.


Pruning back the archive is a good compromise until and unless another solution 
can be done that will not bother the mirror ops terribly much in terms of real 
work.


At least you admit you're only treating the symptoms now, not the disease
itself.  Sure, it will buy you some time, but there'll also be some
political problems to work through which will likely burn as much if not
more manhours than just treating the disease.  And in the end time runs
out and the problem remains.

Look, I don't care if you guys decide against it, but let's be honest about
the compromises you're making.  Hell, pruning isn't even a compromise, it's
not a solution, it's only a delaying tactic.

--Arthur Corliss
  Live Free or Die

Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Arthur Corliss

On Sat, 27 Mar 2010, Nicholas Clark wrote:


I

You?

Or someone else?


I am quite happy to agree that your understanding and experience of storage
management is better than mine. But that's not the key question, in a
volunteer organisation. The questions I ask, repeating Jan's comments in
another message, are.


Oh, I understand that fully.  And I'd be happy to lend some of my time.  But
you don't make people inclined to help when people are lobbing snarky
comments like "we'll wait breathlessly for you to do it."  The impression
I'm getting from most of you right now is that you're hell bent on solving
the problem your way, and no one is interested in exploring the technical
merits of other approaches.

Hell, I would even help with work towards your desired method *if* I thought
that was the consensus after a genuine exchange and consideration of ideas.
I definitely won't should it appear that we have some kind of elitist cabal
that will make their decision in isolation.  If that's going to be the case
then this should never have been raised on an open forum like the module
authors' list.

Quite frankly, at times some discussions on this list fail the concept of a
technical meritocracy, and tend towards an established aristocracy.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Arthur Corliss

On Sat, 27 Mar 2010, Jarkko Hietaniemi wrote:

The time-honored tradition of many open source communities is to talk. And 
talk.  And talk.  The problem is that this solves nothing.  To do, does.


You are free to decide to take this as a personal insult.


I didn't take it as an insult, I took it as what it was -- a dodge.  You
already have your minds made up and are not willing to evaluate options
on their merits.

Let's just be honest about what's going on here.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-27 Thread Arthur Corliss

On Sat, 27 Mar 2010, Elaine Ashton wrote:


Actually, I thought I was merely offering my opinion both as the sysadmin for 
the canonical CPAN mothership and as an end-user. If that makes me a prick, 
well, I suppose I should go out and buy one :)


:-) You'll have to pardon my indiscriminate epithets.  The barbs are coming
from multiple directions.  My point still stands, however.  Your experience,
however worthy, has zero bearing on whether or not my experience is
just as worthy.  Even more so when you guys have zero clue who you're talking
to.  And you shouldn't have to know.  I would have thought simple communal 
and professional courtesy would be extended and all points considered in 
earnest.  Which does not appear to be the case.



And you're disregarding a considerable problem that rsync is a well-established 
tool for mirroring that is easy to use and works on a very wide range of 
platforms. Asking mirror ops to adopt a new tool for mirroring one mirror, when 
they often have several or more, likely won't be met with much enthusiasm and 
would create two tiers of CPAN mirrors, those using rsync and those not, which 
would not only complicate something which should remain simple but, again, 
doesn't address the size of the archive and the multitude of small files that 
are always a consideration no matter what you're serving them up with.


Ah, you're one of them.  All objects look like nails when all you have is a
hammer, eh?  Rsync is a good tool, but like Perl, it isn't the perfect tool
for all tasks.  You've obviously exceeded what the tool was designed for,
it's only logical to look for (or write) another tool.  Ironically, what I'm 
suggesting is so basic that rsync can be replaced by a script which will 
likely run on every mirror out there with no more fuss than rsync.



FTP? It's 2010 and very few corp firewalls allow ftp in or out. I can't 
remember the last time I even used ftp come to think of it. I had to go through 
2 layers of network red tape just to get rsync for a particular system I wanted 
to mirror CPAN to at work. Asking for FTP would have been met with a big no or 
a cackle, depending on which of the nyetwork masters got the request first.


Sounds like you may be hamstrung by your own bureaucracy, but that's rarely
the case in most of the places I've worked.  Not to mention that between
passive mode FTP or even using an HTTP proxy (most of which support FTP
requests) what I'm proposing is relatively painless, simple, and easy to
secure.  This concern I suspect is a non-issue for most mirror operators.
Even if it were, allow them to pull it via HTTP for all I care.  Either one
is significantly more efficient than rsync.


How is replacing rsync, a standard and widely used tool, simpler for mirror 
ops? I suppose I don't understand the opposition to trimming off the obvious 
cruft on CPAN to lighten the load when BackPAN exists to archive them. There is 
already CPAN::Mini (which was created back when CPAN was an ever-so-tiny 1.2GB) 
so it's not as though lightening the load is a new idea or an unwelcome one.


I'm not opposed to trimming the cruft, but I am opposed to ignorant
knee-jerk reactions bereft of any empirical data (or at least none that
you've shared).  The cruft, while being cruft, isn't inherently evil.  You
have a basic I/O and state problem.  And the I/O generated is predominantly
caused by rsync trying to (re)assemble state on the file set, *per* request.
More appallingly, most of that state image being generated is state that
hasn't changed in quite a while.  Literally years in many cases.  So why are
we wasting cycles and I/O performing massively redundant work?

That's why having PAUSE implement a transaction log, and perhaps a cron job
on the master server doing daily checkpointed file manifests, is so much more
efficient.  An in-sync mirror only needs to download the latest transaction
logs and play them forward (delete certain files, download others, etc).
And, gee, just about every author on the list could write *that* sync agent
in an evening.  Out-of-sync mirrors can start by working off the checkpoint
manifest, getting what's missing, and rolling forward.
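
The mirror side of that scheme can be sketched roughly as follows.  This is a minimal illustration under stated assumptions, not an existing tool: the log record layout (`seq`, `action`, `path`), the function names, and the idea that a checkpoint manifest carries the sequence number it was taken at are all hypothetical.  Paths are treated as immutable, so a file replaced under the same name would need a checksum field that is not modelled here.

```python
def roll_forward(file_set, log_entries, last_seq):
    """Apply log entries newer than last_seq to a set of paths.

    Returns (resulting_paths, new_last_seq).
    """
    files = set(file_set)
    for entry in sorted(log_entries, key=lambda e: e["seq"]):
        if entry["seq"] <= last_seq:
            continue  # already applied in an earlier sync
        if entry["action"] == "add":
            files.add(entry["path"])
        elif entry["action"] == "delete":
            files.discard(entry["path"])
        else:
            raise ValueError(f"unknown action: {entry['action']}")
        last_seq = entry["seq"]
    return files, last_seq


def plan_sync(local_files, target_files):
    """Return (paths_to_download, paths_to_delete) as sorted lists."""
    local, target = set(local_files), set(target_files)
    return sorted(target - local), sorted(local - target)


def sync_in_sync_mirror(local_files, last_seq, log_entries):
    """In-sync mirror: play the new log entries forward from local state."""
    target, seq = roll_forward(local_files, log_entries, last_seq)
    downloads, deletions = plan_sync(local_files, target)
    return downloads, deletions, seq


def sync_stale_mirror(local_files, manifest, manifest_seq, log_entries):
    """Out-of-sync mirror: start from the checkpoint manifest, then roll forward."""
    target, seq = roll_forward(manifest, log_entries, manifest_seq)
    downloads, deletions = plan_sync(local_files, target)
    return downloads, deletions, seq
```

Neither path requires the master to walk its own tree per request; the work is proportional to the number of log entries since the mirror's last sync, or to one manifest comparison for a stale mirror.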

What you're overlooking is that CPAN has grown, and will continue to grow.
Even if you remove the cruft now, at some point it might grow to the same
size, just with fresh files.  When that happens, you're right back where you
are now.  Rsync can't cut it, it wasn't designed for this.


Whether you like it or not, even on a pared down CPAN rsync is easily your
most inefficient process on the server.  If you're not willing to optimize
that, then you really don't care about optimization at all.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-26 Thread Arthur Corliss

On Fri, 26 Mar 2010, Andy Lester wrote:


Absolutely.  This factual info would ideally look like this:

Of the 17,000 distros on CPAN, there are 8,000 that have versions more than a year 
older than the most recent one.  If those distros with versions more than a year out of 
date were purged, the number of files would decrease from 200,000 to 120,000.  This would 
save 7GB out of the 12GB that a full CPAN mirror takes now.  Removing that 7GB would mean 
Benefit X to mirror owners.
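
Figures of that kind could be gathered with something like the sketch below.  It assumes a listing of (distribution name, release date, size in bytes) tuples is available from somewhere; the function name, the input layout and the 365-day cutoff are illustrative assumptions, and no real CPAN numbers are implied.

```python
from collections import defaultdict
from datetime import timedelta


def purge_report(files, max_age=timedelta(days=365)):
    """Summarize what purging releases older than the newest by max_age would save.

    files: iterable of (dist_name, release_date, size_bytes) tuples.
    """
    by_dist = defaultdict(list)
    for dist, released, size in files:
        by_dist[dist].append((released, size))

    report = {
        "dists_total": len(by_dist),
        "dists_affected": 0,
        "files_total": 0,
        "files_purged": 0,
        "bytes_total": 0,
        "bytes_purged": 0,
    }
    for releases in by_dist.values():
        newest = max(released for released, _size in releases)
        # A release is stale if it trails the newest one by more than max_age.
        stale = [size for released, size in releases if newest - released > max_age]
        report["files_total"] += len(releases)
        report["bytes_total"] += sum(size for _released, size in releases)
        report["files_purged"] += len(stale)
        report["bytes_purged"] += sum(stale)
        if stale:
            report["dists_affected"] += 1
    return report
```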

Without that, how can module authors be bothered to care?


If you don't mind me interjecting, I still can't be bothered to care.  We
have basically a 12GB data set, and we're worried about that?  I see that as
a small barrier to bringing on new mirrors on constrained pipes, but
ultimately that's not that big a deal.  Hell, there are single versions of
some Linux distros that are bigger than that.

End sum:  I personally don't think this is the most pressing issue facing
CPAN.  Just issue a best practices guide to all the module authors (or
include it as on-line documentation in PAUSE) and be done with it.

--Arthur Corliss
  Live Free or Die


Re: Trimming the CPAN - Automatic Purging

2010-03-26 Thread Arthur Corliss

On Fri, 26 Mar 2010, Ask Bjørn Hansen wrote:


I find it curious that everyone who's actually involved in syncing the files or running 
mirror servers seems to think it generally sounds like a good idea and everyone who 
isn't says it's not worth the effort.


Sure, I don't run a CPAN mirror, but I do manage many, many terabytes of
storage as part of my day job.  I think it's a tad presumptuous to disregard
input just because we're not in your inner sanctum.  As I mentioned in a
follow up e-mail:  this is simply a matter of selecting the correct problem
domain.  I believe that streamlining the mirroring process will provide
greater gains for less effort.

That's not to say that pursuing other efficiencies isn't worthwhile, just
that you need to prioritize.

But what the hell do I know.  I don't run a *CPAN* mirror, so I must be
freaking clueless...

--Arthur Corliss
  Live Free or Die