Re: Trimming the CPAN - Automatic Purging
It hasn't been done because it's outside of the scope of design for rsync. It's meant to sync arbitrary filesets in which many, if not all, changes are made out of band. It's decidedly non-trivial to implement in that mode unless you're willing to accept a certain window in which your database may be out of date. But, in a situation like PAUSE, where the avenues in which files can be introduced into the file sets are controlled, it does become trivial. It's the gatekeeper, it knows who's been in or out.

So the requirements for the Solution To The Problem Which Solves A More General Problem Than The Immediate Problem And Will Therefore Make Whoever Sets It Up A Hero include a replacement for the current mirroring technology stack that is tailored to mirroring distributions, possibly including on-demand caching and expiration, and that is trivial to install -- something like

    perl -MCPAN -e 'install STTPWSAMGPTTIPAWTMWSIUAH::Mirrorsuite'
    nohup nice nice perl -MSTTPWSAMGPTTIPAWTMWSIUAH::Mirrorsuite -e 'mirror cpan.org .'
Re: Trimming the CPAN - Automatic Purging
On Apr 2, 2010, at 1:50, Arthur Corliss wrote: And my assertion has been that the excessive stats by the server are a bigger impediment to synchronization than the inode count.

Well, then one of us doesn't understand how file systems etc work. :-)

- ask
Re: Trimming the CPAN - Automatic Purging
On Fri, 2 Apr 2010, Ask Bjørn Hansen wrote: On Apr 2, 2010, at 1:50, Arthur Corliss wrote: And my assertion has been that the excessive stats by the server are a bigger impediment to synchronization than the inode count. Well, then one of us doesn't understand how file systems etc work. :-)

Indeed. If you're running UFS perhaps you might have a gripe. But with many filesystems in use supporting dynamic allocation groups with the inode data stored near the actual data blocks, along with b-tree indexing, this isn't as much of an issue for many of us.

--Arthur Corliss Live Free or Die
Re: Trimming the CPAN - Automatic Purging
On Thursday 01 April 2010 05:39:27 David Nicol wrote: On Wed, Mar 31, 2010 at 7:43 AM, Ask Bjørn Hansen a...@perl.org wrote: The main point here is that we can't use 20 inodes per distribution. so don't. How much reengineering would be needed to keep CPAN in a database instead of a file system? It'd mean each and every mirror operator changing how they sync their mirrors, and how access is provided... Currently, it's dead simple to sync a copy of CPAN via rsync, offer it up via whatever combination of HTTP, FTP and rsync you prefer, and job done - you're doing a valuable public service by offering a CPAN mirror. Make that process a lot harder (setting up database replication, custom scripts, etc etc) and a lot of people just won't do it. There's a lot to be said for keeping things simple. (FWIW, I run mirrors.uk2.net, and appreciated the fact it was simple and easy to get a mirror up and running without investing much time at all. Personally, I have no real problem with the current size of CPAN or the overhead of updating via rsync, but that's just my opinion.) Cheers Dave P
Re: Trimming the CPAN - Automatic Purging
Much of this discussion is beyond my depth but in terms of keeping it simple, and trying to limit the stat calls on the upstream servers, what about DNS as a replication model? You could break up the tree at logical divisions similar to zones and assign them serial numbers (say a .serial file) and then still use rsync, but broken up into modules to avoid recursion into sub-trees where the serial number is up to date? The rsyncd.conf could be published also so replicas use the same include/exclude logic. -lee
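A minimal sketch of the serial-number idea above, assuming each zone (subtree) publishes an integer serial that only ever increases, in the spirit of a DNS zone serial. The function names, the zone names, and the `rsync://upstream.example/CPAN` source are hypothetical; only the standard rsync `-a` and `--delete` options are taken as given. A mirror would compare serials first and run rsync only for the zones that are behind, so the upstream never walks the up-to-date subtrees.

```python
def stale_zones(upstream_serials, local_serials):
    """Return the zones whose upstream serial is newer than the local copy.

    Both arguments map a zone name (a subtree path) to an integer serial.
    A zone the mirror has never seen counts as stale.
    """
    return sorted(
        zone
        for zone, serial in upstream_serials.items()
        if local_serials.get(zone, -1) < serial
    )


def rsync_commands(zones, source="rsync://upstream.example/CPAN", dest="/srv/CPAN"):
    """Build one rsync invocation per stale zone (nothing is executed here)."""
    return [
        ["rsync", "-a", "--delete", f"{source}/{zone}/", f"{dest}/{zone}/"]
        for zone in zones
    ]
```

With this shape the only files every client fetches on every run are the small serial files; the cost of a full tree walk is paid per zone, and only when that zone has actually changed.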
Re: Trimming the CPAN - Automatic Purging
On Fri, 2 Apr 2010, Ask Bjørn Hansen wrote: I can't believe I'm doing this, but ... :-)

All for entertainment's sake...

The main point here is that we can't use 20 inodes per distribution. It's Just Nuts. Sure, it's only something like 400k files/inodes now - but at the rate it's going it'll be a lot more soon enough.

That's a problem, but not likely the biggest drag on server I/O you're suffering. Might that be ahem rsync?

That reply doesn't even make sense.

Then you've ignored most of this thread. Inode counts themselves aren't indicative of anything. It's the I/O access patterns that are. And my assertion has been that the excessive stats by the server are a bigger impediment to synchronization than the inode count.

You're right, I'm not arguing the need for the cruft. I've only pointed out the obvious reality that trimming files only postpones the I/O management issues that at some time are likely going to have to be addressed, anyway. And that you'll get less bang for the buck (or man hour) by treating the symptoms, not the disease. For the record: if that's what you want to do, have at it. Let's just not be disingenuous about the fact that we're abrogating our responsibilities as technologists by refusing to address the real problems and weaknesses of the platform.

You are confusing we, I and you again.

Perhaps.

Yes, I (and I'm guessing everyone else who has thought about it for more than say 5 seconds) agree that having rsync remember the file tree to save the disk IO for each sync sounds like an obvious solution. But reality is more complicated. If it was such an obviously good solution someone would have done it by now. (For starters play this question: What is the kernel cache?).

It hasn't been done because it's outside of the scope of design for rsync. It's meant to sync arbitrary filesets in which many, if not all, changes are made out of band. It's decidedly non-trivial to implement in that mode unless you're willing to accept a certain window in which your database may be out of date. But, in a situation like PAUSE, where the avenues in which files can be introduced into the file sets are controlled, it does become trivial. It's the gatekeeper, it knows who's been in or out.

Andreas' solution is much more sensible -- and as has been pointed out before we DO USE THAT; but the problem here is not with clients who are interested enough to do something special and dedicate resources to their CPAN mirroring.

By all means, I'm not opposed to any solution that actually addresses the problem. I don't agree that would be the fastest time to implementation, but there's no question that File::Rsync::Mirror::Recent would help things. I'd support (and help) that goal. My objections are more properly directed to those stuck on just deleting files from the tree.

--Arthur Corliss Live Free or Die
Re: Trimming the CPAN - Automatic Purging
On Wed, Mar 31, 2010 at 01:03:51PM +1100, Adam Kennedy wrote: I've said nothing till now, because I figured more noise wouldn't help much. But I quite like the rsync daemon/proxy idea, and as it so happens I'm attending the OzLabs Unconference in 3 weeks' time to hang out with Tridge, Rusty and the other Australia C/Kernel/Samba/RSync elites. So I'd be happy to raise any issues or ideas in this area with them in person over beers.

I can see two possibly useful things (and I have no idea if either is yet possible, or a great understanding of how the protocol works)

1: stateful rsync daemon which doesn't scan all the time, either by
   a: Actually having a means to update
   b: Simply telling fibs, and pretending that the file system it scanned $n minutes ago is still current. (Which I think would work, at least for a mirror where files aren't edited (much) - if the server discovers that the client's view of that file *is* out of date, then scan that file for real, and give the up to date truth)
2: federated (or federate-able) server (or proxy) - so that you can say hand this subtree off to that other server

This would allow the (fast, existing, C) rsync server to serve most of (say) funet.fi, handing off to a stateful server for the CPAN subtree.

Nicholas Clark
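Option 1b above (serve a slightly stale listing, and stat a single file for real only when asked) could be sketched roughly as follows. This is an illustration only, not how rsync works or could be patched: the `CachedTree` class, its method names, and the ten-minute default are all hypothetical, and the rsync protocol itself is not modelled.

```python
import os
import time


class CachedTree:
    """Remember a directory scan and reuse it until it is max_age_seconds old."""

    def __init__(self, root, max_age_seconds=600, clock=time.time):
        self.root = root
        self.max_age = max_age_seconds
        self.clock = clock          # injectable so the expiry logic can be tested
        self.scanned_at = None
        self.entries = {}           # relative path -> (size, mtime)

    def _scan(self):
        """The expensive part: walk the tree and stat every file."""
        entries = {}
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                entries[os.path.relpath(path, self.root)] = (st.st_size, int(st.st_mtime))
        self.entries = entries
        self.scanned_at = self.clock()

    def listing(self):
        """Return the remembered listing, rescanning only if it has expired."""
        if self.scanned_at is None or self.clock() - self.scanned_at > self.max_age:
            self._scan()
        return dict(self.entries)

    def verify(self, relpath):
        """Stat one file for real and refresh its entry; None if it is gone."""
        try:
            st = os.stat(os.path.join(self.root, relpath))
        except FileNotFoundError:
            self.entries.pop(relpath, None)
            return None
        self.entries[relpath] = (st.st_size, int(st.st_mtime))
        return self.entries[relpath]
```

The trade-off is the one described in the message: between scans the listing can be wrong, which is tolerable for an archive where files are added but rarely edited, and `verify` gives the up-to-date answer for any single file a client disputes.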
Re: Trimming the CPAN - Automatic Purging
On Mar 31, 2010, at 6:52, David Nicol wrote: new proposal: Make modules pay rent in order to remain on a mirror. Rent could be in the form of actual user interest, or good reviews.

Now you are proposing purging useless stuff from CPAN -- that's a lot more radical than Tim's proposal of just purging _old_ useless stuff.

- ask
Re: Trimming the CPAN - Automatic Purging
On Wed, Mar 31, 2010 at 7:43 AM, Ask Bjørn Hansen a...@perl.org wrote: The main point here is that we can't use 20 inodes per distribution. so don't. How much reengineering would be needed to keep CPAN in a database instead of a file system?
Re: Trimming the CPAN - Automatic Purging
On Sun, Mar 28, 2010 at 11:48:00AM -0500, Randy Kobes wrote: Has some sort of disk quota system for CPAN author accounts ever been considered? There are authors with 100 distributions. There are authors with just one distribution. There are authors with big distributions, and authors with only tiny distributions. I'd not be in favour of anything like that, which would impose burdens on authors (prolific authors - the most prolific being pumpkings: perl lives in their PAUSE directories - would have to contact admins to get their quota increased) and on the volunteer admins (who would have to decide whether to increase someone's quota or not). OK, so I have a vested interest: my CPAN directory is, in terms of size, number 37 out of 4900-something, because I have two *really* big distributions. For both of those I delete older versions when I think it appropriate. However, the load on rsync servers doesn't really come from the size of files - no matter whether you use rsync or some other protocol, they still have to serve those big files out at some point, once to each person who mirrors from them. The real load is the *number* of files, and hence the number of stats they have to do when someone asks rsync for changes. If you really want to reduce the load, how about getting rid of the CHECKSUM files and all the extracted blah.readme files in authors' directories? I'm kinda tempted to say the same about the .meta files as well, although I imagine they're more useful to some downstream reusers of the archive. -- David Cantrell | Nth greatest programmer in the world When one has bathed in Christ there is no need to bathe a second time -- St. Jerome, on why washing is a vile pagan practice in a letter to Heliodorus, 373 or 374 AD
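To put a number on the point above (the load comes from the count of files to stat, not their size), a mirror operator could tally how many entries fall into the candidate-for-removal categories. This is a rough sketch: the function names are hypothetical, it works on a plain list of relative paths rather than touching the disk, and it assumes the checksum files are named `CHECKSUMS` and the extracted files end in `.readme` and `.meta`.

```python
from collections import Counter


def classify(path):
    """Bucket one archive path by the kind of file it is."""
    name = path.rsplit("/", 1)[-1]
    if name == "CHECKSUMS":
        return "checksums"
    if name.endswith(".readme"):
        return "readme"
    if name.endswith(".meta"):
        return "meta"
    return "other"


def stat_breakdown(paths):
    """Count paths per bucket; each path costs the server one stat per sync."""
    return Counter(classify(p) for p in paths)
```

Feeding it the output of a one-off file listing of the mirror would show what share of the per-sync stats the extracted and checksum files account for, before anyone decides whether dropping them is worth the disruption.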
Re: Trimming the CPAN - Automatic Purging
On Mon, Mar 29, 2010 at 12:02:11AM -0800, Arthur Corliss wrote: I think it would be a worthy cause ultimately, but certainly a much longer time to implementation, and considerably more effort. Kind of sounds like the normal stonewalling I've been getting these last few days by our resident rsync fetishists. Very ironic. I use the hell out of rsync, just more discriminately than you guys, and yet I'm public enemy number one.

You know how I use it? Damn, I don't remember giving you accounts on my machines so you could look at my cron jobs.

Live Free or Die

Try living polite.

-- David Cantrell | semi-evolved ape-thing You can't spell AWESOME without ME!
Re: Trimming the CPAN - Automatic Purging
Hi Elaine, Elaine Ashton wrote: On Mar 28, 2010, at 12:48 PM, Randy Kobes wrote: Jarkko and I were talking about it this morning - as he's not in favour of pruning - while trying to think of a way around the size problem and he reminded me of the idea, which if I recall correctly was Andreas' suggestion a while back, that there be an A, B and C 'PAN' of sorts where you could pull varying degrees of content - sort of CPAN::Mini writ large. I don't think that idea ever got any traction because it wouldn't really solve some of the issues for the major upstream mirrors and the mechanics of deciding where to draw the lines between them. I still think it's a good idea though.

This sounds a bit like the CPAN - backpan scheme but with some additional levels?

I do very much like Tim's proposal for giving old modules a push to BackPAN since, with proper communication of the changes to the authors along with a way to mark exceptions, this would rid CPAN of a lot of cruft that should be on BackPAN anyway.

I'm not even going to throw in my considerable weight on this whole debate of pruning*. But if backpan became the official way to access old versions starting from yesterday's, wouldn't that mean: a) That the toolchain would have to be adapted to a tiered infrastructure (think of the indexes...) and more importantly: b) The backpan would have to be mirrored all over the place as well, thus pushing the problem to the next level?

Best regards, Steffen

* If you must know, I don't like the means but sympathize with the goals.

PS: This isn't targeted at Elaine specifically, but can everybody please take a step back and relax? Please be civil.
Re: Trimming the CPAN - Automatic Purging
On Sun, 28 Mar 2010, Andy Armstrong wrote: We're nearly there if A == a CPAN::Mini style mirror, B == the current mirror pruned and C == backpan. So the actions to make that happen are: * give the current clients specific support for this * generate a master mini mirror that other mini mirrors can pull from. * prune If we agree that this is a good solution I'm happy to do some work on it - I could host the mini master and I'd be happy to send Andreas a patch for CPAN.pm to support this scheme. It should be pointed out that this is only viable under the assumption that you have a separate pool of servers for each tier. Again, this is just load balancing, not load optimization. That said, if you have the volunteers, then why not. Perhaps I can offer a system to support mirroring up here in Alaska. --Arthur Corliss Live Free or Die
Re: Trimming the CPAN - Automatic Purging
On Sat, Mar 27, 2010 at 09:38:16PM -0400, Elaine Ashton wrote: I suppose I don't understand the opposition to trimming off the obvious cruft on CPAN to lighten the load when BackPAN exists to archive them. There is already CPAN::Mini (which was created back when CPAN was an ever-so-tiny 1.2GB) so it's not as though lightening the load is a new idea or an unwelcome one. My understanding is that CPAN::Mini is aimed more at end-users who want to have CPAN-onna-(memory)-stick or on a laptop. Back in 2004, dedicating 1.2GB of laptop space was rather more significant than it is now - my laptop at the time had something like 30GB, and that had to include the OS and all my mp3s. A CPAN-onna-stick was very useful at hackathons and on train journeys. -- David Cantrell | Official London Perl Mongers Bad Influence I think the most difficult moment that anyone could face is seeing their domestic servants, whether maid or drivers, run away -- Abdul Rahman Al-Sheikh, writing at http://www.arabnews.com/?article=38558
Re: Trimming the CPAN - Automatic Purging
On Sat, Mar 27, 2010 at 10:52:05AM -0800, Arthur Corliss wrote: I think I was quite explicit in saying that efficiencies should be pursued in multiple areas, but the predominant bitch I took away from your thread dealt with the burden of synchronizing mirrors. What's the easiest way to address that pain? I don't believe it's your method. I'd look into the size issue *after* you address the incredible inefficiencies of a simple rsync. I You? Or someone else? I am quite happy to agree that your understanding and experience of storage management is better than mine. But that's not the key question, in a volunteer organisation. The questions I ask, repeating Jan's comments in another message, are. Nicholas Clark
Re: Trimming the CPAN - Automatic Purging
On Mar 28, 2010, at 12:52 AM, Arthur Corliss wrote: :-) You'll have to pardon my indiscriminate epithets. The barbs are coming from multiple directions. My point still stands, however. Your experience, however worthy, has zero bearing on whether or not my experience is just as worthy. Even moreso when you guys have zero clue who you're talking to. And you shouldn't have to know. I would have thought simple communal and professional courtesy would be extended and all points considered in earnest. Which does not appear to be the case.

I'm not sending any barbs, only my reasonable opinion born from years on the reality-based operations side of this equation. As for who you are, it doesn't matter as I work daily with those who wrote, and continue to write, large chunks of operating systems, X, etc., and though their legend may precede them when it comes to my having to implement what works fabulously in their imagination, I do my best to bring them back to the grim reality that is operations. It's a frequent problem of engineers and those of us stuck having to live with and fix their grand ideas. Lofty goals usually die somewhere between dreams and production.

Ah, you're one of them. All objects look like nails when all you have is a hammer, eh? Rsync is a good tool, but like Perl, it isn't the perfect tool for all tasks. You've obviously exceeded what the tool was designed for, it's only logical to look for (or write) another tool. Ironically, what I'm suggesting is so basic that rsync can be replaced by a script which will likely run on every mirror out there with no more fuss than rsync.

Well, you'll have to forgive those who mock your naïveté as, if it were so basic and trivial to replace rsync, it would have been done several times over by now, as its limitations are well known to all who use it on any large scale. However, it is a well-known, well-used, multi-platform and time-tested tool that will not be unseated very easily without good reason, and a reason that reads something along the lines of improving performance on an archive that should have been trimmed back a bit is not a compelling reason for adoption.

What you're overlooking is that CPAN has, and will, continue to grow. Even if you remove the cruft now at some point it might grow to the same size just with fresh files. When that happens, you're right back where you are now. Rsync can't cut it, it wasn't designed for this.

And this is a good point to make, yes, it will continue to grow and I know that the current manager(s) of nic.funet.fi have commented on the burden it presents to the system which is also home to a number of other mirrors. You cannot assume that the generosity and the resources of the mirror ops are limitless and finding out where that limit lies will come too late to make amends. Pruning back the archive is a good compromise until and unless another solution can be done that will not bother the mirror ops terribly much in terms of real work.

e.
Re: Trimming the CPAN - Automatic Purging
On Sat, Mar 27, 2010 at 08:52:22PM -0800, Arthur Corliss wrote: On Sat, 27 Mar 2010, Elaine Ashton wrote: Actually, I thought I was merely offering my opinion both as the sysadmin for the canonical CPAN mothership and as an end-user. If that makes me a prick, well, I suppose I should go out and buy one :) :-) You'll have to pardon my indiscriminate epithets. The barbs are coming from multiple directions. My point still stands, however. Your experience, however worthy, has zero bearing on whether or not my experience is just as worthy. Even moreso when you guys have zero clue who you're talking

Are you running a large public mirror site, where you don't even have knowledge of who is mirroring from you? (Not even knowledge, let alone channels of communication with, let alone control over) Because (as I see it, not having done any of this) the logistics of that is going to have as much bearing on trying to change protocols as the actual technical merits of the protocol itself. Most of the cost of rsync is an externality to the clients. If one has an existing mirror, one is using rsync to keep it up to date, what's the incentive to change?

Sounds like you may be hamstrung by your own bureaucracy, but that's rarely the case in most of the places I've worked. Not to mention that between passive mode FTP or even using an HTTP proxy (most of which support FTP requests) what I'm proposing is relatively painless, simple, and easy to secure. This concern I suspect is a non-issue for most mirror operators. Even if it was, allow them to pull it via HTTP for all I care. Either one is significantly more efficient than rsync.

I'm missing something here, I suspect. How can HTTP be more efficient than rsync? The only obvious method to me of mirroring a CPAN site by HTTP is to instruct a client (such as wget) to get it all. In which case, in the course of doing this the client is going to recurse over the entire directory tree of the server, which, I thought, was functionally equivalent to the behaviour of the rsync server.

Nicholas Clark
Re: Trimming the CPAN - Automatic Purging
On 2010-03-28, at 9:13 AM, Elaine Ashton wrote: On Mar 28, 2010, at 12:52 AM, Arthur Corliss wrote: What you're overlooking is that CPAN has, and will, continue to grow. Even if you remove the cruft now at some point it might grow to the same size just with fresh files. When that happens, you're right back where you are now. Rsync can't cut it, it wasn't designed for this. And this is a good point to make, yes, it will continue to grow and I know that the current manager(s) of nic.funet.fi have commented on the burden it presents to the system which is also home to a number of other mirrors. You cannot assume that the generosity and the resources of the mirror ops are limitless and finding out where that limit lies will come too late to make amends. Pruning back the archive is a good compromise until and unless another solution can be done that will not bother the mirror ops terribly much in terms of real work. e. Has some sort of disk quota system for CPAN author accounts ever been considered? -- best regards, Randy
Re: Trimming the CPAN - Automatic Purging
On Sun, 28 Mar 2010, Ask Bjørn Hansen wrote: You are misunderstanding the problem of changing the mirroring mechanism.

I am not misunderstanding, I'm just willing to accept the reality for what it is. Rsync does not scale. Period.

Making new software is nice and good -- Andreas already has something that's better for the PAUSE data.

<G> That makes my point all the more compelling, then. Some of the work has already been done.

Getting 1000s of mirrors to use your software (rather than rsync, which they use for ALL OTHER mirrors) -- not so easy.

Perhaps, but it's also possible that it might not be as bad as you think, either. You have a strong case to be made that the entire ecosystem benefits from making this change (particularly in a tiered mirroring environment), and I'd be surprised if the majority of the mirror operators aren't sympathetic and cooperative. As a sys-admin I watch my SAR reports like a hawk, I'm sure they're no different. And that's not to say you have to eliminate rsync. If you can get half of them to stop, you'll still have some significant long term gains.

--Arthur Corliss Live Free or Die
Re: Trimming the CPAN - Automatic Purging
On Sun, 28 Mar 2010, Elaine Ashton wrote: I'm not sending any barbs, only my reasonable opinion born from years on the reality-based operations side of this equation. As for who you are, it doesn't matter as I work daily with those who wrote, and continue to write, large chunks of operating systems, X, etc., and though their legend may precede them when it comes to my having to implement what works fabulously in their imagination, I do my best to bring them back to the grim reality that is operations. It's a frequent problem of engineers and those of us stuck having to live with and fix their grand ideas. Lofty goals usually die somewhere between dreams and production.

Ah, let the chest thumping begin. My point is that regardless of where the idea comes from if it comes from a solid rationale it should be given consideration. And to date I have yet to see any one of you refute my technical understanding of the problem, only my political understanding of the problem. I/O is the issue, and it is driven predominantly by rsync.

Well, you'll have to forgive those who mock your naïveté as, if it were so basic and trivial to replace rsync, it would have been done several times over by now, as its limitations are well known to all who use it on any large scale. However, it is a well-known, well-used, multi-platform and time-tested tool that will not be unseated very easily without good reason, and a reason that reads something along the lines of improving performance on an archive that should have been trimmed back a bit is not a compelling reason for adoption.

Naivete? Again: show me where my assertions about the primary root of your problem are incorrect? Show me how pruning CPAN isn't a temporary band-aid that fails to address a fundamental weakness in the syncing process? You haven't. You can try to dress it up any way you like in an effort to discredit me, but until you do based on the facts, you have nothing. Rsync is a good tool, but for different use case scenarios.

And this is a good point to make, yes, it will continue to grow and I know that the current manager(s) of nic.funet.fi have commented on the burden it presents to the system which is also home to a number of other mirrors. You cannot assume that the generosity and the resources of the mirror ops are limitless and finding out where that limit lies will come too late to make amends.

<G> And you make my point for me. I'm sure he would love to find a more efficient use of his I/O. I assume nothing, I only allow that you'll find more interest than you assume in managing I/O. Nor does what I'm proposing preclude the intractable from continuing to use rsync. Given that rsync is your driver of the I/O problem, taking away any significant percentage of the problem will have the largest dividends.

Pruning back the archive is a good compromise until and unless another solution can be done that will not bother the mirror ops terribly much in terms of real work.

At least you admit you're only treating the symptoms now, not the disease itself. Sure, it will buy you some time, but there'll also be some political problems to work through which will likely burn as much if not more manhours than just treating the disease. And in the end time runs out and the problem remains. Look, I don't care if you guys decide against it, but let's be honest about the compromises you're making. Hell, pruning isn't even a compromise, it's not a solution, it's only a delaying tactic.

--Arthur Corliss Live Free or Die
Re: Trimming the CPAN - Automatic Purging
On Mar 28, 2010, at 12:48 PM, Randy Kobes wrote: Has some sort of disk quota system for CPAN author accounts ever been considered?

Not specifically, no, at least not that I'm aware of. That would have to be implemented on PAUSE and quotas frequently end up not solving the real problem and create a headache both for the sysadmin and the users.

Jarkko and I were talking about it this morning - as he's not in favour of pruning - while trying to think of a way around the size problem and he reminded me of the idea, which if I recall correctly was Andreas' suggestion a while back, that there be an A, B and C 'PAN' of sorts where you could pull varying degrees of content - sort of CPAN::Mini writ large. I don't think that idea ever got any traction because it wouldn't really solve some of the issues for the major upstream mirrors and the mechanics of deciding where to draw the lines between them. I still think it's a good idea though.

I do very much like Tim's proposal for giving old modules a push to BackPAN since, with proper communication of the changes to the authors along with a way to mark exceptions, this would rid CPAN of a lot of cruft that should be on BackPAN anyway.

e.
Re: Trimming the CPAN - Automatic Purging
On 28 Mar 2010, at 19:32, Elaine Ashton wrote: Jarkko and I were talking about it this morning - as he's not in favour of pruning - while trying to think of a way around the size problem and he reminded me of the idea, which if I recall correctly was Andreas' suggestion a while back, that there be an A, B and C 'PAN' of sorts where you could pull varying degrees of content - sort of CPAN::Mini writ large. I don't think that idea ever got any traction because it wouldn't really solve some of the issues for the major upstream mirrors and the mechanics of deciding where to draw the lines between them. I still think it's a good idea though.

We're nearly there if A == a CPAN::Mini style mirror, B == the current mirror pruned and C == backpan. So the actions to make that happen are:

* give the current clients specific support for this
* generate a master mini mirror that other mini mirrors can pull from.
* prune

If we agree that this is a good solution I'm happy to do some work on it - I could host the mini master and I'd be happy to send Andreas a patch for CPAN.pm to support this scheme.

-- Andy Armstrong, Hexten
Re: Trimming the CPAN - Automatic Purging
On 27 Mar 2010, at 00:59, Elaine Ashton wrote: The only snag I can foresee in trimming back on the abundance of modules is the case where some modules have version requirements for other modules where it will barf with a mismatch/newer version of the required module (I bumped into this recently but can't remember exactly which module it was) but I think it's rare and the practise should be discouraged.

Maybe that could be solved by having the clients (and maybe search.cpan.org) automagically fall back to a backpan mirror? And, yes, if it's considered a good idea I /am/ prepared to do something about it.

-- Andy Armstrong, Hexten
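The fallback suggested above might look something like this on the client side. It is a sketch under stated assumptions, not a patch for CPAN.pm: `fetch_with_fallback` and the `fetch` callable are hypothetical, and the actual download is injected so the ordering logic stands on its own. The mirror list would put the regular (pruned) mirror first and a backpan mirror last.

```python
def fetch_with_fallback(dist_path, mirrors, fetch):
    """Try each mirror in order and return (mirror, data) for the first hit.

    fetch(mirror, dist_path) is supplied by the caller and must return the
    file contents, or None when that mirror does not have the file.
    """
    for mirror in mirrors:
        data = fetch(mirror, dist_path)
        if data is not None:
            return mirror, data
    raise LookupError(f"{dist_path} not found on any mirror")
```

A client using this would only touch the backpan mirror for distributions that have been pruned from the main archive, so the exact-version dependency case keeps working without every mirror carrying the old files.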
Re: Trimming the CPAN - Automatic Purging
On 27 Mar 2010, at 00:59, Andy Armstrong wrote: On 27 Mar 2010, at 00:59, Elaine Ashton wrote: The only snag I can foresee in trimming back on the abundance of modules is the case where some modules have version requirements for other modules where it will barf with a mismatch/newer version of the required module (I bumped into this recently but can't remember exactly which module it was) but I think it's rare and the practise should be discouraged. Maybe that could be solved by having the clients (and maybe search.cpan.org) automagically fall back to a backpan mirror? And, yes, if it's considered a good idea I am prepared to do something about it.

Exactly what I wrote in my previous mail; nobody commented, so I was wondering if I was wrong! In any case, we do now have a better understanding of the problem and, most important, we have a real user (Elaine) wishing for something to be done.

Andreas, Chris, Tatsuhiko and others have done a tremendous job implementing stuff but I must admit that I would have liked to see a list of what they are implementing. Not to mention the need to see a context diagram.

IMVHO the first thing we should do is have a requirement list of what CPAN actors (clients, pause, mirrors, search engines, ...) should do. Maybe that document already exists somewhere. The implications for CPAN, ExtUtils, Module::Build, and all other, still unknown, modules are, I believe, not to be underestimated.

Andy (since you are the first to really volunteer (and now you don't have any choice anymore;)), count me in for whatever development time is needed to get things moving.

Ask, this thread is getting a tad long and although I'm very happy to see more input, requirements and ideas, would it be possible to see some condensed results somewhere?

Cheers, Nadim.
Re: Trimming the CPAN - Automatic Purging
On Sat, 27 Mar 2010, Nicholas Clark wrote: I You? Or someone else? I am quite happy to agree that your understanding and experience of storage management is better than mine. But that's not the key question, in a volunteer organisation. The questions I ask, repeating Jan's comments in another message, are. Oh, I understand that fully. And I'd be happy to lend some of my time. But you don't make people inclined to help when people are lobbing snarky comments like we'll wait breathlessly for you to do it. The impression I'm getting from most of you right now is that you're hell bent on solving the problem your way, and no one is interested in exploring the technical merits of other approaches. Hell, I would even help with work towards your desired method *if* I thought that was the consensus after a genuine exchange and consideration of ideas. I definitely won't should it appear that we have some kind of elitist cabal that will make their decision in isolation. If that's going to be the case then this should have never been raised on an open forum like the module author's list. Quite frankly, at times some discussions on this list fail the concept of a technical meritocracy, and tend towards an established aristocracy. --Arthur Corliss Live Free or Die
Re: Trimming the CPAN - Automatic Purging
On Fri, Mar 26, 2010 at 03:02:22PM -0800, Arthur Corliss wrote: Why use rsync, then? Why not have checkpointed logs on cpan with additions/removals logged by date so you can roll forward on the client, processing only those files? It would be trivial to set up and a lot more efficient. Because the most important mirror sites mirror CPAN as just a very small part of what they do. They won't want to have to use weird tools for just that tiny corner of their disk. -- David Cantrell | London Perl Mongers Deputy Chief Heretic I caught myself pulling grey hairs out of my beard. I'm definitely not going grey, but I am going vain.
Re: Trimming the CPAN - Automatic Purging
Oh, I understand that fully. And I'd be happy to lend some of my time. But you don't make people inclined to help when people are lobbing snarky comments like we'll wait breathlessly for you to do it. The time-honored tradition of many open source communities is to talk. And talk. And talk. The problem is that this solves nothing. To do, does. You are free to decide to take this as a personal insult.
Re: Trimming the CPAN - Automatic Purging
On Sat, 27 Mar 2010, Jarkko Hietaniemi wrote: The time-honored tradition of many open source communities is to talk. And talk. And talk. The problem is that this solves nothing. To do, does. You are free to decide to take this as a personal insult. I didn't take it as an insult, I took it as what it was -- a dodge. You already have your minds made up and are not willing to evaluate options on their merits. Let's just be honest about what's going on here. --Arthur Corliss Live Free or Die
Re: Trimming the CPAN - Automatic Purging
On Sat, 27 Mar 2010, Elaine Ashton wrote: Actually, I thought I was merely offering my opinion both as the sysadmin for the canonical CPAN mothership and as an end-user. If that makes me a prick, well, I suppose I should go out and buy one :) :-) You'll have to pardon my indiscriminate epithets. The barbs are coming from multiple directions. My point still stands, however. Your experience, however worthy, has zero bearing on whether or not my experience is just as worthy. Even more so when you guys have zero clue who you're talking to. And you shouldn't have to know. I would have thought simple communal and professional courtesy would be extended and all points considered in earnest. Which does not appear to be the case. And you're disregarding a considerable problem that rsync is a well-established tool for mirroring that is easy to use and works on a very wide range of platforms. Asking mirror ops to adopt a new tool for mirroring one mirror, when they often have several or more, likely won't be met with much enthusiasm and would create two tiers of CPAN mirrors, those using rsync and those not, which would not only complicate something which should remain simple but, again, doesn't address the size of the archive and the multitude of small files that are always a consideration no matter what you're serving them up with. Ah, you're one of them. All objects look like nails when all you have is a hammer, eh? Rsync is a good tool, but like Perl, it isn't the perfect tool for all tasks. You've obviously exceeded what the tool was designed for, it's only logical to look for (or write) another tool. Ironically, what I'm suggesting is so basic that rsync can be replaced by a script which will likely run on every mirror out there with no more fuss than rsync. FTP? It's 2010 and very few corp firewalls allow ftp in or out. I can't remember the last time I even used ftp come to think of it.
I had to go through 2 layers of network red tape just to get rsync for a particular system I wanted to mirror CPAN to at work. Asking for FTP would have been met with a big no or a cackle, depending on which of the nyetwork masters got the request first. Sounds like you may be hamstrung by your own bureaucracy, but that's rarely the case in most of the places I've worked. Not to mention that between passive mode FTP or even using an HTTP proxy (most of which support FTP requests) what I'm proposing is relatively painless, simple, and easy to secure. This concern I suspect is a non-issue for most mirror operators. Even if it was, allow them to pull it via HTTP for all I care. Either one is significantly more efficient than rsync. How is replacing rsync, a standard and widely used tool, simpler for mirror ops? I suppose I don't understand the opposition to trimming off the obvious cruft on CPAN to lighten the load when BackPAN exists to archive them. There is already CPAN::Mini (which was created back when CPAN was an ever-so-tiny 1.2GB) so it's not as though lightening the load is a new idea or an unwelcome one. I'm not opposed to trimming the cruft, but I am opposed to ignorant knee-jerk reactions bereft of any empirical data (or at least none that you've shared). The cruft, while being cruft, isn't inherently evil. You have a basic I/O and state problem. And the I/O generated is predominantly caused by rsync trying to (re)assemble state on the file set, *per* request. More appallingly, most of that state image being generated is state that hasn't changed in quite a while. Literally years in many cases. So why are we wasting cycles and I/O performing massively redundant work? That's why having PAUSE implement a transaction log, and perhaps a cron job on the master server doing daily checkpointed file manifests is so much more efficient. An in-sync mirror only needs to download the latest transaction logs and play them forward (delete certain files, download others, etc).
And, gee, just about every author on the list could write *that* sync agent in an evening. Out-of-sync mirrors can start by working off the checkpoint manifest, getting what's missing, and rolling forward. What you're overlooking is that CPAN has, and will, continue to grow. Even if you remove the cruft now at some point it might grow to the same size just with fresh files. When that happens, you're right back where you are now. Rsync can't cut it, it wasn't designed for this. Whether you like it or not, even on a pared down CPAN rsync is easily your most inefficient process on the server. If you're not willing to optimize that, then you really don't care about optimization at all. --Arthur Corliss Live Free or Die
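The roll-forward scheme described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not anything PAUSE provides: the log format (one `<seq> <add|del> <path>` record per line) and the names `parse_log` and `roll_forward` are made up for the example, and a real agent would download and delete files instead of updating an in-memory set.

```python
def parse_log(text):
    """Parse hypothetical log lines of the form '<seq> <add|del> <path>'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        seq, action, path = line.split(maxsplit=2)
        if action not in ("add", "del"):
            raise ValueError(f"unknown action: {action!r}")
        entries.append((int(seq), action, path))
    return sorted(entries)  # replay in sequence order


def roll_forward(files, entries, last_seq):
    """Apply every entry newer than last_seq to a set of mirrored paths.

    Returns (new file set, new last_seq, paths that must be downloaded).
    """
    files = set(files)
    to_fetch = []
    for seq, action, path in entries:
        if seq <= last_seq:
            continue  # already applied on an earlier run
        if action == "add":
            files.add(path)
            if path not in to_fetch:
                to_fetch.append(path)
        else:
            files.discard(path)
            if path in to_fetch:
                to_fetch.remove(path)  # added and deleted again: nothing to fetch
        last_seq = seq
    return files, last_seq, to_fetch
```

An out-of-sync mirror would, under the same assumptions, start from the file set and sequence number recorded in the most recent checkpoint manifest and call `roll_forward` with the log entries published since then.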
Re: Trimming the CPAN - Automatic Purging
On Mar 26, 2010, at 4:55 AM, Lars Thegler wrote: I appreciate that the number of files on CPAN has implications for the infrastructure, but I feel a need to have some more factual info before conceding to such measures. Absolutely. This factual info would ideally look like this: Of the 17,000 distros on CPAN, there are 8,000 that have versions more than a year older than the most recent one. If those distros with versions more than a year out of date were purged, the number of files would decrease from 200,000 to 120,000. This would save 7GB out of the 12GB that a full CPAN mirror takes now. Removing that 7GB would mean Benefit X to mirror owners. Without that, how can module authors be bothered to care? xoxo, Andy -- Andy Lester = a...@petdance.com = www.theworkinggeek.com = AIM:petdance
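Figures of the kind asked for above could be gathered with something like the following sketch. It assumes a hypothetical list of `(distribution, upload_date, size_in_bytes)` records, which no message in this thread actually supplies; `purge_stats` is an invented name, and the counts quoted in the message above are illustrative placeholders, not output of this code.

```python
from datetime import date, timedelta


def purge_stats(releases, max_age=timedelta(days=365)):
    """Count releases more than max_age older than the newest release
    of the same distribution.

    releases: iterable of (dist_name, upload_date, size_bytes) tuples.
    """
    releases = list(releases)

    # Newest upload date seen for each distribution.
    newest = {}
    for dist, uploaded, _size in releases:
        if dist not in newest or uploaded > newest[dist]:
            newest[dist] = uploaded

    stale_files = 0
    stale_bytes = 0
    stale_dists = set()
    for dist, uploaded, size in releases:
        if newest[dist] - uploaded > max_age:
            stale_files += 1
            stale_bytes += size
            stale_dists.add(dist)

    return {
        "total_files": len(releases),
        "total_bytes": sum(size for _d, _u, size in releases),
        "stale_files": stale_files,
        "stale_bytes": stale_bytes,
        "stale_dists": len(stale_dists),
    }
```

Running this over a real index would give the before and after file counts and sizes; what "Benefit X" amounts to for mirror owners would still have to be measured separately.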
Re: Trimming the CPAN - Automatic Purging
On Fri, 26 Mar 2010, Andy Lester wrote: Absolutely. This factual info would ideally look like this: Of the 17,000 distros on CPAN, there are 8,000 that have versions more than a year older than the most recent one. If those distros with versions more than a year out of date were purged, the number of files would decrease from 200,000 to 120,000. This would save 7GB out of the 12GB that a full CPAN mirror takes now. Removing that 7GB would mean Benefit X to mirror owners. Without that, how can module authors be bothered to care? If you don't mind me interjecting, I still can't be bothered to care. We have basically a 12GB data set, and we're worried about that? I see that as a small barrier to bringing on new mirrors on constrained pipes, but ultimately that's not that big a deal. Hell, there's single versions of some Linux distros that are bigger than that. End sum: I personally don't think this is the most pressing issue facing CPAN. Just issue a best practices guide to all the module authors (or include it as on-line documentation in PAUSE) and be done with it. --Arthur Corliss Live Free or Die
Re: Trimming the CPAN - Automatic Purging
On Friday-201003-26 13:20, Arthur Corliss wrote: On Fri, 26 Mar 2010, Andy Lester wrote: Absolutely. This factual info would ideally look like this: Of the 17,000 distros on CPAN, there are 8,000 that have versions more than a year older than the most recent one. If those distros with versions more than a year out of date were purged, the number of files would decrease from 200,000 to 120,000. This would save 7GB out of the 12GB that a full CPAN mirror takes now. Removing that 7GB would mean Benefit X to mirror owners. Without that, how can module authors be bothered to care? If you don't mind me interjecting, I still can't be bothered to care. We have basically a 12GB data set, and we're worried about that? I see that as a small barrier to bringing on new mirrors on constrained pipes, but ultimately that's not that big a deal. Hell, there's single versions of some Linux distros that are bigger than that. The total size is not the problem. The number of files is. Vanilla rsync is horribly inefficient (not the protocol, which is genius, mind) because a client coming by and asking for updates basically ends up requiring the moral equivalent of find . -type f -print. Let me repeat that: each client. Not fun.
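The per-client cost described above can be made concrete with a rough model. The sketch below is not how rsync is implemented; `full_scan` simply imitates the `find . -type f -print` style traversal, and `stats_per_day` multiplies it out. The function names, and any client counts or sync frequencies fed to it, are assumptions for illustration.

```python
import os
import stat


def full_scan(root):
    """Stat every regular file under root, roughly what a file-list
    build has to do for each connecting client."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # one stat call per file, changed or not
            if stat.S_ISREG(st.st_mode):
                rel = os.path.relpath(path, root)
                listing[rel] = (st.st_size, int(st.st_mtime))
    return listing


def stats_per_day(num_files, num_clients, syncs_per_client_per_day):
    """Back-of-envelope count of stat calls the server performs daily."""
    return num_files * num_clients * syncs_per_client_per_day
```

The work done by `full_scan` depends only on how many files exist, not on how many have changed, which is why the file count rather than the total size dominates the load.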
Re: Trimming the CPAN - Automatic Purging
On Friday-201003-26 19:02, Arthur Corliss wrote: On Fri, 26 Mar 2010, Jarkko Hietaniemi wrote: The total size is not the problem. The number of files is. Vanilla rsync is horribly inefficient (not the protocol, which is genius, mind) because a client coming by and asking for updates basically ends up requiring the moral equivalent of find . -type f -print. Let me repeat that: each client. Not fun. Why use rsync, then? Why not have checkpointed logs on cpan with additions/removals logged by date so you can roll forward on the client, processing only those files? It would be trivial to set up and a lot more efficient. We await your implementation breathlessly. By the time all the CPAN mirrors have started using that, we probably will be rather blue in the face. --Arthur Corliss Live Free or Die
Re: Trimming the CPAN - Automatic Purging
On Mar 26, 2010, at 16:02, Arthur Corliss wrote: Why use rsync, then? Why not have checkpointed logs on cpan with additions/removals logged by date so you can roll forward on the client, processing only those files? It would be trivial to set up and a lot more efficient. I find it curious that everyone who's actually involved in syncing the files or running mirror servers seem to think it generally sounds like a good idea and everyone who doesn't say it's not worth the effort. Anyway -- we have some other ideas for cutting down the number of files that we already agreed on but just needs announcement (which I promised to write up, oops). No, I'm not going to make Tim's mistake and suggest it here first. Tim: Next time just get the paint in your preferred color. :-) - ask
Re: Trimming the CPAN - Automatic Purging
On Fri, 26 Mar 2010, Ask Bjørn Hansen wrote: I find it curious that everyone who's actually involved in syncing the files or running mirror servers seem to think it generally sounds like a good idea and everyone who doesn't say it's not worth the effort. Sure, I don't run a CPAN mirror, but I do manage many, many terabytes of storage as part of my day job. I think it's a tad presumptuous to disregard input just because we're not in your inner sanctum. As I mentioned in a follow up e-mail: this is simply a matter of selecting the correct problem domain. I believe that streamlining the mirroring process will provide greater gains for less effort. That's not to say that pursuing other efficiencies isn't worthwhile, just that you need to prioritize. But what the hell do I know. I don't run a *CPAN* mirror, so I must be freaking clueless... --Arthur Corliss Live Free or Die
RE: Trimming the CPAN - Automatic Purging
On Fri, 26 Mar 2010, Arthur Corliss wrote: But what the hell do I know. I don't run a *CPAN* mirror, so I must be freaking clueless... It's not about what you know, but about what you are willing to do yourself. At some point you have to accept that the people who *do* the work decide *how* they do it. There is not much point in just talking to volunteers that they should not be doing something but instead be doing something else if you are not willing to take the burden of doing this other thing yourself. Volunteers are not free labor that the talking masses can direct with majority votes. :) Cheers, -Jan
Re: Trimming the CPAN - Automatic Purging
On Mar 26, 2010, at 8:23 PM, Arthur Corliss wrote: Sure, I don't run a CPAN mirror, but I do manage many, many terabytes of storage as part of my day job. I think it's a tad presumptuous to disregard input just because we're not in your inner sanctum. As I mentioned in a follow up e-mail: this is simply a matter of selecting the correct problem domain. I believe that streamlining the mirroring process will provide greater gains for less effort. That's not to say that pursuing other efficiencies isn't worthwhile, just that you need to prioritize. But what the hell do I know. I don't run a *CPAN* mirror, so I must be freaking clueless... Oh, don't be such a drama queen. I rebuilt and helped run nic.funet.fi for 2 years which is the canonical mirror for a large number of mirrors and the perspective of having a few terabytes spinning in storage changes quite dramatically when you are actually serving a few terabytes to thousands of clients. CPAN grew to be quite a burden on the site not only because of the high demand, but also because of the multitude of small files and I'm sure other mirrors feel similarly burdened. The sort of pruning Tim brought up has long been an idea, but with the current and growing size of the archive, something does need to be done to alleviate the burden not only on the canonical mirrors, but also on the random folks who want to grab a local mirror for themselves. In my present work environment, 12GB isn't a lot of disk space, but it's a lot considering I don't need to install perl modules daily and the vast majority of it I'll likely never use. It would be a kindness to both the mirror operators and to the end-users to trim it down to a manageable size. As for efficiency, rsync remains a good tool for the job that works on nearly every platform which is a rather tall order to match with any other solution.
Relegating the cruft to BackPAN to make the current CPAN slimmer and less demanding on all fronts is an idea that would be welcomed by more than just mirror ops. The only snag I can foresee in trimming back on the abundance of modules is the case where some modules have version requirements for other modules where it will barf with a mismatch/newer version of the required module (I bumped into this recently but can't remember exactly which module it was) but I think it's rare and the practice should be discouraged. e.
Re: Trimming the CPAN - Automatic Purging
--- On Thu, 25/3/10, David Golden xda...@gmail.com wrote: I don't think it's a good idea to make it hard for people to find older versions of a distribution -- where hard means have to track it down on backpan. (Though we could make clients better about it, I suppose.) I don't have a particular opinion about this, but this issue could be mitigated if CPAN linked to the backpan. Cheers, Ovid -- Buy the book - http://www.oreilly.com/catalog/perlhks/ Tech blog- http://blogs.perl.org/users/ovid/ Twitter - http://twitter.com/OvidPerl Official Perl 6 Wiki - http://www.perlfoundation.org/perl6
Re: Trimming the CPAN - Automatic Purging
On Thu, Mar 25, 2010 at 11:12:32AM +, Tim Bunce wrote: Currently on PAUSE you have to explicitly delete old uploads. Which often is a good thing. While BACKPAN exists, it isn't somewhere that many go to look for old distributions. For me and probably others, BACKPAN-only distributions are ones that have been specifically marked by the maintainers as obsolete, badly broken or similar. Automatic deletes from CPAN would change that. There are many distributions on CPAN where older versions work on a particular perl/os, but more recent ones don't. Latest isn't necessarily the greatest. If you are going to perform this then it should really feed off the CPAN Testers to know if a specific release has been marked as being the latest working release for a particular perl/os. I would also suggest extending the timeframe considerably to perhaps 3 or maybe 5 years. Lastly I would also personally be annoyed if only the latest versions were available, as I often make great use of the diff tool on search.cpan.org. Having only the latest version renders that great tool redundant :( Files selected in this way would be scheduled to be deleted in a month and an email would be sent to the authors, just as if they'd selected the files for deletion via PAUSE. There are already many authors who have non-responding email addresses (I will get around to publicising that list at some point), so some will likely disappear down a blackhole. What if you're about to delete a set of distributions that should really be kept available? No one would be listening to know that it should still be kept. I would prefer a suggestion email to authors to delete, rather than an email telling them that their distributions will be deleted unless they do something. Cheers, Barbie. -- Birmingham Perl Mongers http://birmingham.pm.org Memoirs Of A Roadie http://barbie.missbarbell.co.uk CPAN Testers Blog http://blog.cpantesters.org YAPC Conference Surveys http://yapc-surveys.org
Re: Trimming the CPAN - Automatic Purging
On Mar 25, 2010, at 8:42 AM, Barbie wrote: Lastly I would also personally be annoyed if only the latest versions were available, as I often make great use of the diff tool on search.cpan.org. Having only the latest version renders that great tool redundant :( I use that too :-) and it is very annoying that some authors automatically delete previous releases when they upload a new one. Graham.
Re: Trimming the CPAN - Automatic Purging
On Thu, Mar 25, 2010 at 01:42:58PM +, Barbie wrote: There are many distributions on CPAN that older versions work on a particular perl/os, but more recent ones don't. Latest isn't necessarily the greatest. If you are going to perform this then it should really feed off the CPAN Testers to know if a specific release has been marked as being the latest working release for a particular perl/os. You just described cpXXXan: http://cpxxxan.barnyard.co.uk/ -- David Cantrell | Bourgeois reactionary pig You know you're getting old when you fancy the teenager's parent and ignore the teenager -- Paul M in uknot
Re: Trimming the CPAN - Automatic Purging
What Jarkko said. On Mar 25, 2010, at 08:00, Jarkko Hietaniemi wrote: I have one case where the v1 and v2 of a module are simply incompatible, but v1 still works, and unless the users have a compelling reason, they won't migrate. Pulling the rug from under them would be quite unsportsmanlike. Deletion should be opt-in, and there should be a way to pin some releases as unreapable. And warning emails (yes, some email addresses are blackholes) to the author well in advance: your module X version Y will be deleted as you requested in Z weeks because there are P newer releases ... -- There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen -- Chris Nandor pu...@pobox.com http://pudge.net/ Slashdot / Geeknet pu...@slashdot.org http://slashdot.org/
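The opt-in policy described above (authors opt in, some releases can be pinned, and warnings go out well in advance) might look something like this sketch. Nothing here reflects how PAUSE works; `plan_reaping`, `notice_text`, and the `min_newer` and `weeks` defaults are assumptions chosen only to illustrate the idea.

```python
def plan_reaping(versions, opted_in, pinned=(), min_newer=3, weeks=4):
    """Pick releases of one distribution that may be scheduled for deletion.

    versions: version strings ordered oldest first.
    opted_in: whether the author asked for automatic deletion at all.
    pinned:   versions the author marked as unreapable.
    Returns a list of (version, newer_release_count, weeks_of_notice).
    """
    if not opted_in:
        return []  # deletion is opt-in: do nothing by default
    plan = []
    for index, version in enumerate(versions):
        newer = len(versions) - index - 1
        if version in pinned or newer < min_newer:
            continue
        plan.append((version, newer, weeks))
    return plan


def notice_text(module, version, newer, weeks):
    """Warning mail body, modelled on the wording suggested in the thread."""
    return (
        f"Your module {module} version {version} will be deleted as you "
        f"requested in {weeks} weeks because there are {newer} newer releases."
    )
```

For the incompatible v1/v2 case mentioned above, the author would either not opt in or pin the last v1 release, so it would never appear in the plan.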
Re: Trimming the CPAN - Automatic Purging
On 25 Mar 2010, at 15:36, Chris Nandor wrote: I like that solution better [snip] But solution to what? Are we convinced there's actually a problem here? -- Andy Armstrong, Hexten
Re: Trimming the CPAN - Automatic Purging
On Mar 25, 2010, at 8:38, Andy Armstrong wrote: I like that solution better [snip] But solution to what? Are we convinced there's actually a problem here? CPAN has almost 200k files. www.cpan.org says there are 17627 modules. rsyncing a gazillion files doesn't work that well (on the server). Helping authors remember to delete things that are now irrelevant from the main CPAN system will make it easier to run mirrors and keep them fresh. - ask
Re: Trimming the CPAN - Automatic Purging
On Mar 25, 2010, at 8:38, Andy Armstrong wrote: I like that solution better [snip] But solution to what? Are we convinced there's actually a problem here? CPAN has almost 200k files. www.cpan.org says there are 17627 modules. rsyncing a gazillion files doesn't work that well (on the server). Helping authors remember to delete things that are now irrelevant from the main CPAN system will make it easier to run mirrors and keep them fresh. - ask So the problem is not a 'purging' problem (which a few confused with deleting modules) but more a synchronization problem between the CPAN mirrors. I think we all agree that all modules should be kept safely somewhere but only a few need to be synchronized to all the mirrors. cpan, cpanp and cpanm (others?) could have an older_versions_url_list that would be used if the module version is not part of what the author/community wants to be mirrored. Very old versions are, I think, seldom asked for (something that would need figures to confirm). Also, I'd bet that 95% of Perl users don't know what BACKPAN is. Nadim.
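The client-side fallback suggested above could be sketched as follows. This is not how any existing CPAN client behaves: `candidate_urls` and `fetch_with_fallback` are hypothetical names, the `fetch` callable is injected so the example needs no network access, and the URLs used with it should be treated as placeholders.

```python
def candidate_urls(dist_path, mirror_url, older_versions_url_list):
    """Regular mirror first, then each archive of older versions in order."""
    bases = [mirror_url, *older_versions_url_list]
    return [base.rstrip("/") + "/" + dist_path.lstrip("/") for base in bases]


def fetch_with_fallback(dist_path, mirror_url, older_versions_url_list, fetch):
    """Try each candidate URL until one yields the file.

    fetch(url) is assumed to return the file contents as bytes,
    or None when the file is not present at that URL.
    """
    for url in candidate_urls(dist_path, mirror_url, older_versions_url_list):
        data = fetch(url)
        if data is not None:
            return url, data
    raise LookupError(f"{dist_path} not found on any configured archive")
```

With this shape, a trimmed mirror keeps serving current releases, and a request for an old version only reaches the archive list when the mirror no longer carries it.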
Re: Trimming the CPAN - Automatic Purging
On Thu, Mar 25, 2010 at 02:08:45PM -0700, Geoffrey Broadwell wrote: Forgive a lurker, but wasn't that the point of this: http://search.cpan.org/~andk/File-Rsync-Mirror-Recent-0.0.7/ When I saw that announced, I remember thinking Yay, large archive rsync problem solved! Did it not work out? It currently supports all the fast CPAN mirrors. The CPAN Testers mirror is currently 10 seconds behind PAUSE :) Cheers, Barbie. -- Birmingham Perl Mongers http://birmingham.pm.org Memoirs Of A Roadie http://barbie.missbarbell.co.uk CPAN Testers Blog http://blog.cpantesters.org YAPC Conference Surveys http://yapc-surveys.org
Re: Trimming the CPAN - Automatic Purging
On Mar 25, 2010, at 13:23, Eric Wilhelm wrote: Maybe CPAN mirrors are more easily updated than via a generic rsync? Is the burden only network/cpu for checking whether a bunch of old archives have changed, or does disk matter? Most CPAN mirrors use rsync. It's not realistic to make them change that (Hello all mirror operators -- so that tool that you use for ALL YOUR MIRRORS; well ... maybe you can use something else for us?). rsync is all disk i/o -- relatively negligible network and CPU. - ask
Re: Trimming the CPAN - Automatic Purging
What he said. Most people don't mirror CPAN. They mirror many things. This is the same reason we've struggled with statistics. How do you ask someone mirroring three dozen different things to put in a special log-munging tool just for us? Adam K On Fri, Mar 26, 2010 at 10:55 AM, Ask Bjørn Hansen a...@perl.org wrote: Most CPAN mirrors use rsync. It's not realistic to make them change that (Hello all mirror operators -- so that tool that you use for ALL YOUR MIRRORS; well ... maybe you can use something else for us?).