Re: Using a better compression than .gz for one's CPAN modules
* Shlomi Fish shlo...@iglu.org.il [2010-11-26 22:05]:
> In any case, regardless of its age, xz does tend to compress better than bz2 and should also be faster.

I know. I heard of it quite early and switched from bzip2 to xz for my database dumps and mail archives. That’s not the point of the quote though. New things are always better in some way. Why else would anyone make them? But things always exist in a broader context and it is rarely so straightforward to find any of them superior on that level.

> That put aside, sticking with an older solution may be preferable due to the better adoption ratios mentioned by David and others, but to quote George Bernard Shaw: The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. ( http://en.wikiquote.org/wiki/George_Bernard_Shaw ).

I agree with the notion. But let me ask how much pressure changing the compression format on CPAN would exert on the world to adapt itself to it. Note too the quote is written from the perspective of the world: no mention goes to the fortunes of the unreasonable man himself…

Regards,
-- Aristotle Pagaltzis // http://plasmasturm.org/
Re: Using a better compression than .gz for one's CPAN modules
On Sun, Nov 28, 2010 at 4:22 PM, Aristotle Pagaltzis pagalt...@gmx.de wrote: I agree with the notion. But let me ask how much pressure changing the compression format on CPAN would exert on the world to adapt itself to it. Note too the quote is written from the perspective of the world: no mention goes to the fortunes of the unreasonable man himself… I'm not sure which side you're arguing with that. Here's how I see it: allowing a new compression format means that someone will inevitably release a distribution with it that someone will try to install with an older toolchain that won't handle it. Based on my prior experience with other such issues, a large portion of the bug reports, complaints, nasty personal comments and what not will accrue to the toolchain and its maintainers and not the author who released the not-backwards-compatible distribution. Thus, I have no personal incentive as a toolchain co-maintainer to do the work, since the only thing I'll get back from it is a hassle. And since only when a significant fraction of CPAN is released in that format will the compression benefits add up, the hassles come quick and the benefits aren't seen for a long time. On the other hand, if someone wants to recompress all of CPAN into XYZ compression scheme and release their own client that deals with it (or patch cpanm or whatever), then they can have the benefits (and any resulting hassles) themselves. -- David
Re: Using a better compression than .gz for one's CPAN modules
* David Golden xda...@gmail.com [2010-11-28 22:45]: On the other hand, if someone wants to recompress all of CPAN into XYZ compression scheme and release their own client that deals with it (or patch cpanm or whatever), then they can have the benefits (and any resulting hassles) themselves. And note that distributions which ship packages for CPAN modules are effectively already doing exactly that. Regards, -- Aristotle Pagaltzis // http://plasmasturm.org/
Re: Using a better compression than .gz for one's CPAN modules
* Shlomi Fish shlo...@iglu.org.il [2010-11-24 21:05]: Welcome to 2010. There are two kinds of fool. One says, “This is old, and therefore good.” And one says, “This is new, and therefore better.” —John Brunner Regards, -- Aristotle Pagaltzis // http://plasmasturm.org/
Re: Using a better compression than .gz for one's CPAN modules
On Fri, Nov 26, 2010 at 3:59 PM, Shlomi Fish shlo...@iglu.org.il wrote:
>> There are two kinds of fool. One says, “This is old, and therefore good.” And one says, “This is new, and therefore better.”
> That put aside, sticking with an older solution may be preferable due to the better adoption ratios mentioned by David and others, but to quote George Bernard Shaw: The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. ( http://en.wikiquote.org/wiki/George_Bernard_Shaw ).

Of the many places I choose to be unreasonable for the sake of progress, squeezing out a little bit more size reduction in tarballs is not where I'm going to be spending my energies. Cf. http://www.dagolden.com/index.php/1148/bootstrapping-cpan-pm-using-httplite/ as well as the abbreviated auto CPAN config and the CPAN mirror auto-selection in the current development series of CPAN.pm.

-- David
Re: Using a better compression than .gz for one's CPAN modules
On Wed, Nov 24, 2010 at 09:59:59PM +0200, Shlomi Fish wrote:
> On Friday 19 November 2010 22:02:48 David Cantrell wrote:
>> Even if it does, there's not much point. bzip2 support is nowhere near universal, and preventing lots of users from using your code would seem to be a poor trade-off for saving an insignificant number of bytes.
> One can easily install bzip2 to unpack the distribution ...

One can indeed easily install it. Unless one is a Windows user, or is on a platform which bzip2 doesn't support, or your workplace policies prevent you from installing it.

>> As for the others, I've never heard of them.
> .xz is http://en.wikipedia.org/wiki/Xz .

If I wanted to find out about them I could use Google. I have no interest in weirdo file formats.

> Welcome to 2010.

Social skills. You've no doubt heard of them.

-- David Cantrell | Official London Perl Mongers Bad Influence
You don't need to spam good porn
Re: Using a better compression than .gz for one's CPAN modules
On Sat, 20 Nov 2010 23:22:52 +0100, Aristotle Pagaltzis pagalt...@gmx.de said: It’s gonna be a lot of work to iron out the entire tool chain to support the newer formats; then it will take a lot of time until the work trickles out far enough that people could start relying on it. In the case of bzip2 I couldn't resist after having watched bzip2's acceptance for several years. So I prodded all toolchain authors to support bz2. It is now done and seems to work fine. For quite piddly gains, in absolute numbers. I really don’t see the point. Gzip is Good Enough. Agreed, but since bzip2 support is already done we can welcome it when people actually use it. -- andreas
Re: Using a better compression than .gz for one's CPAN modules
* Andreas J. Koenig andreas.koenig.7os6v...@franz.ak.mind.de [2010-11-22 09:20]: Agreed, but since bzip2 support is already done we can welcome it when people actually use it. I am unwilling to encourage it but I won’t argue if someone wants to use it. And it can be a win for distributions with very large bundled data files so one might as well use it for them since the support exists. I just don’t want to see a campaign against gzip. Regards, -- Aristotle Pagaltzis // http://plasmasturm.org/
Reducing rsync cost (was: Re: Using a better compression than .gz for one's CPAN modules)
On 19/11/2010 20:57, dhu...@hudes.org wrote: source code, even 100KLOC? Once you go to .gz you're already at better than 2:1. What are you going to save by going to even 3:1, 10Kbytes? compared to the nuisance inflicted, it's nothing. Over the entire CPAN archive, it'd be significant... I agree on the individual case it's probably not worth worrying about too much. But if it's easy to use .bz2 or something better it wouldn't hurt to get that word out. (And it may be worth making it easy, though I'm not sure about that.) Daniel T. Staal Disk space is cheap. Bandwidth is cheap. What's rough is the rsync between mirrors. Compressing to .bz2 won't help that: the stress is doing a stat on every single file in CPAN not the transfer. Work toward optimizing the mirror distribution instead of worrying about bz2 vs gz. Remember not Yeah, this is the killer. In an ideal world, we would kill the symlinks such as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These could be recreated via shell scripts locally on mirrors for people who wish to maintain these legacies. Cutting that out would diminish the rsync burden considerably. David -- There's bum trash in my hall and my place is ripped I've totaled another amp, I'm calling in sick
Re: Reducing rsync cost (was: Re: Using a better compression than .gz for one's CPAN modules)
On Mon, Nov 22, 2010 at 4:37 AM, David Landgren da...@landgren.net wrote: Yeah, this is the killer. In an ideal world, we would kill the symlinks such as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These could be recreated via shell scripts locally on mirrors for people who wish to maintain these legacies. Cutting that out would diminish the rsync burden considerably. David or re-engineer CPAN as a sqlite+FTSE database, and re-engineer the mirroring process as a database mirror via a TBD compact database diff protocol (I have no intention of doing any of this myself; good morning) -- It is merely a matter of persistence. -- Albert Camus
Re: Using a better compression than .gz for one's CPAN modules
* Shlomi Fish shlo...@gmail.com [2010-11-19 19:55]:
> here is a report on compressing Graph-Easy-0.70.tar with various compression methods:
> {{{
> shlomif:~/progs/perl/cpan/Graph/Easy/trunk/Graph-Easy/TEMP$ ls -l
> total 3420
> -rw-r--r-- 1 shlomif shlomif 2160640 Nov 14 22:20 Graph-Easy-0.70.tar
> -rw-r--r-- 1 shlomif shlomif  329197 Nov  5 12:24 Graph-Easy-0.70.tar.bz2
> -rw-r--r-- 1 shlomif shlomif  416916 Nov 14 22:23 Graph-Easy-0.70.tar.gz
> -rw-r--r-- 1 shlomif shlomif  270796 Nov 14 22:21 Graph-Easy-0.70.tar.lrz
> -rw-r--r-- 1 shlomif shlomif  312844 Nov  5 12:24 Graph-Easy-0.70.tar.xz
> }}}
> As one can see, there are significant savings in size (and bandwidth) by switching to .bz2 and .xz.

Where does one see that? I see some savings, but not significant ones. You drop from 2 MB to 400 kB by using gzip, then a further 100 to 150 kB by using more unusual compression programs. Just going to http://search.cpan.org/dist/Graph-Easy/ will pull down more data than you just saved. The initial savings is worthwhile, but the additional gains? The era of 28.8 modems is long past. (And even in areas where internet connectivity is bad, bandwidth is not the limiting factor. You go from cell phone with data plan to satellite internet to CD-ROMs delivered by truck: the scarce resource becomes latency, not the bandwidth at any one instant.)

Gzip has 100% installed base. Even bzip2 does way worse; it has 100% installed base if you are looking at Linux and the 386BSD family, but is way less commonplace elsewhere, esp. Windows. And the other tools are only just making inroads on Linux. How long until they’re as widespread as bzip2? How long until bzip2 is as widespread as gzip?

How large is the total CPAN archive – 10 GB? Re-compressing all of it now would yield a benefit of what, 3 GB? 4? Even 5 maybe? As Dave said, it fits on a thumb drive already. And we’re not even talking about re-compressing here, just about future support for new distributions.

It’s gonna be a lot of work to iron out the entire tool chain to support the newer formats; then it will take a lot of time until the work trickles out far enough that people could start relying on it. For quite piddly gains, in absolute numbers. I really don’t see the point. Gzip is Good Enough.

Regards,
-- Aristotle Pagaltzis // http://plasmasturm.org/
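For readers who want the figures behind "a further 100 to 150 kB", here is a minimal Python sketch that does the arithmetic on the byte counts from the quoted `ls -l` listing. The names `SIZES` and `savings_vs_gzip` are made up for this illustration; only the numbers come from the thread.

```python
# Sizes in bytes, copied from the `ls -l` listing quoted in the thread.
SIZES = {
    "tar": 2160640,
    "tar.gz": 416916,
    "tar.bz2": 329197,
    "tar.xz": 312844,
    "tar.lrz": 270796,
}


def savings_vs_gzip(sizes):
    """Return {format: (bytes saved vs .tar.gz, percent saved vs .tar.gz)}."""
    gz = sizes["tar.gz"]
    return {
        name: (gz - size, 100.0 * (gz - size) / gz)
        for name, size in sizes.items()
        if name not in ("tar", "tar.gz")
    }


if __name__ == "__main__":
    gz_saved = SIZES["tar"] - SIZES["tar.gz"]
    print(f"gzip saves {gz_saved} bytes over the raw tar")
    for name, (saved, pct) in savings_vs_gzip(SIZES).items():
        print(f"{name}: {saved} bytes smaller than .tar.gz ({pct:.1f}%)")
```

On these numbers gzip alone removes about 1.74 MB, while the step from .tar.gz to the other formats removes between roughly 88 kB (.bz2) and 146 kB (.lrz), which is the range under discussion.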
Re: Using a better compression than .gz for one's CPAN modules
While I completely agree with Aristotle, I wish to clarify that Solaris 10 and 11 ship with bzip2. I can't recall about Solaris 9, and as I recall this was not the case with 8 and earlier.
Using a better compression than .gz for one's CPAN modules
Hi all,

here is a report on compressing Graph-Easy-0.70.tar with various compression methods:

{{{
shlomif:~/progs/perl/cpan/Graph/Easy/trunk/Graph-Easy/TEMP$ ls -l
total 3420
-rw-r--r-- 1 shlomif shlomif 2160640 Nov 14 22:20 Graph-Easy-0.70.tar
-rw-r--r-- 1 shlomif shlomif  329197 Nov  5 12:24 Graph-Easy-0.70.tar.bz2
-rw-r--r-- 1 shlomif shlomif  416916 Nov 14 22:23 Graph-Easy-0.70.tar.gz
-rw-r--r-- 1 shlomif shlomif  270796 Nov 14 22:21 Graph-Easy-0.70.tar.lrz
-rw-r--r-- 1 shlomif shlomif  312844 Nov  5 12:24 Graph-Easy-0.70.tar.xz
}}}

As one can see, there are significant savings in size (and bandwidth) by switching to .bz2 and .xz. .lrz (see http://ck.kolivas.org/apps/lrzip/ ) yields even more in its ZPaq preset, but at the cost of longer compression and even decompression times, so it's not preferable.

My questions are:

1. Will the CPAN testing and downloading toolchain handle modules uploaded as .tar.bz2? (That is, allow one to install them, unpack them, etc.) How about .tar.xz?

2. Can I easily pack archives into tar.bz2 or tar.xz using Module-Build and/or Module-Install?

Regards,
-- Shlomi Fish

--
Shlomi Fish http://www.shlomifish.org/
Electrical Engineering studies. In the Technion. Been there. Done that. Forgot a lot. Remember too much.
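A comparison like the one above can be approximated without the command-line tools. The following is only a sketch using Python's standard `gzip`, `bz2` and `lzma` modules; the function name `compressed_sizes` and the sample input are assumptions for illustration, lrzip is left out because the standard library has no binding for it, and the byte counts will not match the listing exactly since tool versions and settings differ.

```python
import bz2
import gzip
import lzma
import sys


def compressed_sizes(data: bytes) -> dict:
    """Compress `data` in memory with each codec and return the sizes in bytes."""
    return {
        "raw": len(data),
        "gz": len(gzip.compress(data, compresslevel=9)),
        "bz2": len(bz2.compress(data, compresslevel=9)),
        "xz": len(lzma.compress(data, format=lzma.FORMAT_XZ, preset=9)),
    }


if __name__ == "__main__":
    # Pass the path of an uncompressed .tar file, e.g. Graph-Easy-0.70.tar.
    with open(sys.argv[1], "rb") as handle:
        payload = handle.read()
    for codec, size in compressed_sizes(payload).items():
        print(f"{codec:>4}: {size} bytes")
```

Reading the whole tarball into memory is fine for a distribution of a few megabytes; for much larger files a streaming approach would be more appropriate.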
Re: Using a better compression than .gz for one's CPAN modules
On Fri, Nov 19, 2010 at 1:53 PM, Shlomi Fish shlo...@gmail.com wrote:
> 1. Will the CPAN testing and downloading toolchain handle modules uploaded as .tar.bz2? (Allow to install them, unpack them, etc.) How about tar.xz?

.bz2, yes. .xz, possibly, but not reliably. CPANPLUS uses Archive::Extract, which can handle .xz if there are xz binaries installed. However, CPAN::DistnameInfo is the standard tool for identifying distribution metadata from a tarball filename and, last I checked, it doesn't support .xz extensions, so you'll confuse things that depend on it.

> 2. Can I easily pack archives into tar.bz2 or tar.xz using Module-Build and/or Module-Install?

Not natively. You would need to subclass make_tarball.

-- David
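To illustrate why an unrecognised extension "confuses things", here is a hypothetical Python sketch of a filename parser with a fixed suffix list. It is not the logic of CPAN::DistnameInfo, which is a Perl module with its own rules; `KNOWN_SUFFIXES` and `parse_distname` are invented names, and `.tar.xz` is left out on purpose to mimic a tool written before xz support.

```python
import re

# Suffixes this hypothetical parser recognises. ".tar.xz" is deliberately
# absent to mimic a tool that predates xz support.
KNOWN_SUFFIXES = ("tar.gz", "tar.bz2", "tgz", "zip")

_PATTERN = re.compile(
    r"^(?P<dist>.+)-(?P<version>v?\d[\d._]*)\.(?P<suffix>%s)$"
    % "|".join(re.escape(s) for s in KNOWN_SUFFIXES)
)


def parse_distname(filename):
    """Return (dist, version, suffix), or None if the name is not recognised."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    return match.group("dist"), match.group("version"), match.group("suffix")


if __name__ == "__main__":
    for name in ("Graph-Easy-0.70.tar.gz",
                 "Graph-Easy-0.70.tar.bz2",
                 "Graph-Easy-0.70.tar.xz"):
        print(name, "->", parse_distname(name))
```

Any indexer or client built on such a parser would see the `.tar.xz` upload as having no distribution name or version at all, even when the extraction step itself could cope with the format.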
Re: Using a better compression than .gz for one's CPAN modules
The savings for going to .bz2 over .gz for source code are fairly insignificant. We're talking about source code for a Perl module. Is your stuff tens of megabytes in size? That's a lot of code if so. I could understand if you were distributing a sizable database with your code, but source code, even 100 KLOC? Once you go to .gz you're already at better than 2:1. What are you going to save by going to even 3:1, 10 KB? Compared to the nuisance inflicted, it's nothing.
Re: Using a better compression than .gz for one's CPAN modules
On Fri, November 19, 2010 2:18 pm, dhu...@hudes.org wrote: The savings for going to .bz2 over .gz for source code are fairly insignificant. We're talking about source code for a perl module. Is your stuff tens of megabytes in size? That's a lot of code if so. I could understand if you were distributing a sizable database with your code but source code, even 100KLOC? Once you go to .gz you're already at better than 2:1. What are you going to save by going to even 3:1, 10Kbytes? compared to the nuisance inflicted, it's nothing. Over the entire CPAN archive, it'd be significant... I agree on the individual case it's probably not worth worrying about too much. But if it's easy to use .bz2 or something better it wouldn't hurt to get that word out. (And it may be worth making it easy, though I'm not sure about that.) Daniel T. Staal --- This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law. ---
Re: Using a better compression than .gz for one's CPAN modules
source code, even 100KLOC? Once you go to .gz you're already at better than 2:1. What are you going to save by going to even 3:1, 10Kbytes? compared to the nuisance inflicted, it's nothing. Over the entire CPAN archive, it'd be significant... I agree on the individual case it's probably not worth worrying about too much. But if it's easy to use .bz2 or something better it wouldn't hurt to get that word out. (And it may be worth making it easy, though I'm not sure about that.) Daniel T. Staal Disk space is cheap. Bandwidth is cheap. What's rough is the rsync between mirrors. Compressing to .bz2 won't help that: the stress is doing a stat on every single file in CPAN not the transfer. Work toward optimizing the mirror distribution instead of worrying about bz2 vs gz. Remember not everyone is on UNIX or UNIX-like: Windows users use CPAN also and AFAIK Windows doesn't understand .bz2 -- certainly not .xz . If it is desirable to achieve better disk space utilization filesystem-level dynamic compression is an option at the expense of additional CPU/memory resource for accessing the content (with the possible gain of more data from the IO channel by getting 2-3 blocks for a 1 block read). Overall, the past consensus has been that the rsync is the best available method but is a heavy burden on the systems. Work toward improvement was started by I think Andreas. I have to get a chance to look at that code...
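The point that mirroring cost is driven by the number of filesystem entries rather than by bytes can be checked on a local mirror. The sketch below is an assumption-laden illustration, not a measurement of rsync itself: `count_entries` is a made-up helper that walks a tree without following symlinks and tallies regular files, directories and symlinks, each of which a sync pass has to examine at least once.

```python
import os
import sys


def count_entries(root):
    """Walk `root` without following symlinks; return (files, dirs, symlinks)."""
    files = dirs = links = 0
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links += 1      # symlinks to files or directories
            elif os.path.isdir(path):
                dirs += 1
            else:
                files += 1
    return files, dirs, links


if __name__ == "__main__":
    # Pass the root of a local mirror tree as the first argument.
    n_files, n_dirs, n_links = count_entries(sys.argv[1])
    print(f"{n_files} files, {n_dirs} directories, {n_links} symlinks")
```

A better compression format leaves all three counts unchanged, which is why it does nothing for this particular cost.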
Re: Using a better compression than .gz for one's CPAN modules
On Fri, Nov 19, 2010 at 08:53:12PM +0200, Shlomi Fish wrote:
> here is a report on compressing Graph-Easy-0.70.tar with various compression methods:
> -rw-r--r-- 1 shlomif shlomif 416916 Nov 14 22:23 Graph-Easy-0.70.tar.gz
> -rw-r--r-- 1 shlomif shlomif 329197 Nov  5 12:24 Graph-Easy-0.70.tar.bz2
> -rw-r--r-- 1 shlomif shlomif 270796 Nov 14 22:21 Graph-Easy-0.70.tar.lrz
> -rw-r--r-- 1 shlomif shlomif 312844 Nov  5 12:24 Graph-Easy-0.70.tar.xz
> As one can see, there are significant savings in size (and bandwidth) by switching to .bz2 and .xz. .lrz (see http://ck.kolivas.org/apps/lrzip/ ) yields even more in its ZPaq preset, but at the cost of longer compression and even decompression times, so it's not preferable. My question is: 1. Will the CPAN testing and downloading toolchain handle modules uploaded as .tar.bz2? (Allow to install them, unpack them, etc.) How about tar.xz?

Even if it does, there's not much point. bzip2 support is nowhere near universal, and preventing lots of users from using your code would seem to be a poor trade-off for saving an insignificant number of bytes. The *backpan* is so small compared to modern storage that I don't bother with a minicpan any more, I just carry a backpan plus indices around with me all the time on a bit of plastic the size of a postage stamp.

As for the others, I've never heard of them.

FWIW, there are 166 bzip2 files in my backpan mirror, at least some of which have test results, so yes, the toolchain appears to work for them. The one I bothered to check is also indexed on search.cpan.org, so that important part of the toolchain appears to work with it too.

-- David Cantrell | Hero of the Information Age
Cum catapultae proscriptae erunt tum soli proscript catapultas habebunt
Re: Using a better compression than .gz for one's CPAN modules
On Fri, 19 Nov 2010 11:57 -0800, dhu...@hudes.org wrote:
> Disk space is cheap. Bandwidth is cheap. What's rough is the rsync between mirrors. Compressing to .bz2 won't help that: the stress is doing a stat on every single file in CPAN not the transfer. Work toward optimizing the mirror distribution instead of worrying about bz2 vs gz. Remember not everyone is on UNIX or UNIX-like: Windows users use CPAN also and AFAIK Windows doesn't understand .bz2 -- certainly not .xz .

Windows itself doesn't. I can't speak for any OTHER Perl distribution on Windows, but Strawberry Perl has been including modules that handle .bz2 since before the first .msi release in July 2009, and the 32-bit versions of the July 2010 release include .xz-handling modules as well, if I recall correctly. (I can't recall right now why they fail on 64-bit so far, but I know they do.)

--Curtis Jewell
--
Curtis Jewell csjew...@cpan.org http://csjewell.dreamwidth.org/ p...@csjewell.fastmail.us http://csjewell.comyr.org/perl/
Your random numbers are not that random -- perl-5.10.1.tar.gz/util.c
Strawberry Perl for Windows betas: http://strawberryperl.com/beta/
Re: Using a better compression than .gz for one's CPAN modules
On Fri, November 19, 2010 2:57 pm, dhu...@hudes.org wrote: Disk space is cheap. Bandwidth is cheap. What's rough is the rsync between mirrors. Compressing to .bz2 won't help that: the stress is doing a stat on every single file in CPAN not the transfer. Work toward optimizing the mirror distribution instead of worrying about bz2 vs gz. Remember not everyone is on UNIX or UNIX-like: Windows users use CPAN also and AFAIK Windows doesn't understand .bz2 -- certainly not .xz . Oh, agreed. Just saying that if it already works and doesn't cause problems, it's not a completely useless optimization. But it's definitely at the level of a micro-optimization, and worth about as much. Daniel T. Staal