Re: Using a better compression than .gz for one's CPAN modules

2010-11-28 Thread Aristotle Pagaltzis
* Shlomi Fish shlo...@iglu.org.il [2010-11-26 22:05]:
 In any case, regardless of its age, xz does tend to compress
 better than bz2 and should also be faster.

I know. I heard of it quite early and switched from bzip2 to xz
for my database dumps and mail archives.

That’s not the point of the quote though. New things are always
better in some way. Why else would anyone make them? But things
always exist in a broader context and it is rarely so
straightforward to find any of them superior on that level.

 That put aside, sticking with an older solution may be
 preferable due to the better adoption ratios mentioned by David
 and others, but to quote George Bernard Shaw: The reasonable
 man adapts himself to the world; the unreasonable one persists
 in trying to adapt the world to himself. Therefore all progress
 depends on the unreasonable man.
 ( http://en.wikiquote.org/wiki/George_Bernard_Shaw ).

I agree with the notion. But let me ask how much pressure changing
the compression format on CPAN would exert on the world to adapt
itself to it. Note too the quote is written from the perspective
of the world: no mention goes to the fortunes of the unreasonable
man himself…

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Using a better compression than .gz for one's CPAN modules

2010-11-28 Thread David Golden
On Sun, Nov 28, 2010 at 4:22 PM, Aristotle Pagaltzis pagalt...@gmx.de wrote:
 I agree with the notion. But let me ask how much pressure changing
 the compression format on CPAN would exert on the world to adapt
 itself to it. Note too the quote is written from the perspective
 of the world: no mention goes to the fortunes of the unreasonable
 man himself…

I'm not sure which side you're arguing with that.

Here's how I see it: allowing a new compression format means that
someone will inevitably release a distribution with it that someone
will try to install with an older toolchain that won't handle it.
Based on my prior experience with other such issues, a large portion
of the bug reports, complaints, nasty personal comments and what not
will accrue to the toolchain and its maintainers and not the author
who released the not-backwards-compatible distribution.  Thus, I have
no personal incentive as a toolchain co-maintainer to do the work,
since the only thing I'll get back from it is a hassle.

And since only when a significant fraction of CPAN is released in that
format will the compression benefits add up, the hassles come quick
and the benefits aren't seen for a long time.

On the other hand, if someone wants to recompress all of CPAN into XYZ
compression scheme and release their own client that deals with it (or
patch cpanm or whatever), then they can have the benefits (and any
resulting hassles) themselves.

-- David


Re: Using a better compression than .gz for one's CPAN modules

2010-11-28 Thread Aristotle Pagaltzis
* David Golden xda...@gmail.com [2010-11-28 22:45]:
 On the other hand, if someone wants to recompress all of CPAN
 into XYZ compression scheme and release their own client that
 deals with it (or patch cpanm or whatever), then they can have
 the benefits (and any resulting hassles) themselves.

And note that distributions which ship packages for CPAN modules
are effectively already doing exactly that.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Using a better compression than .gz for one's CPAN modules

2010-11-26 Thread Aristotle Pagaltzis
* Shlomi Fish shlo...@iglu.org.il [2010-11-24 21:05]:
 Welcome to 2010.

There are two kinds of fool. One says,
“This is old, and therefore good.” And one says,
“This is new, and therefore better.”
   —John Brunner

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Using a better compression than .gz for one's CPAN modules

2010-11-26 Thread David Golden
On Fri, Nov 26, 2010 at 3:59 PM, Shlomi Fish shlo...@iglu.org.il wrote:
     There are two kinds of fool. One says,
     “This is old, and therefore good.” And one says,
     “This is new, and therefore better.”

 That put aside, sticking with an older solution may be preferable due to the
 better adoption ratios mentioned by David and others, but to quote George
 Bernard Shaw: The reasonable man adapts himself to the world; the
 unreasonable one persists in trying to adapt the world to himself. Therefore
 all progress depends on the unreasonable man. (
 http://en.wikiquote.org/wiki/George_Bernard_Shaw ).

Of the many places I choose to be unreasonable for the sake of
progress, squeezing out a little bit more size reduction in tar balls
is not where I'm going to be spending my energies.

Cf.
http://www.dagolden.com/index.php/1148/bootstrapping-cpan-pm-using-httplite/
as well as the abbreviated auto CPAN config and the CPAN Mirror
auto-selection in the current development series of CPAN.pm.

-- David


Re: Using a better compression than .gz for one's CPAN modules

2010-11-25 Thread David Cantrell
On Wed, Nov 24, 2010 at 09:59:59PM +0200, Shlomi Fish wrote:
 On Friday 19 November 2010 22:02:48 David Cantrell wrote:
  Even if it does, there's not much point.  bzip2 support is nowhere near
  universal, and preventing lots of users from using your code would seem
  to be a poor trade-off for saving an insignificant number of bytes.
 One can easily install bzip2 to unpack the distribution ...

One can indeed easily install it.  Unless one is a Windows user, or is
on a platform which bzip2 doesn't support, or your workplace policies
prevent you from installing it.

  As for the others, I've never heard of them.
 .xz is http://en.wikipedia.org/wiki/Xz .

If I wanted to find out about them I could use google.  I have no
interest in weirdo file formats.

 Welcome to 2010.

Social skills.  You've no doubt heard of them.

-- 
David Cantrell | Official London Perl Mongers Bad Influence

You don't need to spam good porn


Re: Using a better compression than .gz for one's CPAN modules

2010-11-22 Thread Andreas J. Koenig
 On Sat, 20 Nov 2010 23:22:52 +0100, Aristotle Pagaltzis pagalt...@gmx.de said:

   It’s gonna be a lot of work to iron out the entire tool chain to
   support the newer formats; then it will take a lot of time until
   the work trickles out far enough that people could start relying
   on it.

In the case of bzip2 I couldn't resist after having watched bzip2's
acceptance for several years. So I prodded all toolchain authors to
support bz2. It is now done and seems to work fine.

   For quite piddly gains, in absolute numbers.

   I really don’t see the point. Gzip is Good Enough.

Agreed, but since bzip2 support is already done we can welcome it when
people actually use it.

-- 
andreas


Re: Using a better compression than .gz for one's CPAN modules

2010-11-22 Thread Aristotle Pagaltzis
* Andreas J. Koenig andreas.koenig.7os6v...@franz.ak.mind.de [2010-11-22 09:20]:
 Agreed, but since bzip2 support is already done we can welcome
 it when people actually use it.

I am unwilling to encourage it but I won’t argue if someone wants
to use it. And it can be a win for distributions with very large
bundled data files so one might as well use it for them since the
support exists. I just don’t want to see a campaign against gzip.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Reducing rsync cost (was: Re: Using a better compression than .gz for one's CPAN modules)

2010-11-22 Thread David Landgren

On 19/11/2010 20:57, dhu...@hudes.org wrote:

 source code, even 100KLOC? Once you go to .gz you're already at better
 than 2:1. What are you going to save by going to even 3:1, 10Kbytes?
 compared to the nuisance inflicted, it's nothing.

 Over the entire CPAN archive, it'd be significant...

 I agree on the individual case it's probably not worth worrying about too
 much.  But if it's easy to use .bz2 or something better it wouldn't hurt
 to get that word out.  (And it may be worth making it easy, though I'm not
 sure about that.)

 Daniel T. Staal

 Disk space is cheap. Bandwidth is cheap. What's rough is the rsync between
 mirrors. Compressing to .bz2 won't help that: the stress is doing a stat
 on every single file in CPAN not the transfer. Work toward optimizing the
 mirror distribution instead of worrying about bz2 vs gz.  Remember not


Yeah, this is the killer. In an ideal world, we would kill the symlinks 
such as authors/id/*, modules/by-category/*, modules/by-module/* and so 
on. These could be recreated via shell scripts locally on mirrors for 
people who wish to maintain these legacies. Cutting that out would 
diminish the rsync burden considerably.


David

--
There's bum trash in my hall and my place is ripped
I've totaled another amp, I'm calling in sick


Re: Reducing rsync cost (was: Re: Using a better compression than .gz for one's CPAN modules)

2010-11-22 Thread David Nicol
On Mon, Nov 22, 2010 at 4:37 AM, David Landgren da...@landgren.net wrote:
 Yeah, this is the killer. In an ideal world, we would kill the symlinks such
 as authors/id/*, modules/by-category/*, modules/by-module/* and so on. These
 could be recreated via shell scripts locally on mirrors for people who wish
 to maintain these legacies. Cutting that out would diminish the rsync burden
 considerably.

 David

or re-engineer CPAN as a sqlite+FTSE database, and re-engineer the
mirroring process as a database mirror via a TBD compact database diff
protocol (I have no intention of doing any of this myself; good
morning)

-- 
It is merely a matter of persistence. -- Albert Camus


Re: Using a better compression than .gz for one's CPAN modules

2010-11-20 Thread Aristotle Pagaltzis
* Shlomi Fish shlo...@gmail.com [2010-11-19 19:55]:
 here is a report on compressing Graph-Easy-0.70.tar with various
 compression methods:

 {{{
 shlomif:~/progs/perl/cpan/Graph/Easy/trunk/Graph-Easy/TEMP$ ls -l
 total 3420
 -rw-r--r-- 1 shlomif shlomif 2160640 Nov 14 22:20 Graph-Easy-0.70.tar
 -rw-r--r-- 1 shlomif shlomif  329197 Nov  5 12:24 Graph-Easy-0.70.tar.bz2
 -rw-r--r-- 1 shlomif shlomif  416916 Nov 14 22:23 Graph-Easy-0.70.tar.gz
 -rw-r--r-- 1 shlomif shlomif  270796 Nov 14 22:21 Graph-Easy-0.70.tar.lrz
 -rw-r--r-- 1 shlomif shlomif  312844 Nov  5 12:24 Graph-Easy-0.70.tar.xz
 }}}

 As one can see, there are significant savings in size (and
 bandwidth) by switching to .bz2 and .xz.

Where does one see that? I see some savings, but not significant
ones. You drop from 2 MB to 400 kB by using gzip, then a further
100 to 150 kB by using more unusual compression programs. Just
going to http://search.cpan.org/dist/Graph-Easy/ will pull down
more data than you just saved.

The initial savings is worthwhile, but the additional gains?
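
To put numbers on that, here is a minimal Python sketch that works
through the arithmetic using only the byte counts from the quoted
`ls -l` listing. Nothing is measured afresh, and the names `sizes` and
`saved_vs` are made up for illustration.

```python
# Byte counts copied from the ls -l listing for Graph-Easy-0.70.
sizes = {
    "tar": 2160640,
    "gz": 416916,
    "bz2": 329197,
    "xz": 312844,
    "lrz": 270796,
}


def saved_vs(baseline, fmt):
    """Bytes saved by using `fmt` instead of `baseline`."""
    return sizes[baseline] - sizes[fmt]


for fmt in ("gz", "bz2", "xz", "lrz"):
    ratio = sizes["tar"] / sizes[fmt]   # compression ratio against the raw tar
    extra = saved_vs("gz", fmt)         # additional gain over plain gzip
    print(f"{fmt:>3}: {ratio:4.1f}:1 vs. tar, {extra:>6} bytes smaller than gz")
```

By these figures gzip removes 1,743,724 bytes from the raw tar, while
the other formats remove a further 87,719 (bz2) to 146,120 (lrz) bytes.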

The era of 28.8 modems is long past. (And even in areas where
internet connectivity is bad, bandwidth is not the limiting
factor. You go from cell phone with data plan to satellite
internet to CD-ROMs delivered by truck: the scarce resource
becomes latency, not the bandwidth at any one instant.)

Gzip has 100% installed base. Even bzip2 does way worse; it has
100% installed base if you are looking at Linux and the 386BSD
family, but is way less commonplace elsewhere, esp. Windows. And
the other tools are only just making inroads on Linux. How long
until they’re as widespread as bzip2? How long until bzip2 is as
widespread as gzip?

How large is the total CPAN archive – 10 GB? Re-compressing all
of it now would yield a benefit of what, 3 GB? 4? Even 5 maybe?
As Dave said, it fits on a thumb drive already. And we’re not
even talking about re-compressing here, just about future support
for new distributions.
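
As a rough cross-check of that guess, the sketch below scales the
single Graph-Easy sample up to a 10 GB archive. Both inputs are
assumptions: the 10 GB total is the guess made above, and one
distribution is unlikely to be representative of all of CPAN. The
function name `projected_saving_gb` is illustrative only.

```python
# Assumption: the whole archive compresses like the one Graph-Easy sample,
# and the archive holds roughly 10 GB of .gz files.
ARCHIVE_GB = 10.0
GZ_BYTES = 416916
alternatives = {"bz2": 329197, "xz": 312844, "lrz": 270796}


def projected_saving_gb(fmt_bytes, archive_gb=ARCHIVE_GB):
    """Scale the per-file size reduction relative to gzip up to the archive."""
    return archive_gb * (1 - fmt_bytes / GZ_BYTES)


for fmt, size in alternatives.items():
    print(f"{fmt}: about {projected_saving_gb(size):.1f} GB saved")
```

Under those assumptions the projection comes out at roughly 2 to 3.5 GB,
the low end of the range guessed above.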

It’s gonna be a lot of work to iron out the entire tool chain to
support the newer formats; then it will take a lot of time until
the work trickles out far enough that people could start relying
on it.

For quite piddly gains, in absolute numbers.

I really don’t see the point. Gzip is Good Enough.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Using a better compression than .gz for one's CPAN modules

2010-11-20 Thread Dana Hudes
While I completely agree with Aristotle, I wish to clarify that Solaris 10 and
11 ship with bzip2. I can't recall about Solaris 9, and as far as I recall this
was not the case with 8 and earlier.


Using a better compression than .gz for one's CPAN modules

2010-11-19 Thread Shlomi Fish
Hi all,

here is a report on compressing Graph-Easy-0.70.tar with various
compression methods:

{{{
shlomif:~/progs/perl/cpan/Graph/Easy/trunk/Graph-Easy/TEMP$ ls -l
total 3420
-rw-r--r-- 1 shlomif shlomif 2160640 Nov 14 22:20 Graph-Easy-0.70.tar
-rw-r--r-- 1 shlomif shlomif  329197 Nov  5 12:24 Graph-Easy-0.70.tar.bz2
-rw-r--r-- 1 shlomif shlomif  416916 Nov 14 22:23 Graph-Easy-0.70.tar.gz
-rw-r--r-- 1 shlomif shlomif  270796 Nov 14 22:21 Graph-Easy-0.70.tar.lrz
-rw-r--r-- 1 shlomif shlomif  312844 Nov  5 12:24 Graph-Easy-0.70.tar.xz
}}}

As one can see, there are significant savings in size (and bandwidth)
by switching to .bz2 and .xz. .lrz (see
http://ck.kolivas.org/apps/lrzip/ ) yields even more in its ZPaq
preset, but at the cost of longer compression and even decompression
times, so it's not preferable. My question is:

1. Will the CPAN testing and downloading toolchain handle modules
uploaded as .tar.bz2?  (That is, install them, unpack them, etc.)  How
about tar.xz?

2. Can I easily pack archives into tar.bz2 or tar.xz using
Module-Build and/or Module-Install ?
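
For anyone who wants to repeat a comparison like the listing above
without installing extra tools, here is a minimal sketch using the
compressors in Python's standard library. It covers gzip, bzip2 and xz
only (lrzip is not in the standard library), uses each codec's default
settings, and the sample input is a stand-in rather than a real
distribution tarball.

```python
import bz2
import gzip
import lzma


def compare_sizes(data):
    """Return the compressed size in bytes for each stdlib codec."""
    return {
        "raw": len(data),
        "gz": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
        "xz": len(lzma.compress(data)),  # lzma defaults to the .xz container
    }


if __name__ == "__main__":
    # Stand-in input; to test a real tarball, read its bytes instead,
    # e.g. open("Some-Dist-1.0.tar", "rb").read() (hypothetical file name).
    sample = b"package Example; use strict; use warnings; 1;\n" * 2000
    for fmt, size in compare_sizes(sample).items():
        print(f"{fmt:>3}: {size:>8} bytes")
```

Results depend heavily on the input and on compression levels, so
numbers from this sketch will not match the command-line tools exactly.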

Regards,

-- Shlomi Fish
-- 
--
Shlomi Fish http://www.shlomifish.org/

Electrical Engineering studies. In the Technion. Been there. Done
that. Forgot a lot. Remember too much.


Re: Using a better compression than .gz for one's CPAN modules

2010-11-19 Thread David Golden
On Fri, Nov 19, 2010 at 1:53 PM, Shlomi Fish shlo...@gmail.com wrote:
 1. Will the CPAN testing and downloading toolchain handle modules
 uploaded as .tar.bz2?  (That is, install them, unpack them, etc.)  How
 about tar.xz?

.bz2, yes.  .xz, possibly, but not reliably.  CPANPLUS uses
Archive::Extract, which can handle .xz if there are xz binaries
installed.
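
The "if there are xz binaries installed" condition is the kind of thing
a client can probe for up front. The sketch below is written in Python
rather than Perl and is not how Archive::Extract does it. The command
names in `TOOLS` are the usual ones on Unix-like systems and should be
treated as assumptions.

```python
import shutil

# Usual command names on Unix-like systems; adjust for other platforms.
TOOLS = {".gz": "gzip", ".bz2": "bzip2", ".xz": "xz"}


def available_decompressors(tools=TOOLS):
    """Map each suffix to True if its command is found on PATH."""
    return {suffix: shutil.which(cmd) is not None
            for suffix, cmd in tools.items()}


if __name__ == "__main__":
    for suffix, found in available_decompressors().items():
        print(f"{suffix}: {'available' if found else 'missing'}")
```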

However, CPAN::DistnameInfo is the standard tool for identifying
distribution metadata from a tarball filename and last I checked, it
doesn't support .xz extensions, so you'll confuse things that depend
on it.
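
To illustrate why an unrecognized extension confuses such tools, here
is a hypothetical filename parser sketched in Python. It is not
CPAN::DistnameInfo's actual logic; the suffix list and the regular
expression are invented for the example.

```python
import re

# Hypothetical suffix list for illustration; not the list any real tool uses.
KNOWN_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".zip")


def split_dist(filename, suffixes=KNOWN_SUFFIXES):
    """Split 'Name-Version.suffix' into (name, version, suffix), or None."""
    for suffix in suffixes:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            # Name is everything before the last '-' that starts a version.
            match = re.fullmatch(r"(.+)-(v?\d[\w.]*)", stem)
            if match:
                return match.group(1), match.group(2), suffix
    return None


print(split_dist("Graph-Easy-0.70.tar.bz2"))  # ('Graph-Easy', '0.70', '.tar.bz2')
print(split_dist("Graph-Easy-0.70.tar.xz"))   # None
```

A parser built this way yields no name or version at all for the .xz
file, so anything downstream that relies on that metadata has nothing
to work with.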

 2. Can I easily pack archives into tar.bz2 or tar.xz using
 Module-Build and/or Module-Install ?

Not natively.  You would need to subclass make_tarball.

-- David


Re: Using a better compression than .gz for one's CPAN modules

2010-11-19 Thread dhudes
The savings for going to .bz2 over .gz for source code are fairly
insignificant.  We're talking about source code for a perl module.  Is
your stuff tens of megabytes in size? That's a lot of code if so. I could
understand if you were distributing a sizable database with your code but
source code, even 100KLOC? Once you go to .gz you're already at better
than 2:1. What are you going to save by going to even 3:1, 10Kbytes?
compared to the nuisance inflicted, it's nothing.



Re: Using a better compression than .gz for one's CPAN modules

2010-11-19 Thread Daniel Staal

On Fri, November 19, 2010 2:18 pm, dhu...@hudes.org wrote:
 The savings for going to .bz2 over .gz for source code are fairly
 insignificant.  We're talking about source code for a perl module.  Is
 your stuff tens of megabytes in size? That's a lot of code if so. I could
 understand if you were distributing a sizable database with your code but
 source code, even 100KLOC? Once you go to .gz you're already at better
 than 2:1. What are you going to save by going to even 3:1, 10Kbytes?
 compared to the nuisance inflicted, it's nothing.

Over the entire CPAN archive, it'd be significant...

I agree on the individual case it's probably not worth worrying about too
much.  But if it's easy to use .bz2 or something better it wouldn't hurt
to get that word out.  (And it may be worth making it easy, though I'm not
sure about that.)

Daniel T. Staal

---
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---



Re: Using a better compression than .gz for one's CPAN modules

2010-11-19 Thread dhudes
 source code, even 100KLOC? Once you go to .gz you're already at better
 than 2:1. What are you going to save by going to even 3:1, 10Kbytes?
 compared to the nuisance inflicted, it's nothing.

 Over the entire CPAN archive, it'd be significant...

 I agree on the individual case it's probably not worth worrying about too
 much.  But if it's easy to use .bz2 or something better it wouldn't hurt
 to get that word out.  (And it may be worth making it easy, though I'm not
 sure about that.)

 Daniel T. Staal

Disk space is cheap. Bandwidth is cheap. What's rough is the rsync between
mirrors. Compressing to .bz2 won't help that: the stress is doing a stat
on every single file in CPAN not the transfer. Work toward optimizing the
mirror distribution instead of worrying about bz2 vs gz.  Remember not
everyone is on UNIX or UNIX-like: Windows users use CPAN also and AFAIK
Windows doesn't understand .bz2 -- certainly not .xz .
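
To get a feel for that per-file cost on a given mirror, the Python
sketch below counts directory entries and bytes separately. It is only
a rough proxy: it does not model what rsync actually does, and the
name `tree_stats` is illustrative.

```python
import os
import tempfile


def tree_stats(root):
    """Count files, symlinks and bytes under root without following links."""
    stats = {"files": 0, "symlinks": 0, "bytes": 0}
    for dirpath, dirnames, filenames in os.walk(root):
        # Symlinks to directories show up in dirnames and are not descended.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                stats["symlinks"] += 1
            elif os.path.isfile(path):
                stats["files"] += 1
                stats["bytes"] += os.lstat(path).st_size
    return stats


if __name__ == "__main__":
    # Tiny throwaway tree so the sketch runs anywhere symlinks are allowed.
    with tempfile.TemporaryDirectory() as root:
        for name in ("A-1.0.tar.gz", "B-2.0.tar.gz"):
            with open(os.path.join(root, name), "wb") as fh:
                fh.write(b"x" * 100)
        os.symlink("A-1.0.tar.gz", os.path.join(root, "latest"))
        print(tree_stats(root))
```

Recompressing changes only the `bytes` figure; the number of entries
that must be examined on every sync stays the same.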

If better disk space utilization is desirable, filesystem-level dynamic
compression is an option, at the expense of additional CPU/memory
resources for accessing the content (with the possible gain of more data
from the IO channel by getting 2-3 blocks for a 1 block read).
Overall, the past consensus has been that rsync is the best available
method but is a heavy burden on the systems.
Work toward improvement was started by Andreas, I think. I have to get a
chance to look at that code...




Re: Using a better compression than .gz for one's CPAN modules

2010-11-19 Thread David Cantrell
On Fri, Nov 19, 2010 at 08:53:12PM +0200, Shlomi Fish wrote:

 here is a report on compressing Graph-Easy-0.70.tar with various
 compression methods:
 
 -rw-r--r-- 1 shlomif shlomif  416916 Nov 14 22:23 Graph-Easy-0.70.tar.gz

 -rw-r--r-- 1 shlomif shlomif  329197 Nov  5 12:24 Graph-Easy-0.70.tar.bz2
 -rw-r--r-- 1 shlomif shlomif  270796 Nov 14 22:21 Graph-Easy-0.70.tar.lrz
 -rw-r--r-- 1 shlomif shlomif  312844 Nov  5 12:24 Graph-Easy-0.70.tar.xz
 
 As one can see, there are significant savings in size (and bandwidth)
 by switching to .bz2 and .xz. .lrz (see
 http://ck.kolivas.org/apps/lrzip/ ) yields even more in its ZPaq
 preset, but at the cost of longer compression and even decompression
 times, so it's not preferable. My question is:
 
 1. Will the CPAN testing and downloading toolchain handle modules
 uploaded as .tar.bz2?  (That is, install them, unpack them, etc.)  How
 about tar.xz?

Even if it does, there's not much point.  bzip2 support is nowhere near
universal, and preventing lots of users from using your code would seem
to be a poor trade-off for saving an insignificant number of bytes.
The *backpan* is so small compared to modern storage that I don't bother
with a minicpan any more, I just carry a backpan plus indices around
with me all the time on a bit of plastic the size of a postage stamp.

As for the others, I've never heard of them.

FWIW, there are 166 bzip2 files in my backpan mirror, at least some of
which have test results, so yes, the toolchain appears to work for them.
The one I bothered to check is also indexed on search.cpan.org, so that
important part of the toolchain appears to work with it too.
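
A count like that can be reproduced on any local mirror with a short
script. The Python sketch below tallies files by archive suffix; the
suffix list is illustrative rather than exhaustive, and the mirror
path in the usage note is a placeholder.

```python
import os
from collections import Counter

SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".zip")  # illustrative only


def count_by_suffix(paths):
    """Tally paths by the first matching archive suffix."""
    counts = Counter()
    for path in paths:
        name = os.path.basename(path).lower()
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                counts[suffix] += 1
                break
        else:
            counts["other"] += 1  # indices, READMEs, unknown formats
    return counts


def walk_files(root):
    """Yield every file path under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            yield os.path.join(dirpath, name)
```

Usage would be along the lines of
`count_by_suffix(walk_files("/path/to/backpan"))`, with the path
replaced by wherever the mirror actually lives.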

-- 
David Cantrell | Hero of the Information Age

Cum catapultae proscriptae erunt tum soli proscripti catapultas habebunt


Re: Using a better compression than .gz for one's CPAN modules

2010-11-19 Thread Curtis Jewell


On Fri, 19 Nov 2010 11:57 -0800, dhu...@hudes.org wrote:
 Disk space is cheap. Bandwidth is cheap. What's rough is the rsync
 between
 mirrors. Compressing to .bz2 won't help that: the stress is doing a stat
 on every single file in CPAN not the transfer. Work toward optimizing the
 mirror distribution instead of worrying about bz2 vs gz.  Remember not
 everyone is on UNIX or UNIX-like: Windows users use CPAN also and AFAIK
 Windows doesn't understand .bz2 -- certainly not .xz .

Windows itself doesn't. I can't speak for any OTHER perl distribution on
Windows, but Strawberry Perl has been including modules that handle .bz2
since before the July 2009 first .msi release, and the 32-bit versions
of the July 2010 release include .xz-handling modules as well, if I recall
correctly (I can't recall right now why they fail on 64-bit so far, but
I know they do).

--Curtis Jewell
--
Curtis Jewell
csjew...@cpan.org   http://csjewell.dreamwidth.org/
p...@csjewell.fastmail.us   http://csjewell.comyr.org/perl/

Your random numbers are not that random -- perl-5.10.1.tar.gz/util.c

Strawberry Perl for Windows betas: http://strawberryperl.com/beta/



Re: Using a better compression than .gz for one's CPAN modules

2010-11-19 Thread Daniel Staal

On Fri, November 19, 2010 2:57 pm, dhu...@hudes.org wrote:

 Disk space is cheap. Bandwidth is cheap. What's rough is the rsync between
 mirrors. Compressing to .bz2 won't help that: the stress is doing a stat
 on every single file in CPAN not the transfer. Work toward optimizing the
 mirror distribution instead of worrying about bz2 vs gz.  Remember not
 everyone is on UNIX or UNIX-like: Windows users use CPAN also and AFAIK
 Windows doesn't understand .bz2 -- certainly not .xz .

Oh, agreed.  Just saying that if it already works and doesn't cause
problems, it's not a completely useless optimization.  But it's definitely
at the level of a micro-optimization, and worth about as much.

Daniel T. Staal
