Re: Compressed Pristines (Design Doc)

Philip Martin Thu, 22 Mar 2012 16:00:46 -0700

Erik Huelsmann <ehu...@gmail.com> writes:

> As the others, I'm surprised we seem to be going with a custom file format.
> You claim source files are generally small in size and hence only small
> benefits can be had from compressing them, if at all, due to the fact that
> they would be of sub-block size already.


I was surprised too, so I looked at GCC where a trunk checkout has
75,000 files of various types:

$ find .svn/pristine -type f | wc -l
75192

Uncompressed:

$ du -hs .svn/pristine
635M    .svn/pristine
$ find .svn/pristine -type f | xargs ls -ls | awk '{tot += $1} END {print tot}'
641536

Individually compressed is smaller by a factor of 2:

$ find .svn/pristine -type f | xargs gzip
$ du -hs .svn/pristine
367M    .svn/pristine
$ find .svn/pristine -type f | xargs ls -ls | awk '{tot += $1} END {print tot}'
365624

As one single file is smaller by another factor of 3:

$ find .svn/pristine -type f | xargs cat >> one-big-file
$ du -hs one-big-file
122M    one-big-file
$ ls -ls one-big-file | awk '{print $1}'
124516

When individually compressed most of the 75,000 files are less
than 4K:

$ find .svn/pristine -size -4096c | wc -l
71571

more than half are less than 1K:

$ find .svn/pristine -size -1024c | wc -l
53707

and nearly half are less than 0.5K:

$ find .svn/pristine -size -512c | wc -l
36521

In the uncompressed state:

62323 are less than 4K
36648 are less than 1K
21828 are less than 0.5K

Maybe GCC is not typical but, rather to my surprise, combining the
compressed files would be a significant improvement.

I also have an httpd trunk checkout (needs cleanup so bigger than
normal):

90M uncompressed
37M individually compressed
23M as one big file

That's more like your figures for Subversion where the major step is
individual compression.

-- 
uberSVN: Apache Subversion Made Easy
http://www.uberSVN.com

Re: Compressed Pristines (Design Doc)

Reply via email to