Scott Dial wrote:
> On 6/30/2010 2:53 PM, Barry Warsaw wrote:
>> It might be amazing, but it's still a significant overhead.  As I've
>> described, multiply that by all the py files in all the distro packages
>> containing Python source code, and then still try to fit it on a CDROM.
>
> I decided to prove to myself that it was not a significant issue to have
> parallel directory structures in a .tar.bz2, and I was surprised to find
> it much worse at that then I had imagined. For example,
>
> # cd /usr/lib/python2.6/site-packages
> # tar --exclude="*.pyc" --exclude="*.pyo" \
>       -cjf mercurial.tar.bz2 mercurial
> # du -h mercurial.tar.bz2
> 640K mercurial.tar.bz2
>
> # cp -a mercurial mercurial2
> # tar --exclude="*.pyc" --exclude="*.pyo" \
>       -cjf mercurial2.tar.bz2 mercurial mercurial2
> # du -h mercurial.tar.bz2
> 1.3M mercurial2.tar.bz2
>
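For anyone who wants to poke at the quoted experiment without tar, here is a rough Python sketch. It compresses a payload once and then twice back-to-back, with both bz2 and LZMA, and reports the sizes. Everything in it is an assumption for illustration: make_payload() and compare() are made-up names, the synthetic text only stands in for a real source tree, and the lzma module's default preset (Python 3.3+) is not necessarily the same as 7zip's 'normal' setting.

```python
import bz2
import lzma
import random


def make_payload(n_words=150_000, seed=0):
    """Build roughly 1.5 MB of repetitive but non-trivial ASCII text.

    Hypothetical stand-in for a source tree: it only needs to be
    larger than one bz2 block (900 kB at the default level 9).
    """
    rng = random.Random(seed)
    vocab = [f"name_{i}" for i in range(2000)]
    return " ".join(rng.choices(vocab, k=n_words)).encode("ascii")


def compare(payload):
    """Return {codec: (size of one copy, size of two copies)} in bytes."""
    doubled = payload + payload  # like archiving mercurial and mercurial2
    return {
        "bz2": (len(bz2.compress(payload)), len(bz2.compress(doubled))),
        "lzma": (len(lzma.compress(payload)), len(lzma.compress(doubled))),
    }


if __name__ == "__main__":
    for codec, (single, double) in compare(make_payload()).items():
        print(f"{codec}: one copy {single} bytes, two copies {double} bytes, "
              f"ratio {double / single:.2f}")
```

On data like this I would expect the bz2 ratio to come out near 2 and the LZMA ratio near 1, but the exact numbers depend on the payload, so treat it as a way to explore the effect rather than a benchmark.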
I believe the standard (and largest) block size for .bz2 is 900kB, and I
*think* that is measured on the uncompressed data. Though I know that bz2
can chain blocks, since it can compress all-NULL data extremely well
(multiple GB down to kB, IIRC).

There was a question as to whether LZMA would do better here. I'm using
7zip, but .xz should perform similarly.

$ du -sh mercurial*
2.6M    mercurial
2.6M    mercurial2
366K    mercurial.tar.bz2
734K    mercurial2.tar.bz2
303K    mercurial.7z
310K    mercurial2.7z

So LZMA with the 'normal' compression setting has a big enough window to
find almost all of the redundancy: 310kB is a very small increase over
303kB. bz2 clearly does not, since 734kB is actually slightly more than
2x 366kB. With 2.6MB per copy, the second copy starts well past the end of
any single 900kB block, so bz2 never sees the two copies side by side.

John
=:->