On 19 Feb 2003 17:16:48 +1100, Glenn McGrath <[EMAIL PROTECTED]> wrote:

>I have looked at PPMd, I've read that it's good but it took me a while
>to work out how to use it to get maximum compression, here is a
>comparison.
>
>8921088 Packages
>2375680 Packages.gz
>1814528 Packages.bz2
>1335296 Packages.pmd
>
>That was done with "PPMd e -o16 -m256 -r2 Packages", it surprised me
>that the -m option (amount of memory to use) affects the compression
>ratio.
>
>That's a pretty big improvement over bz2.
I would guess that memory usage improving compression is a concept that
is widely unknown, so I filed a bug against ppmd:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=181665
I'm ccing the bug since I talk about the -m option in the next
paragraph.

There is a good paper on the PPMII algorithm (based on PPM), which ppmd
uses, at:
http://DataCompression.info/Miscellaneous/PPMII_DCC02.pdf
Basically, as I understand it, the more memory you allow a PPM
algorithm to use, the more of the file it can check for similar
contexts, and the better the prediction values it can come up with.
(The first sketch below shows a quick way to measure the effect of -m.)

Some interesting stuff about bzip2 from:
ftp://sources.redhat.com/pub/bzip2/docs/manual_1.html#SEC1
-----
bzip2 compresses files using the Burrows-Wheeler block-sorting text
compression algorithm, and Huffman coding. Compression is generally
considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.
-----

Also, when using bzip2, note that -9 does not select the best
compression (best as in smallest output); it selects the largest block
size, which *usually* gives the smallest compressed file. Sometimes a
different block size can actually beat -9. I'd have to test to be
sure; the case I was testing was several megabytes of IRC logs, which
are ASCII text with many repeating words and a defined structure. The
reference, fwiw:
ftp://sources.redhat.com/pub/bzip2/docs/manual_2.html#SEC6
-----
-1 (or --fast) to -9 (or --best)
Set the block size to 100 k, 200 k .. 900 k when compressing.
-----
Perhaps I should file a wishlist bug against bzip2 about this option's
documentation; --best could equally be read as "best guess at block
size", i.e. use the 900 k block size. People using bzip2 might also
need to be explicitly informed. Perhaps all bzip2'd files in Debian
should be checked to see what block size is best for them (the second
sketch below shows one way), with bugs filed against the few that
don't have optimal block sizes chosen. I might warn of a mass bug
filing on debian-devel if I decide to test all the relevant bz2 files.

It should also be noted that bzip2 may not be an optimal implementation
of the Burrows-Wheeler block-sorting algorithm. In
ftp://sources.redhat.com/pub/bzip2/docs/manual_4.html#SEC44
Julian Seward says: "It would be fair to say that the bzip2 format was
frozen before I properly and fully understood the performance
consequences of doing so."

The Archive Comparison Test (ACT) (usually available at
http://compression.ca ) has a comparison of archivers; its Index of
archivers lists the algorithm each one uses. I found that the best
compressors for size use the Prediction by Partial Match (PPM, which is
based on Markov chains) and Context Tree Weighting (CTW) algorithms.

Another thing to note when creating compressed *archives* is that the
order of the files may matter. It has been documented (I don't remember
where) that grouping files of similar types together can yield better
compression, and going through all the different ways of ordering the
files might yield optimum compression. I suspect this is how the 7-zip
( http://www.7-zip.org ) algorithm outperforms other zip algorithms
while still maintaining compatibility with all other zip archivers. A
program or script to put files into a chosen order for archiving (as a
list of names passed to something like tar) is something I have been
meaning to write; the last sketch below is a start.
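Since I keep saying "test it", here are three rough sketches (Python,
untested, so take them as sketches rather than finished tools). First,
the -m option: this re-runs Glenn's exact PPMd command at a few memory
sizes and reports the resulting .pmd sizes. It assumes a PPMd binary on
the PATH with the same syntax and output naming as in his example, and
it works on a copy of the input in case your build touches the
original:

    #!/usr/bin/env python
    # Sweep PPMd's -m (memory) option and compare compressed sizes.
    # Assumes "PPMd e -o16 -mN -r2 FILE" writes its output to FILE.pmd.
    import os
    import shutil
    import subprocess
    import tempfile

    def sweep_memory(path, mem_sizes=(16, 64, 128, 256)):
        for mem in mem_sizes:
            workdir = tempfile.mkdtemp()
            try:
                # Compress a copy in case PPMd alters the input file.
                copy = os.path.join(workdir, os.path.basename(path))
                shutil.copy(path, copy)
                subprocess.check_call(
                    ["PPMd", "e", "-o16", "-m%d" % mem, "-r2", copy])
                print("-m%-4d -> %9d bytes"
                      % (mem, os.path.getsize(copy + ".pmd")))
            finally:
                shutil.rmtree(workdir)

    if __name__ == "__main__":
        sweep_memory("Packages")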
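Second, the bzip2 block-size question. Python's bz2 module maps its
compresslevel argument (1-9) onto the same 100 k .. 900 k block sizes
as -1 .. -9, so brute-forcing the best block size for a given file is
only a few lines. This reads the whole file into memory, so it is only
suitable for reasonably sized inputs:

    #!/usr/bin/env python
    # Try every bzip2 block size (100 k .. 900 k) on one file and
    # report which gives the smallest result.
    import bz2
    import sys

    def best_blocksize(path):
        with open(path, "rb") as f:
            data = f.read()
        results = []
        for level in range(1, 10):   # levels 1..9, i.e. -1 .. -9
            size = len(bz2.compress(data, level))
            results.append((size, level))
            print("-%d: %9d bytes" % (level, size))
        best_size, best_level = min(results)
        print("smallest: -%d at %d bytes" % (best_level, best_size))

    if __name__ == "__main__":
        best_blocksize(sys.argv[1])

Something like this, pointed at the decompressed contents of each bz2
file in the archive, would be enough to find the few with suboptimal
block sizes.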
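Last, the ordering script itself, or at least the grouping half of it:
classify each file with file(1), fall back to the extension if that
fails, and print the names grouped by type, one per line, so the list
can be fed straight to GNU tar. The choice of file as the classifier is
just a first guess (more on that below):

    #!/usr/bin/env python
    # Print the given file names grouped by type, one per line,
    # e.g. for: ./group-files.py * | tar -czf grouped.tar.gz -T -
    import subprocess
    import sys

    def classify(path):
        # "file -b" prints a brief description such as "ASCII text".
        try:
            out = subprocess.check_output(["file", "-b", path])
            return out.decode("utf-8", "replace").strip()
        except (OSError, subprocess.CalledProcessError):
            # Fall back to the extension if file(1) is unusable.
            return path.rsplit(".", 1)[-1]

    def grouped(paths):
        # Sort by (type, name) so files of the same type are adjacent.
        return sorted(paths, key=lambda p: (classify(p), p))

    if __name__ == "__main__":
        for path in grouped(sys.argv[1:]):
            print(path)

(File names containing newlines would need tar's --null handling, but
that's a detail for the real version.)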
Going through all the possible orderings would be nice, but it may be
faster to just use something like file to group like files together,
which is what that last sketch does. file has quite a few bugs filed
against it, though ( http://bugs.debian.org/file ), so it may be far
from an optimal choice; still, it would be a good place for me to
start. It'd be nice if someone wrote such a script before me though. ;-)

Drew Daniels

