Jacob Berv noted:

> I noticed today that the compression ratio for an interleaved phylip file
> (zip compressed) was about 84:1 (390 MB uncompressed -> 4.6 MB compressed),
> whereas the compression ratio for the same data non-interleaved was a much
> worse 3.4:1 (390 MB uncompressed -> 113.9 MB). Not knowing much about how zip
> compression actually works, I thought this might be an interesting
> observation for the group…


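Interleaved files are written in blocks of (say) 50 bases per line,
so successive lines hold the same alignment columns for different
sequences.  A line may therefore repeat the previous line's block
exactly, or nearly so.  I wonder whether that is what makes the
interleaved format easier to compress.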

I would guess that the compressibility of interleaved sequences would
be highest when the sequences are closely related.  In that case there
would be 50-base blocks of nearly identical sequences.  With less
closely related sequences the compressibility should be much lower.
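
A rough way to check this guess is to simulate it.  The sketch below
is only an illustration on synthetic data, not a measurement on the
file discussed above.  It uses Python's zlib module, which implements
DEFLATE, the method zip normally uses.  The function names, the taxon
names, the simplified PHYLIP-like layout, and the parameter values are
all made up for the example.  The sequence length is deliberately set
above DEFLATE's 32 KiB look-back window; that this window is what
hurts the non-interleaved layout is an assumption of the sketch, not
something established in this thread.

```python
import random
import zlib

BLOCK = 50  # bases per interleaved line, as in the "(say) 50" above


def simulate(n_taxa, length, divergence, seed=1):
    """Return n_taxa sequences, each an independently mutated copy of one
    random ancestor. divergence=1.0 gives effectively unrelated sequences."""
    rng = random.Random(seed)
    ancestor = rng.choices("ACGT", k=length)
    seqs = []
    for _ in range(n_taxa):
        seq = [rng.choice("ACGT") if rng.random() < divergence else base
               for base in ancestor]
        seqs.append("".join(seq))
    return seqs


def sequential(seqs):
    """Simplified non-interleaved layout: name, then the whole sequence."""
    header = f"{len(seqs)} {len(seqs[0])}\n"
    return header + "".join(f"taxon{i:<5d}{s}\n" for i, s in enumerate(seqs))


def interleaved(seqs):
    """Simplified interleaved layout: BLOCK bases per taxon per block,
    names only in the first block, blank line between blocks."""
    header = f"{len(seqs)} {len(seqs[0])}\n"
    blocks = []
    for start in range(0, len(seqs[0]), BLOCK):
        lines = []
        for i, s in enumerate(seqs):
            name = f"taxon{i:<5d}" if start == 0 else ""
            lines.append(name + s[start:start + BLOCK])
        blocks.append("\n".join(lines) + "\n")
    return header + "\n".join(blocks)


def ratio(text):
    """Uncompressed size divided by DEFLATE-compressed size."""
    raw = text.encode("ascii")
    return len(raw) / len(zlib.compress(raw))


def compare(divergence, n_taxa=20, length=40_000):
    """Return (sequential ratio, interleaved ratio) for simulated data."""
    seqs = simulate(n_taxa, length, divergence)
    return ratio(sequential(seqs)), ratio(interleaved(seqs))


if __name__ == "__main__":
    for label, div in [("closely related", 0.01), ("unrelated", 1.0)]:
        seq_r, int_r = compare(div)
        print(f"{label}: sequential {seq_r:.1f}:1, interleaved {int_r:.1f}:1")
```

Running it prints the two ratios for a closely related and an
unrelated data set.  The absolute numbers depend on the made-up
parameters, so only the direction of the difference is of interest.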

Joe
----
Joe Felsenstein         j...@gs.washington.edu
 Department of Genome Sciences and Department of Biology,
 University of Washington, Box 355065, Seattle, WA 98195-5065 USA

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
