Jeff Shannon wrote:

Chris Lasher wrote:

And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.


This 105MB FASTA file is 8.3 MB gzip-ed.


And a 4:1 packed-format file would be ~26MB. It'd be interesting to see how that packed-format file would compress, but I don't care enough to write a script to convert the FASTA file into a packed-format file to experiment with... ;)

If your compression algorithm's any good then both, when compressed, should be approximately equal in size, since the size should be determined by the information content rather than the representation.

Short version, then, is that yes, size concerns (such as they may be) are outweighed by speed and conceptual simplicity (i.e. avoiding a huge mess of bit-masking every time a single base needs to be examined, or a human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent, which means you need at least three bits per base, which means only two bases per byte (or else base-encodings split across byte-boundaries).... That gets ugly real fast.)

Right!

regards
 Steve
--
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119
--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to