Jeff Shannon wrote:
Chris Lasher wrote:
And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.
This 105MB FASTA file is 8.3 MB gzip-ed.
And a 4:1 packed-format file would be ~26MB. It'd be interesting to see
how that packed-format file would compress, but I don't care enough to
write a script to convert the FASTA file into a packed-format file to
experiment with... ;)
If your compression algorithm's any good then both, when compressed,
should be approximately equal in size, since the size should be
determined by the information content rather than the representation.
Short version, then, is that yes, size concerns (such as they may be)
are outweighed by speed and conceptual simplicity (i.e. avoiding a huge
mess of bit-masking every time a single base needs to be examined, or a
human-(semi-)readable display is needed).
(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries).... That gets
ugly real fast.)
Right!
regards
Steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119
--
http://mail.python.org/mailman/listinfo/python-list