Chris Lasher wrote:

And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.

This 105MB FASTA file is 8.3 MB gzip-ed.

And a 4:1 packed-format file would be ~26 MB (105 MB / 4), i.e. roughly three times the size of the gzipped FASTA file. It'd be interesting to see how that packed-format file would itself compress, but I don't care enough to write a script to convert the FASTA file into a packed-format file to experiment with... ;)
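For anyone who does want to run that experiment, here is a minimal sketch of what such a script might look like. The names (CODES, pack, compare) and the bit layout (first base in the two highest bits of each byte) are assumptions made for illustration, not anything from an existing packed format. It also assumes the input is a bare string of A/C/G/T, so FASTA header lines, line breaks, and ambiguity codes such as N would have to be stripped or handled first.

```python
import gzip

# Hypothetical 2-bit codes; any fixed assignment would do.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so base positions stay fixed.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def compare(seq):
    """Return (raw, gzipped raw, packed, gzipped packed) sizes in bytes."""
    raw = seq.encode("ascii")
    packed = pack(seq)
    return (len(raw), len(gzip.compress(raw)),
            len(packed), len(gzip.compress(packed)))
```

Calling compare() on the sequence data from a real file would give the four sizes side by side. The sketch does not record the sequence length, so a reader of the packed file could not tell padding from trailing A's without storing that separately.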


Short version, then, is that yes, size concerns (such as they may be) are outweighed by speed and conceptual simplicity (i.e. avoiding a huge mess of bit-masking every time a single base needs to be examined, or a human-(semi-)readable display is needed).
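To make the bit-masking point concrete, here is a rough sketch of what reading a single base out of a 4:1 packed buffer might involve. The layout (two bits per base, first base in the highest bits, codes in the order A, C, G, T) and the name base_at are assumptions for illustration only.

```python
BASES = "ACGT"  # assumed code order: A=0, C=1, G=2, T=3


def base_at(packed, i):
    """Return base number i from a buffer holding four bases per byte."""
    byte = packed[i // 4]          # which byte holds the base
    shift = 2 * (3 - i % 4)        # how far its two bits sit from the right
    return BASES[(byte >> shift) & 0b11]
```

With a plain character stream the same lookup is just seq[i], which is the simplicity being argued for.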

(Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent (U in addition to A, C, G and T), which means you need at least three bits per base, which means only two bases per byte with two bits wasted (or else base-encodings split across byte-boundaries).... That gets ugly real fast.)
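The arithmetic behind that can be checked with a few lines. This is only a sketch; the function name bases_per_byte is made up for illustration, and it assumes a fixed-width code with no symbol split across bytes.

```python
def bases_per_byte(alphabet_size):
    """Return (bits per symbol, whole symbols that fit in one byte)."""
    # Bits needed to give each of alphabet_size symbols a distinct code.
    bits = max(1, (alphabet_size - 1).bit_length())
    return bits, 8 // bits
```

For a four-letter alphabet this gives 2 bits and 4 bases per byte; for five letters it gives 3 bits and 2 bases per byte.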

Jeff Shannon
Technician/Programmer
Credit International

--
http://mail.python.org/mailman/listinfo/python-list
