Four bits per nucleotide would waste too much space. On the other hand, the
NCBI format (http://blast.wustl.edu/blast/ncbi20ntfmt.html) is too complex. I
am now thinking of something like this: encode in batches of 9 bytes. In each
batch, the first byte holds flags for the following 8 bytes. If its i-th bit is
1, the i-th data byte holds two 4-bit codes; otherwise it holds four 2-bit
codes. Assuming that degenerate symbols are relatively rare, this scheme costs
about 12.5% (one flag byte per 8 data bytes) on top of a plain 2-bit
representation of an ATCG sequence. A flag bit could also govern a pair of
bytes, or a larger group, to save even more space.
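
To make the batch layout concrete, here is a rough Python sketch of it. It is
only an illustration under assumptions that are not part of the proposal
above: the names (encode, decode, TWO_BIT, FOUR_BIT), the bit order within a
byte, the specific code values, the padding of a short final byte, and storing
the sequence length separately are all made up for the example. Input is
assumed to be upper-case IUPAC symbols.

```python
# Hypothetical code tables: 2-bit codes for plain bases, 4-bit codes with
# one bit per base (A=1, C=2, G=4, T=8) for the full IUPAC alphabet.
TWO_BIT = {"A": 0, "C": 1, "G": 2, "T": 3}
FOUR_BIT = {
    "A": 1, "C": 2, "G": 4, "T": 8,
    "M": 3, "R": 5, "S": 6, "W": 9, "Y": 10, "K": 12,
    "V": 7, "H": 11, "D": 13, "B": 14, "N": 15,
}
TWO_BIT_INV = "ACGT"
FOUR_BIT_INV = {code: sym for sym, code in FOUR_BIT.items()}


def encode(seq):
    """Pack seq into flag+data batches; return (payload, sequence length)."""
    out = bytearray()
    pos, n = 0, len(seq)
    while pos < n:
        flags = 0
        data = bytearray()
        for i in range(8):  # up to 8 data bytes per flag byte
            if pos >= n:
                break
            chunk = seq[pos:pos + 4]
            if all(c in TWO_BIT for c in chunk):
                # Four 2-bit codes, first symbol in the high bits.
                # A short final chunk is padded with "A" (assumption).
                byte = 0
                for c in chunk.ljust(4, "A"):
                    byte = (byte << 2) | TWO_BIT[c]
                pos += 4
            else:
                # Two 4-bit codes; flag bit i marks this byte.
                # A lone final symbol is padded with "N" (assumption).
                pair = seq[pos:pos + 2].ljust(2, "N")
                byte = (FOUR_BIT[pair[0]] << 4) | FOUR_BIT[pair[1]]
                flags |= 1 << i
                pos += 2
            data.append(byte)
        out.append(flags)
        out += data
    return bytes(out), n


def decode(payload, length):
    """Inverse of encode; length trims any padding symbols."""
    symbols = []
    pos = 0
    while pos < len(payload):
        flags = payload[pos]
        block = payload[pos + 1:pos + 9]
        pos += 1 + len(block)
        for i, byte in enumerate(block):
            if (flags >> i) & 1:
                symbols.append(FOUR_BIT_INV[byte >> 4])
                symbols.append(FOUR_BIT_INV[byte & 15])
            else:
                for shift in (6, 4, 2, 0):
                    symbols.append(TWO_BIT_INV[(byte >> shift) & 3])
    return "".join(symbols)[:length]
```

With this sketch, 32 plain bases fit in 9 bytes (8 data bytes plus one flag
byte), and a byte switches to 4-bit codes only when one of its next four
symbols is degenerate.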

>From: Michael Hoffman  - 2008-02-12 10:44
>You might want to consider allowing the possibility of other IUPAC
>ambiguity codes than N. If you allow N, you already can't store each
>nucleotide in two bits. With four bits, you would be able to store the
>full complement of ambiguity codes (ACGT/WSKMRY/BDHV/N), and can even do
>it in an elegant way where each bit represents one of ACGT. 
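
The one-bit-per-base idea quoted above could look roughly like this in
Python. Which bit stands for which base is an assumption made for the example
(A=1, C=2, G=4, T=8), and the names BASE_BIT, IUPAC, CODE and compatible are
hypothetical.

```python
# Assumed bit assignment: one bit for each of A, C, G, T.
BASE_BIT = {"A": 1, "C": 2, "G": 4, "T": 8}

# Each IUPAC symbol and the set of bases it can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "K": "GT", "M": "AC", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# The 4-bit code of a symbol is the sum (bitwise OR) of the bits of its bases.
CODE = {sym: sum(BASE_BIT[b] for b in bases) for sym, bases in IUPAC.items()}


def compatible(x, y):
    """True if two IUPAC symbols share at least one possible base."""
    return (CODE[x] & CODE[y]) != 0
```

The 15 symbols map to the 15 non-zero 4-bit values, and checking whether two
symbols can match reduces to a single bitwise AND.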

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
