Four bits per nucleotide would waste too much space. On the other hand, the NCBI format (http://blast.wustl.edu/blast/ncbi20ntfmt.html) is too complex. I am now thinking of something like this: we encode in batches of 9 bytes. In each batch, the first byte holds flags for the next 8 bytes. If its i-th bit is 1, the i-th data byte holds two 4-bit codes; otherwise it holds four 2-bit codes. Assuming that degenerate symbols are relatively rare, this scheme costs about 12.5% extra (one flag byte per eight data bytes) on top of a plain 2-bit representation of an ATCG sequence. A flag bit could also cover a pair of bytes, or a larger group, to reduce the overhead further.
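To make the idea concrete, here is a rough Python sketch of the batch scheme, not a finished implementation. Several details are my own assumptions rather than anything settled: the function names encode and decode, uppercase-only input, the first symbol going into the high bits of each byte, the greedy choice of mode per byte, and a zero nibble as padding when the sequence ends on an odd symbol. For the 4-bit codes I borrowed the one-bit-per-base layout from the quoted message (A=1, C=2, G=4, T=8).

```python
# 4-bit codes: one bit per base (assumed layout: A=1, C=2, G=4, T=8).
BASE_BITS = {"A": 1, "C": 2, "G": 4, "T": 8}
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE4 = {sym: sum(BASE_BITS[b] for b in bases) for sym, bases in IUPAC.items()}
SYM4 = {code: sym for sym, code in CODE4.items()}

# 2-bit codes for the unambiguous bases.
CODE2 = {"A": 0, "C": 1, "G": 2, "T": 3}
SYM2 = "ACGT"


def encode(seq):
    """Pack an uppercase IUPAC string into batches of 1 flag byte + up to 8 data bytes."""
    out = bytearray()
    pos = 0
    while pos < len(seq):
        flags = 0
        data = bytearray()
        for i in range(8):
            if pos >= len(seq):
                break
            chunk = seq[pos:pos + 4]
            if len(chunk) == 4 and all(c in CODE2 for c in chunk):
                # Flag bit i stays 0: four 2-bit codes in this byte.
                byte = 0
                for c in chunk:
                    byte = (byte << 2) | CODE2[c]
                pos += 4
            else:
                # Flag bit i is set: two 4-bit codes in this byte.
                # A zero nibble pads the byte if only one symbol is left.
                flags |= 1 << i
                pair = seq[pos:pos + 2]
                byte = CODE4[pair[0]] << 4
                if len(pair) == 2:
                    byte |= CODE4[pair[1]]
                pos += 2
            data.append(byte)
        out.append(flags)
        out += data
    return bytes(out)


def decode(packed):
    """Inverse of encode(); zero nibbles are treated as padding and skipped."""
    symbols = []
    for start in range(0, len(packed), 9):
        flags = packed[start]
        for i, byte in enumerate(packed[start + 1:start + 9]):
            if (flags >> i) & 1:
                for nibble in (byte >> 4, byte & 0x0F):
                    if nibble:
                        symbols.append(SYM4[nibble])
            else:
                for shift in (6, 4, 2, 0):
                    symbols.append(SYM2[(byte >> shift) & 3])
    return "".join(symbols)
```

With this sketch, 32 plain ACGT symbols take 9 bytes (8 data bytes plus one flag byte), and decode(encode(s)) should give back s for any uppercase IUPAC string. The greedy choice is not optimal: a byte falls back to 4-bit mode whenever any of the next four symbols is ambiguous, so scattered N's cost more than clustered ones.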
> From: Michael Hoffman - 2008-02-12 10:44
> You might want to consider allowing the possibility of other IUPAC
> ambiguity codes than N. If you allow N, you already can't store each
> nucleotide in two bits. With four bits, you would be able to store the
> full complement of ambiguity codes (ACGT/WSKMRY/BDHV/N), and can even do
> it in an elegant way where each bit represents one of ACGT.

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users