@andrea Yes, I think you understand me: the alphabet of DNA and RNA can and should be represented with two bits per character. There's a ['standard'](https://genome.ucsc.edu/goldenpath/help/twoBit.html) for storing FASTA files as .2bit files for compression, but I am befuddled as to why they chose T-00, C-01, A-10, G-11. Had they chosen codes in which A-T and C-G are bitwise complements of each other, certain operations would become much simpler (e.g., [you can reverse complement a kmer stored in a 32- or 64-bit value looplessly with bitops](https://github.com/bpr/bio/blob/master/src/seq/kmers.nim)), and it just makes more sense. I use A-00, C-01, G-10, T-11, which is easy to remember because the codes follow alphabetical order.
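To make the point concrete, here is a rough sketch in Python of how the A-00, C-01, G-10, T-11 encoding lets you reverse complement without a per-base loop. It is not the Nim code from the link: the names `pack`, `unpack`, and `reverse_complement` are made up for illustration, and I'm assuming a k-mer of at most 32 bases packed into the low bits of a 64-bit value, first base in the most significant position.

```python
MASK64 = (1 << 64) - 1
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def pack(kmer):
    """Pack a k-mer (k <= 32) into an int, first base in the highest bits used."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def unpack(value, k):
    """Inverse of pack: read k bases back out of the low 2*k bits."""
    return "".join(DECODE[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def reverse_complement(value, k):
    """Reverse complement a packed k-mer using only bit operations."""
    # With this encoding, complementing a base is flipping both of its bits.
    x = ~value & MASK64
    # Reverse the order of the 32 two-bit groups by swapping neighbours,
    # then pairs, then bytes, and so on (the usual swap network).
    x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2)
    x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4)
    x = ((x >> 8) & 0x00FF00FF00FF00FF) | ((x & 0x00FF00FF00FF00FF) << 8)
    x = ((x >> 16) & 0x0000FFFF0000FFFF) | ((x & 0x0000FFFF0000FFFF) << 16)
    x = ((x >> 32) | (x << 32)) & MASK64
    # The k bases now sit in the top bits; shift them back down, which also
    # discards the complemented padding from the unused positions.
    return x >> (2 * (32 - k))


print(unpack(reverse_complement(pack("GATTACA"), 7), 7))
```

With the UCSC codes the complement step would need a lookup or extra masking instead of a single bitwise NOT, which is the whole of my complaint.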
I'll also add that while the thesis you point to is good and interesting, some things have changed since 2012, and I suggest you look at [this](https://dazzlerblog.wordpress.com/tag/poisson-sampling/) for another perspective on where genome assembly is headed, with much longer and noisier reads.
