Re: Cello: a library of string algorithms using succinct data structures

bpr Thu, 06 Apr 2017 17:15:05 +0200

@andrea Yes, I think you understand me: the alphabet of DNA and RNA can and 
should be represented with two bits per character. There's a 
['standard'](https://genome.ucsc.edu/goldenpath/help/twoBit.html) for storing 
FASTA files as .2bit files for compression, but I am befuddled as to why they 
chose T-00, C-01, A-10, G-11. If they had chosen codes that make A-T and C-G 
bitwise complements of each other, certain operations would become much 
simpler (e.g., 
[you can reverse complement a kmer stored in a 32 or 64 bit value looplessly 
with bitops](https://github.com/bpr/bio/blob/master/src/seq/kmers.nim)), and 
it just makes more sense. I use A-00, C-01, G-10, T-11, which is easy to 
remember because the bases are in alphabetical order.
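
To make the point concrete, here is a minimal sketch of a loopless reverse complement under the A-00, C-01, G-10, T-11 encoding. It is written in Python for illustration and is not the code from the linked Nim file; the names `encode`, `decode`, and `reverse_complement` are made up for this example, and it assumes the kmer sits in the low 2k bits of a 64-bit word with the first base in the highest of those bits.

```python
MASK64 = (1 << 64) - 1
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # A/T and C/G are bitwise complements
BASES = "ACGT"


def encode(kmer):
    """Pack a kmer (k <= 32) into an integer, first base in the high bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def decode(value, k):
    """Unpack k bases from the low 2k bits of value."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def reverse_complement(value, k):
    """Reverse complement a packed kmer without looping over its bases."""
    # Complement every base at once: NOT maps 00<->11 and 01<->10.
    x = ~value & MASK64
    # Reverse the order of the 32 two-bit groups by swapping ever larger blocks.
    x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2)
    x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4)
    x = ((x >> 8) & 0x00FF00FF00FF00FF) | ((x & 0x00FF00FF00FF00FF) << 8)
    x = ((x >> 16) & 0x0000FFFF0000FFFF) | ((x & 0x0000FFFF0000FFFF) << 16)
    x = (x >> 32) | ((x << 32) & MASK64)
    # The kmer now sits in the top 2k bits; shift it back down.
    return x >> (64 - 2 * k)
```

For example, `decode(reverse_complement(encode("GATTACA"), 7), 7)` gives `"TGTAATC"`. With the .2bit codes the complement step would need a lookup or extra bit twiddling instead of a single NOT.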


I'll also add that while the thesis you point to is good and interesting, some 
things have changed since 2012, and I suggest you look at 
[this](https://dazzlerblog.wordpress.com/tag/poisson-sampling/) for another 
perspective on where genome assembly is headed, with much longer and noisier 
reads.
