Hello, Ketil Malde!

On Tue, Aug 17, 2010 at 8:02 AM, Ketil Malde <ke...@malde.org> wrote:
> Ivan Lazar Miljenovic <ivan.miljeno...@gmail.com> writes:
>
>> Seeing as how the genome just uses 4 base "letters",
>
> Yes, the bulk of the data is not really "text" at all, but each sequence
> (it's fragmented due to the molecular division into chromosomes, and
> due to incompleteness) also has a textual header.  Generally, the Fasta
> format looks like this:
>
>   >sequence-id some arbitrary metadata blah blah
>   ACGATATACGCGCATGCGAT...
>   ..lines and lines of letters...
>
> (As an aside, although there are only four nucleotides (ACGT), there are
> occasional wildcard characters, the most common being N for aNy
> nucleotide, but there are defined wildcards for all subsets of the alphabet.)

As someone who knows and uses your bio package, I'm almost certain
that Text really isn't the right data type for representing
everything.  Certainly *not* for the genomic data itself.  In fact, a
representation using 4 bits per base (4 nucleotides plus 12 other
characters, such as gaps and aNy) is easy to implement using
ByteStrings with two bases per byte and should halve the space
requirements.

However, the header of each sequence is text, in the sense of human
language text, and ideally should be represented using Text.  In other
words, the sequence data type[1] currently is defined as:

  type SeqData = Data.ByteString.Lazy.ByteString
  type QualData = Data.ByteString.Lazy.ByteString
  data Sequence t = Seq !SeqData !SeqData !(Maybe QualData)

[1] http://hackage.haskell.org/packages/archive/bio/0.4.6/doc/html/Bio-Sequence-SeqData.html#t:Sequence

where the meaning is that in 'Seq header seqdata qualdata', 'header'
would be something like "sequence-id some arbitrary metadata blah
blah" and 'seqdata' would be "ACGATATACGCGCATGCGAT".  But perhaps we
should really have:

  type SeqData = Data.ByteString.Lazy.ByteString
  type QualData = Data.ByteString.Lazy.ByteString
  type HeaderData = Data.Text.Text -- strict is probably a good choice here
  data Sequence t = Seq !HeaderData !SeqData !(Maybe QualData)

Semantically, this is the right choice, putting Text where there is
text.  We can read everything with ByteStrings and then use[2]

  decodeUtf8 :: ByteString -> Text

[2] http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Encoding.html#v:decodeUtf8

only for the header bits.  There is only one problem with this
approach: UTF-8 would be hardcoded as the encoding of the input FASTA
file.  Considering that probably nobody will be using UTF-16 or UTF-32
for the whole FASTA file, there remains only UTF-8 (of which ASCII is
just a special case) and other 8-bit encodings (such as ISO8859-1,
Shift-JIS, etc.).

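
To make the two ideas above a bit more concrete, here is a rough sketch of a packed 4-bit sequence plus a Text header. None of this is the bio package's API: PackedSeq, packSeq, unpackSeq, decodeHeader and SeqRecord are names made up for illustration. The nibble assignment (one bit each for A, C, G, T, so that every wildcard is the union of its bases and a gap is the empty set) is just one possible choice, and using lenient UTF-8 decoding for headers is my assumption about how to cope with files in some other 8-bit encoding. It goes through lists and strict ByteStrings for clarity, so it says nothing about performance.

```haskell
import           Data.Bits (shiftL, shiftR, (.&.), (.|.))
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import           Data.Char (toUpper)
import qualified Data.Text as T
import           Data.Text.Encoding (decodeUtf8With)
import           Data.Text.Encoding.Error (lenientDecode)
import           Data.Word (Word8)

-- The position of a character in this string is its 4-bit code.
-- Bits: A=1, C=2, G=4, T=8; wildcards are unions (N = 15), gap = 0.
-- (Assumed layout, chosen for this sketch only.)
alphabet :: BC.ByteString
alphabet = BC.pack "-ACMGRSVTWYHKDBN"

-- Two bases per byte; the length is kept so that an odd number of
-- bases can be told apart from a trailing gap.
data PackedSeq = PackedSeq
  { packedLength :: !Int
  , packedBytes  :: !B.ByteString
  } deriving (Eq, Show)

-- Hypothetical record: Text for the header, packed bytes for the bases.
data SeqRecord = SeqRecord
  { seqHeader :: !T.Text
  , seqBases  :: !PackedSeq
  } deriving (Eq, Show)

-- Map one character to its code; Nothing for anything unknown.
-- Lower case is folded to upper case, which loses soft-masking.
encodeBase :: Char -> Maybe Word8
encodeBase c = fromIntegral <$> BC.elemIndex (toUpper c) alphabet

-- Pack a sequence, first base in the high nibble of each byte.
packSeq :: BC.ByteString -> Maybe PackedSeq
packSeq s = do
  codes <- mapM encodeBase (BC.unpack s)
  return (PackedSeq (length codes) (B.pack (pairUp codes)))
  where
    pairUp (hi : lo : rest) = (hi `shiftL` 4 .|. lo) : pairUp rest
    pairUp [hi]             = [hi `shiftL` 4]
    pairUp []               = []

-- Inverse of packSeq (up to case folding).
unpackSeq :: PackedSeq -> BC.ByteString
unpackSeq (PackedSeq n bytes) =
  BC.take n (BC.pack (concatMap nibbles (B.unpack bytes)))
  where
    nibbles w = [decode (w `shiftR` 4), decode (w .&. 0x0F)]
    decode    = BC.index alphabet . fromIntegral

-- Decode a header as UTF-8; invalid bytes become U+FFFD instead of
-- raising an exception (an assumption, not what decodeUtf8 does).
decodeHeader :: B.ByteString -> T.Text
decodeHeader = decodeUtf8With lenientDecode

-- Build a record from the raw header and sequence bytes.
fromFastaParts :: B.ByteString -> B.ByteString -> Maybe SeqRecord
fromFastaParts hdr bases = SeqRecord (decodeHeader hdr) <$> packSeq bases
```

With this layout, packSeq (BC.pack "ACGTN") stores five bases in three bytes, and unpackSeq gives the original string back. A header that is not valid UTF-8 still decodes, just with replacement characters where the bad bytes were.
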
I haven't seen a FASTA file with characters outside the ASCII range
yet, but I guess the choice of UTF-8 shouldn't be a big problem.

>> wouldn't it be better to not treat it as text but use something else?
>
> I generally use ByteStrings, with the .Char8 interface if/when
> appropriate.  This is actually a pretty good choice; even if people use
> Unicode in the headers, I don't particularly want to care - as long as
> it is transparent.  In some cases, I'd like to, say, search headers for
> some specific string - in these cases, a nice, tidy, rich, and optimized
> Data.ByteString(.Lazy).UTF8 would be nice.  (But obviously not terribly
> essential at the moment, since I haven't bothered to test the available
> options.  I guess for my stuff, the (human consumable) text bits are
> neither very performance intensive, nor large, so I could probably and
> fairly cheaply wrap relevant operations or fields with Data.Text's
> {de,en}codeUtf8.  And in practice - partly due to lacking software
> support, I'm sure - it's all ASCII anyway. :-)

Oh, so I didn't read this paragraph closely enough :).  In this e-mail
I'm basically agreeing with your thoughts here =).

And what do you think about creating a real SeqData data type with two
bases per byte?  In terms of processing speed I guess there will be a
small penalty, but if you need to have large quantities of base pairs
in memory this would double your capacity =).

Cheers,

-- 
Felipe.
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe