Hey, Christian! Sorry for replying to an old e-mail. I've rewritten biostockholm using two layers, one that parses ByteStrings into Events and another one that parses Events into an Stockholm data structure as before. An Event is something very simple, akin to what a SAX parser would give you.
As an example, suppose that you wanted to take only the accession codes of the families on a gzipped Stockholm file. With the new biostockholm library, it's as simple as: {-# LANGUAGE OverloadedStrings #-} import Bio.Sequence.Stockholm.Stream (parseEvents, Event(..)) import Data.Conduit (runResourceT, ($$), (=$)) import Data.Conduit.Zlib (ungzip) import System.Environment (getArgs) import qualified Data.Conduit.Binary as CB import qualified Data.Conduit.List as CL main :: IO () main = getArgs >>= \[input] -> runResourceT (pipeline input) where pipeline input = CB.sourceFile input $$ ungzip =$ parseEvents =$ CL.mapM_ printName printName (EvGF "AC" name) = print name printName _ = return () Running this program with Rfam 9.1's full Stockholm file takes 136s, uses 2 MiB of memory and has an alloc rate of 1.7 GB/s on my machine =). This translates to 12 MiB/s of uncompressed data or 1.3 MiB/s of compressed data. (Of course an special parser looking for '#=GF AC' would be much faster, but the point is that the new streaming parser is fast enough.) I'll do a proper release on Hackage with an announcement here later, but you may grab the code right now with "darcs get http://patch-tag.com/r/felipe/biostockholm". Cheers! =D On Sat, Dec 3, 2011 at 6:11 PM, Christian Höner zu Siederdissen <choe...@tbi.univie.ac.at> wrote: > I've played with this a bit. There are two possibilities, either fast > parsing and high memory overhead (somewhere in the range of 3-?x) or > slow parsing and very low overhead -- mostly determined by either > repeatedly copying strict bytestrings or concatenating lazy ones. > > I'm currently thinking the latter is better. Consider the Rfam-10.1 tRNA > model: 1,542,5748 lines and 1,447,813,039 byte. This can be parsed in > 927 seconds at 2236 Mbyte total memory in use according to the "-s" > option. > > Obviously, the parser is not complete. Right now, you get all of the > key/value pairs in a map. A second map of all the sequence names to INT > identifiers and a vector of bytestrings with the sequence information, > which can be indexed by the sequence-name map. > > (And in the 2236 mbyte, there is some cheating hidden in there, as the > whitespace bytes between sequence name and sequence data can be > dropped). > > The parser is one-pass and works off compressed files which can be huge > win compared to mmap (which requires uncompressed files), albeit with > longer runtimes. I do prefer compressed data, however and would rather > have all Infernal-related stuff have gzip decompression built-in (which > is mostly true for BiobaseInfernal). > > Furthermore, one-pass statistics can be done in constant speed and > incredibly low memory. We are mainly bound by disk speed in that case. > > What I currently don't know is how often we'd actually need full.gz > data, hence if it makes sense to put much time into this. I'll clean > everything up and put it on github, anyway. > > > Gruss, > Christian > > * Felipe Almeida Lessa <felipe.le...@gmail.com> [01.12.2011 17:01]: >> On Thu, Dec 1, 2011 at 12:56 PM, Christian Höner zu Siederdissen >> <choe...@tbi.univie.ac.at> wrote: >> > ok a very simple system parses Rfam-9.1.full.gz in 2800 Mbyte and 42 >> > seconds. What you get is each "STOCKHOLM" to "//" range as a list of >> > bytestring lines. That is still "kind of" suboptimal and I'd like to >> > change some stuff. If in any way possible, I'd like to get memory >> > consumption down to the number of bytes you need for the full model plus >> > a small overhead. >> > >> > I think, there will always be some overhead due to late garbage >> > collection, but if we can parse the complete Rfam.full.gz in less than, >> > say 4 gbyte, it would be extremely cool. >> >> I've changed my test program to use mmap'ed files. The memory usage >> is almost the same. I've attached a heap profile. >> >> So, almost 2/3 of the memory is spent with 'Chunk's and almost 1/3 of >> the memory are lists. In biostockholm I represent unfinished parts of >> the Stockholm file as a list of chunks, before concat'enating them. >> Given that with that mmap the input is a single big bytestring, *all* >> of these Chunks are of the form "Chunk something Empty", which means >> that they use only 2 words (IIRC) more than a strict bytestring would. >> A strict bytestring uses 9 words [1]. Each (:) takes at least 3 >> words. So each part of a sequence uses at least 14 words, which on my >> machine means 112 bytes. >> >> In sum, the problem then seems to be the fact that Rfam's files break >> sequences on something like 60 or 70 columns. That means that even >> with an mmap'ed input we need to keep a list of pointers to every >> single piece. Unfortunately, those "pointers" are very costly. If >> you really need something fast here, you'd need to do something like: >> >> a) Stop-gap measure: change biostockholm to strictly append strict >> ByteStrings, using O(n²) time copying them but using only a constant >> amount of overhead per sequence. >> >> b) Clever solution: use a two-pass algorithm when reading families. >> On a first pass, calculate the length of the multiple alignment. On >> the second pass, allocate mutable buffers with the correct size and >> then copy the contents there. >> >> Cheers, >> >> [1] >> http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html >> >> -- >> Felipe. > > > -- Felipe. _______________________________________________ Biohaskell mailing list Biohaskell@biohaskell.org http://malde.org/cgi-bin/mailman/listinfo/biohaskell