Hi, I've played with this a bit. There are two possibilities, either fast parsing and high memory overhead (somewhere in the range of 3-?x) or slow parsing and very low overhead -- mostly determined by either repeatedly copying strict bytestrings or concatenating lazy ones.
I'm currently thinking the latter is better. Consider the Rfam-10.1 tRNA model: 1,542,5748 lines and 1,447,813,039 byte. This can be parsed in 927 seconds at 2236 Mbyte total memory in use according to the "-s" option. Obviously, the parser is not complete. Right now, you get all of the key/value pairs in a map. A second map of all the sequence names to INT identifiers and a vector of bytestrings with the sequence information, which can be indexed by the sequence-name map. (And in the 2236 mbyte, there is some cheating hidden in there, as the whitespace bytes between sequence name and sequence data can be dropped). The parser is one-pass and works off compressed files which can be huge win compared to mmap (which requires uncompressed files), albeit with longer runtimes. I do prefer compressed data, however and would rather have all Infernal-related stuff have gzip decompression built-in (which is mostly true for BiobaseInfernal). Furthermore, one-pass statistics can be done in constant speed and incredibly low memory. We are mainly bound by disk speed in that case. What I currently don't know is how often we'd actually need full.gz data, hence if it makes sense to put much time into this. I'll clean everything up and put it on github, anyway. Gruss, Christian * Felipe Almeida Lessa <felipe.le...@gmail.com> [01.12.2011 17:01]: > On Thu, Dec 1, 2011 at 12:56 PM, Christian Höner zu Siederdissen > <choe...@tbi.univie.ac.at> wrote: > > ok a very simple system parses Rfam-9.1.full.gz in 2800 Mbyte and 42 > > seconds. What you get is each "STOCKHOLM" to "//" range as a list of > > bytestring lines. That is still "kind of" suboptimal and I'd like to > > change some stuff. If in any way possible, I'd like to get memory > > consumption down to the number of bytes you need for the full model plus > > a small overhead. > > > > I think, there will always be some overhead due to late garbage > > collection, but if we can parse the complete Rfam.full.gz in less than, > > say 4 gbyte, it would be extremely cool. > > I've changed my test program to use mmap'ed files. The memory usage > is almost the same. I've attached a heap profile. > > So, almost 2/3 of the memory is spent with 'Chunk's and almost 1/3 of > the memory are lists. In biostockholm I represent unfinished parts of > the Stockholm file as a list of chunks, before concat'enating them. > Given that with that mmap the input is a single big bytestring, *all* > of these Chunks are of the form "Chunk something Empty", which means > that they use only 2 words (IIRC) more than a strict bytestring would. > A strict bytestring uses 9 words [1]. Each (:) takes at least 3 > words. So each part of a sequence uses at least 14 words, which on my > machine means 112 bytes. > > In sum, the problem then seems to be the fact that Rfam's files break > sequences on something like 60 or 70 columns. That means that even > with an mmap'ed input we need to keep a list of pointers to every > single piece. Unfortunately, those "pointers" are very costly. If > you really need something fast here, you'd need to do something like: > > a) Stop-gap measure: change biostockholm to strictly append strict > ByteStrings, using O(n²) time copying them but using only a constant > amount of overhead per sequence. > > b) Clever solution: use a two-pass algorithm when reading families. > On a first pass, calculate the length of the multiple alignment. On > the second pass, allocate mutable buffers with the correct size and > then copy the contents there. > > Cheers, > > [1] > http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html > > -- > Felipe.
pgp0SwxwWN3Kl.pgp
Description: PGP signature
_______________________________________________ Biohaskell mailing list Biohaskell@biohaskell.org http://malde.org/cgi-bin/mailman/listinfo/biohaskell