Hi,

OK, a very simple system parses Rfam-9.1.full.gz using 2800 MByte of memory in 42 seconds. What you get is each "STOCKHOLM" to "//" range as a list of bytestring lines. That is still somewhat suboptimal, and I'd like to change a few things. If at all possible, I'd like to get memory consumption down to the number of bytes needed for the full model plus a small overhead.
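
For illustration only, here is a minimal sketch of the splitting step described above. It is not the code referred to in this message: the names `Record` and `records` are hypothetical, the choice of strict bytestrings for the individual lines is an assumption, and so is the use of a plain lazy bytestring as input.

```haskell
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L

-- | One record (hypothetical type): the "# STOCKHOLM" header line and all
-- following lines up to, but not including, the "//" terminator.
type Record = [S.ByteString]

-- | Lazily split the input into records. Lines before the first header are
-- skipped. A final record without a "//" terminator is still returned.
records :: L.ByteString -> [Record]
records = go . L.lines
  where
    go ls = case dropWhile (not . isHeader) ls of
      []         -> []
      (h : rest) ->
        let (body, rest') = break isEnd rest
        in  map L.toStrict (h : body) : go (drop 1 rest')

    -- Assumption: every alignment starts with a line like "# STOCKHOLM 1.0".
    isHeader = L.isPrefixOf (L.pack "# STOCKHOLM")

    -- Assumption: the terminator line is exactly "//", no trailing whitespace.
    isEnd l = l == L.pack "//"
```

Because the result list is produced lazily, a consumer that handles one record at a time and then drops it does not need to hold the whole file. Note that `L.toStrict` may share the underlying input buffer when a line sits inside a single chunk, so this sketch alone makes no promise about when the input can be garbage collected.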
I think there will always be some overhead due to late garbage collection, but if we can parse the complete Rfam.full.gz in less than, say, 4 GByte, it would be extremely cool.

Code will be available soon ;-)

Regards,
Christian

* Felipe Almeida Lessa <felipe.le...@gmail.com> [01.12.2011 00:59]:
> On Wed, Nov 30, 2011 at 9:54 PM, Christian Höner zu Siederdissen
> <choe...@tbi.univie.ac.at> wrote:
> > I'll give an extremely simple iteratee-based parser a shot on parsing
> > Rfam 10.1 full. It contains at least two huge alignments (tRNA has
> > 1 000 000 sequences, I think) and SSU-rRNA is probably bad as well. If I
> > can keep the memory consumption slightly above what that is in bytes,
> > I'll let you know and we can consider extending that...
>
> biostockholm successfully parsed Rfam 9.1's tRNA using 900 MiB of memory.
>
> Good luck implementing something "extremely simple" that reads
> Stockholm files =).
>
> > Iteratees could be helpful as one can discard everything from memory
> > that is not explicitly kept -- of course for the test I'll fake "full
> > parsing" of individual families.
> >
> > But I would not have expected to see such bad memory behaviour as you
> > are using lazy bytestrings. Maybe putting in "ByteString.copy" would
> > help when creating the individual sequences, making sure that the input
> > stream can be completely garbage collected.
>
> I've tried doing this and the memory usage got worse (besides taking
> more time). Actually, the whole Rfam 9.1 full file is less than 2
> GiB uncompressed, so I don't think this is the issue. I'd need to do
> some heap profiles to identify the culprit.
>
> Cheers,
>
> --
> Felipe.
_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://malde.org/cgi-bin/mailman/listinfo/biohaskell