Hi,

OK, a very simple system parses Rfam-9.1.full.gz using 2800 MByte of
memory in 42 seconds. What you get is each "STOCKHOLM" to "//" range as
a list of bytestring lines. That is still somewhat suboptimal and I'd
like to change a few things. If at all possible, I'd like to get memory
consumption down to the number of bytes needed for the full model plus
a small overhead.
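
To make the "STOCKHOLM" to "//" splitting concrete, here is a minimal
sketch of the idea. It is not the code I mentioned, just an illustration
under some assumptions: the input is already decompressed and arrives on
stdin, a record starts at a line beginning with "# STOCKHOLM" and ends at
a line that is exactly "//", and the names Record and records are made up
for this example. Each line is copied into its own strict bytestring so
that it no longer points into the input chunks. Whether this actually
keeps memory down depends on how the consumer forces the results.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8      as B
import qualified Data.ByteString.Lazy.Char8 as BL

-- | One record: the lines between a "# STOCKHOLM" header and the
-- closing "//" (both excluded), as strict bytestrings.
-- (Hypothetical type name, for illustration only.)
type Record = [B.ByteString]

-- | Lazily split a lazy bytestring into Stockholm records.
-- Lines outside of a header .. "//" range are skipped.
records :: BL.ByteString -> [Record]
records = go . BL.lines
  where
    go :: [BL.ByteString] -> [Record]
    go ls = case dropWhile (not . isHeader) ls of
      []         -> []
      (_ : rest) ->
        let (body, after) = break isEnd rest
        in  map detach body : go (drop 1 after)

    isHeader :: BL.ByteString -> Bool
    isHeader = BL.isPrefixOf "# STOCKHOLM"

    isEnd :: BL.ByteString -> Bool
    isEnd = (== "//")

    -- Copy the line into a fresh strict buffer so it does not keep
    -- the (much larger) input chunk alive.
    detach :: BL.ByteString -> B.ByteString
    detach = B.copy . BL.toStrict

-- | Print the number of lines of each record read from stdin.
main :: IO ()
main = do
  input <- BL.getContents
  mapM_ (print . length) (records input)
```

Something like "zcat Rfam.full.gz | ./stockholm-split" would then print
one line count per family (the program name is made up as well).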

I think there will always be some overhead due to late garbage
collection, but if we can parse the complete Rfam.full.gz in less than,
say, 4 GByte, it would be extremely cool.

Code will be available soon ;-)

Gruss,
Christian

* Felipe Almeida Lessa <felipe.le...@gmail.com> [01.12.2011 00:59]:
> On Wed, Nov 30, 2011 at 9:54 PM, Christian Höner zu Siederdissen
> <choe...@tbi.univie.ac.at> wrote:
> > I'll give an extremely simple iteratee-based parser a shot on parsing
> > Rfam 10.1 full. It contains at least two huge alignments (tRNA has
> > 1 000 000 sequences, I think) and SSU-rRNA is probably bad as well. If I
> > can keep the memory consumption slightly above what that is in bytes,
> > I'll let you know and we can consider extending that...
> 
> biostockholm successfully parsed Rfam 9.1's tRNA using 900 MiB of memory.
> 
> Good luck implementing something "extremely simple" that reads
> Stockholm files =).
> 
> > Iteratees could be helpful as one can discard everything from memory
> > that is not explicitly kept -- of course for the test I'll fake "full
> > parsing" of individual families.
> >
> > But I would not have expected to see such bad memory behaviour as you
> > are using lazy bytestrings. Maybe putting in "ByteString.copy" would
> > help when creating the individual sequences, making sure that the input
> > stream can be completely garbage collected.
> 
> I've tried doing this and the memory usage got worse (besides taking
> more time).  Actually, the whole Rfam 9.1 full file is less than 2
> GiB uncompressed, so I don't think this is the issue.  I'd need to do
> some heap profiles to identify the culprit.
> 
> Cheers,
> 
> -- 
> Felipe.

_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://malde.org/cgi-bin/mailman/listinfo/biohaskell
