On Wed, Nov 30, 2011 at 1:00 PM, Felipe Almeida Lessa <felipe.le...@gmail.com> wrote:
> On Wed, Nov 30, 2011 at 12:28 PM, Christian Höner zu Siederdissen
> <choe...@tbi.univie.ac.at> wrote:
>> No, I didn't see it. I blame it on not having had breakfast yet (and yes
>> I'm still in Austria...). I'll take a look at your package. Can you
>> parse the complete Rfam.full.gz? If not, maybe I'll write an adaptor for
>> iteratee...
>
> I've never tried, so I just did. Attached is my test program.
>
> With Rfam 10.1 seed: 41033 sequences, 1.7s, 10 MiB of memory
> With Rfam 9.1 seed: 27292 sequences, 1.1s, 9 MiB of memory
> With Rfam 9.1 full: stack overflow =D.
>
> So the answer right now is "no" =(. Given that it's a stack overflow,
> I guess it is a bug somewhere, not something inherent to lazy
> processing.
I've found and fixed the stack overflow. I've also reorganized the
internal implementation to use less memory, so both successful tests
above now use only 5 MiB of memory. These changes are already on
Hackage in version 0.1.0.1.

Now, when parsing Rfam 9.1's full file, it parses everything correctly
until RF00177 (SSU_rRNA_bacteria), which is a big family in both
length and number of organisms. On reaching RF00177, it runs out of
heap memory (I ran it with +RTS -M3000M). I think that it would be
able to parse it given more memory, but I can't do that on my notebook.

Now, there are two questions:

a) Is it possible to implement the parser using less memory? Maybe,
but it's not straightforward (for me, at least =).

b) Would an iteratee-based approach be better? I doubt it. The
Stockholm format is made for humans, not machines, so information may
be scattered all over the place (and it is with Rfam files!).
Basically, you can't produce anything useful before parsing the whole
family. The best you could do is give bits of the sequences to the
iteratee, but I can't imagine how that would be useful, since the
sequences would be interleaved depending on how the Stockholm file was
created.

Cheers,

--
Felipe.
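
P.S. To make the interleaving point in (b) concrete, here is a minimal
sketch. It is not the code from the package: the names `seqLine`,
`joinChunks` and `example`, and the tiny two-block alignment, are made
up for illustration, and it assumes every sequence line has the form
"name chunk". It shows that a sequence is only complete once every
block of the family has been read, so nothing whole can be handed on
before the closing "//".

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')
import Data.Maybe (mapMaybe)

-- A made-up interleaved alignment: each sequence is split over two blocks.
example :: [String]
example =
  [ "# STOCKHOLM 1.0"
  , "seqA  ACGU"
  , "seqB  AC-U"
  , ""
  , "seqA  GGCA"
  , "seqB  GG-A"
  , "//"
  ]

-- Keep only "name chunk" lines; markup lines (starting with '#'),
-- blank lines and the "//" terminator are dropped.
seqLine :: String -> Maybe (String, String)
seqLine l = case words l of
  [name@(c:_), chunk] | c /= '#' -> Just (name, chunk)
  _                              -> Nothing

-- Collect the chunks of each sequence (newest first), then glue them
-- together in file order once all of the input has been consumed.
joinChunks :: [(String, String)] -> M.Map String String
joinChunks = M.map (concat . reverse) . foldl' step M.empty
  where
    step acc (name, chunk) = M.insertWith (++) name [chunk] acc

main :: IO ()
main = mapM_ print (M.toList (joinChunks (mapMaybe seqLine example)))
-- prints ("seqA","ACGUGGCA") and then ("seqB","AC-UGG-A")
```

Note that the map has to hold a partial entry for every sequence in
the family until the end, which is the memory cost discussed above.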