On Wed, Nov 30, 2011 at 1:00 PM, Felipe Almeida Lessa
<felipe.le...@gmail.com> wrote:
> On Wed, Nov 30, 2011 at 12:28 PM, Christian Höner zu Siederdissen
> <choe...@tbi.univie.ac.at> wrote:
>> No, I didn't see it. I blame it on not having had breakfast yet (and yes
>> I'm still in Austria...). I'll take a look at your package. Can you
>> parse the complete Rfam.full.gz? If not, maybe I'll write an adaptor for
>> iteratee...
>
> I've never tried, so I just did.  Attached is my test program.
>
> With Rfam 10.1 seed: 41033 sequences, 1.7s, 10 MiB of memory
> With Rfam 9.1 seed: 27292 sequences, 1.1s, 9 MiB of memory
> With Rfam 9.1 full: stack overflow =D.
>
> So the answer right now is "no" =(.  Given that it's a stack overflow,
> I guess it is a bug somewhere, not something inherent to lazy
> processing.

I've found and fixed the stack overflow.  I've also reorganized the
internal implementation to use less memory, so both successful tests
above use only 5 MiB of memory.  These changes are already on Hackage
in version 0.1.0.1.

Now, when parsing Rfam 9.1's full file, it parses everything correctly
until RF00177 (SSU_rRNA_bacteria), which is a big family in both
length and number of organisms.  When reaching RF00177, it runs out
of heap memory (I've used +RTS -M3000M).  I think that it would
be able to parse it given more memory, but I can't do that on my
notebook.

Now, there are two questions:

  a) Is it possible to implement the parser using less memory?  Maybe,
but it's not straightforward (for me, at least =).

  b) Would an iteratee-based approach be better?  I doubt it.  The
Stockholm format is made for humans, not machines, so information may
be scattered all over the place (and it is with Rfam files!).
Basically, you can't produce anything useful before parsing the whole
family.  The best you could do is give bits of the sequences to the
iteratee, but I can't imagine how that would be useful, since the
sequences would be interleaved depending on how the Stockholm file was
created.
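
To illustrate the interleaving problem, here is a minimal sketch (not
the package's actual implementation) that merges interleaved sequence
lines by name.  The names mergeInterleaved and exampleLines are made up
for this example, and it assumes each sequence line is just
"name chunk" separated by whitespace.  Markup lines (starting with
'#') are ignored.

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- | Merge interleaved Stockholm-style sequence lines into one
-- sequence per name.  Hypothetical helper, for illustration only.
mergeInterleaved :: [String] -> M.Map String String
mergeInterleaved = M.map (concat . reverse) . foldl' step M.empty
  where
    -- Chunks are prepended per name, hence the 'reverse' above.
    step acc line = case words line of
      [name, chunk] | not (isMarkup line) ->
        M.insertWith (++) name [chunk] acc
      _ -> acc  -- blank lines, markup lines, "//" terminator
    isMarkup l = take 1 l == "#" || take 2 l == "//"

-- | A tiny made-up alignment split over two blocks.
exampleLines :: [String]
exampleLines =
  [ "# STOCKHOLM 1.0"
  , "seqA  ACGU"
  , "seqB  GGCC"
  , ""
  , "seqA  UUAA"
  , "seqB  AACC"
  , "//"
  ]
```

Here the complete sequence for seqA is only known after the last block
has been read, so the whole map has to stay in memory until the "//"
line.  With a family as large as RF00177 that map is what gets big.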

Cheers,

-- 
Felipe.
_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://malde.org/cgi-bin/mailman/listinfo/biohaskell
