Re: [Biohaskell] Infernal-related stuff

Christian Höner zu Siederdissen Sat, 03 Dec 2011 12:11:52 -0800

Hi,

I've played with this a bit. There are two possibilities, either fast
parsing and high memory overhead (somewhere in the range of 3-?x) or
slow parsing and very low overhead -- mostly determined by either
repeatedly copying strict bytestrings or concatenating lazy ones.

I'm currently thinking the latter is better. Consider the Rfam-10.1 tRNA
model: 1,542,5748 lines and 1,447,813,039 byte. This can be parsed in
927 seconds at 2236 Mbyte total memory in use according to the "-s"
option.

Obviously, the parser is not complete. Right now, you get all of the
key/value pairs in a map. A second map of all the sequence names to INT
identifiers and a vector of bytestrings with the sequence information,
which can be indexed by the sequence-name map.

(And in the 2236 mbyte, there is some cheating hidden in there, as the
whitespace bytes between sequence name and sequence data can be
dropped).

The parser is one-pass and works off compressed files which can be huge
win compared to mmap (which requires uncompressed files), albeit with
longer runtimes. I do prefer compressed data, however and would rather
have all Infernal-related stuff have gzip decompression built-in (which
is mostly true for BiobaseInfernal).

Furthermore, one-pass statistics can be done in constant speed and
incredibly low memory. We are mainly bound by disk speed in that case.

What I currently don't know is how often we'd actually need full.gz
data, hence if it makes sense to put much time into this. I'll clean
everything up and put it on github, anyway.

Gruss,
Christian

* Felipe Almeida Lessa <felipe.le...@gmail.com> [01.12.2011 17:01]:
> On Thu, Dec 1, 2011 at 12:56 PM, Christian Höner zu Siederdissen
> <choe...@tbi.univie.ac.at> wrote:
> > ok a very simple system parses Rfam-9.1.full.gz in 2800 Mbyte and 42
> > seconds. What you get is each "STOCKHOLM" to "//" range as a list of
> > bytestring lines. That is still "kind of" suboptimal and I'd like to
> > change some stuff. If in any way possible, I'd like to get memory
> > consumption down to the number of bytes you need for the full model plus
> > a small overhead.
> >
> > I think, there will always be some overhead due to late garbage
> > collection, but if we can parse the complete Rfam.full.gz in less than,
> > say 4 gbyte, it would be extremely cool.
> 
> I've changed my test program to use mmap'ed files.  The memory usage
> is almost the same.  I've attached a heap profile.
> 
> So, almost 2/3 of the memory is spent with 'Chunk's and almost 1/3 of
> the memory are lists.  In biostockholm I represent unfinished parts of
> the Stockholm file as a list of chunks, before concat'enating them.
> Given that with that mmap the input is a single big bytestring, *all*
> of these Chunks are of the form "Chunk something Empty", which means
> that they use only 2 words (IIRC) more than a strict bytestring would.
>  A strict bytestring uses 9 words [1].  Each (:) takes at least 3
> words.  So each part of a sequence uses at least 14 words, which on my
> machine means 112 bytes.
> 
> In sum, the problem then seems to be the fact that Rfam's files break
> sequences on something like 60 or 70 columns.  That means that even
> with an mmap'ed input we need to keep a list of pointers to every
> single piece.  Unfortunately, those "pointers" are very costly.  If
> you really need something fast here, you'd need to do something like:
> 
>   a) Stop-gap measure: change biostockholm to strictly append strict
> ByteStrings, using O(n²) time copying them but using only a constant
> amount of overhead per sequence.
> 
>   b) Clever solution: use a two-pass algorithm when reading families.
> On a first pass, calculate the length of the multiple alignment.  On
> the second pass, allocate mutable buffers with the correct size and
> then copy the contents there.
> 
> Cheers,
> 
> [1] 
> http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
> 
> -- 
> Felipe.

pgp0SwxwWN3Kl.pgp
Description: PGP signature

_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://malde.org/cgi-bin/mailman/listinfo/biohaskell

Re: [Biohaskell] Infernal-related stuff

Reply via email to