Re: [Biohaskell] Infernal-related stuff

Felipe Almeida Lessa Thu, 26 Jan 2012 08:52:31 -0800

Hey, Christian!

Sorry for replying to an old e-mail.  I've rewritten biostockholm
using two layers, one that parses ByteStrings into Events and another
one that parses Events into an Stockholm data structure as before.  An
Event is something very simple, akin to what a SAX parser would give
you.


As an example, suppose that you wanted to take only the accession
codes of the families on a gzipped Stockholm file.  With the new
biostockholm library, it's as simple as:

{-# LANGUAGE OverloadedStrings #-}

import Bio.Sequence.Stockholm.Stream (parseEvents, Event(..))
import Data.Conduit (runResourceT, ($$), (=$))
import Data.Conduit.Zlib (ungzip)
import System.Environment (getArgs)
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL

main :: IO ()
main = getArgs >>= \[input] -> runResourceT (pipeline input)
    where
      pipeline input =  CB.sourceFile input
                     $$ ungzip
                     =$ parseEvents
                     =$ CL.mapM_ printName

      printName (EvGF "AC" name) = print name
      printName _                = return ()

Running this program with Rfam 9.1's full Stockholm file takes 136s,
uses 2 MiB of memory and has an alloc rate of 1.7 GB/s on my machine
=).  This translates to 12 MiB/s of uncompressed data or 1.3 MiB/s of
compressed data.  (Of course an special parser looking for '#=GF AC'
would be much faster, but the point is that the new streaming parser
is fast enough.)

I'll do a proper release on Hackage with an announcement here later,
but you may grab the code right now with "darcs get
http://patch-tag.com/r/felipe/biostockholm";.

Cheers! =D

On Sat, Dec 3, 2011 at 6:11 PM, Christian Höner zu Siederdissen
<choe...@tbi.univie.ac.at> wrote:
> I've played with this a bit. There are two possibilities, either fast
> parsing and high memory overhead (somewhere in the range of 3-?x) or
> slow parsing and very low overhead -- mostly determined by either
> repeatedly copying strict bytestrings or concatenating lazy ones.
>
> I'm currently thinking the latter is better. Consider the Rfam-10.1 tRNA
> model: 1,542,5748 lines and 1,447,813,039 byte. This can be parsed in
> 927 seconds at 2236 Mbyte total memory in use according to the "-s"
> option.
>
> Obviously, the parser is not complete. Right now, you get all of the
> key/value pairs in a map. A second map of all the sequence names to INT
> identifiers and a vector of bytestrings with the sequence information,
> which can be indexed by the sequence-name map.
>
> (And in the 2236 mbyte, there is some cheating hidden in there, as the
> whitespace bytes between sequence name and sequence data can be
> dropped).
>
> The parser is one-pass and works off compressed files which can be huge
> win compared to mmap (which requires uncompressed files), albeit with
> longer runtimes. I do prefer compressed data, however and would rather
> have all Infernal-related stuff have gzip decompression built-in (which
> is mostly true for BiobaseInfernal).
>
> Furthermore, one-pass statistics can be done in constant speed and
> incredibly low memory. We are mainly bound by disk speed in that case.
>
> What I currently don't know is how often we'd actually need full.gz
> data, hence if it makes sense to put much time into this. I'll clean
> everything up and put it on github, anyway.
>
>
> Gruss,
> Christian
>
> * Felipe Almeida Lessa <felipe.le...@gmail.com> [01.12.2011 17:01]:
>> On Thu, Dec 1, 2011 at 12:56 PM, Christian Höner zu Siederdissen
>> <choe...@tbi.univie.ac.at> wrote:
>> > ok a very simple system parses Rfam-9.1.full.gz in 2800 Mbyte and 42
>> > seconds. What you get is each "STOCKHOLM" to "//" range as a list of
>> > bytestring lines. That is still "kind of" suboptimal and I'd like to
>> > change some stuff. If in any way possible, I'd like to get memory
>> > consumption down to the number of bytes you need for the full model plus
>> > a small overhead.
>> >
>> > I think, there will always be some overhead due to late garbage
>> > collection, but if we can parse the complete Rfam.full.gz in less than,
>> > say 4 gbyte, it would be extremely cool.
>>
>> I've changed my test program to use mmap'ed files.  The memory usage
>> is almost the same.  I've attached a heap profile.
>>
>> So, almost 2/3 of the memory is spent with 'Chunk's and almost 1/3 of
>> the memory are lists.  In biostockholm I represent unfinished parts of
>> the Stockholm file as a list of chunks, before concat'enating them.
>> Given that with that mmap the input is a single big bytestring, *all*
>> of these Chunks are of the form "Chunk something Empty", which means
>> that they use only 2 words (IIRC) more than a strict bytestring would.
>>  A strict bytestring uses 9 words [1].  Each (:) takes at least 3
>> words.  So each part of a sequence uses at least 14 words, which on my
>> machine means 112 bytes.
>>
>> In sum, the problem then seems to be the fact that Rfam's files break
>> sequences on something like 60 or 70 columns.  That means that even
>> with an mmap'ed input we need to keep a list of pointers to every
>> single piece.  Unfortunately, those "pointers" are very costly.  If
>> you really need something fast here, you'd need to do something like:
>>
>>   a) Stop-gap measure: change biostockholm to strictly append strict
>> ByteStrings, using O(n²) time copying them but using only a constant
>> amount of overhead per sequence.
>>
>>   b) Clever solution: use a two-pass algorithm when reading families.
>> On a first pass, calculate the length of the multiple alignment.  On
>> the second pass, allocate mutable buffers with the correct size and
>> then copy the contents there.
>>
>> Cheers,
>>
>> [1] 
>> http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
>>
>> --
>> Felipe.
>
>
>



-- 
Felipe.
_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://malde.org/cgi-bin/mailman/listinfo/biohaskell

Re: [Biohaskell] Infernal-related stuff

Reply via email to