Re: Dipping my toes in Haskell, via Bio.Sequence.Fastq

Bob Ippolito Wed, 22 Jul 2015 10:02:12 -0700

Here are some more tips on strictness and Haskell's evaluation model:

http://chimera.labs.oreilly.com/books/1230000000929/ch02.html#sec_par-eval-whnf
https://hackhands.com/lazy-evaluation-works-haskell/
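
To make the weak head normal form point from those links concrete, here is a minimal sketch of a strict running count and total. It is only an illustration, not the bio package's API: the names step, summarise and mean are made up for this example, and a plain list of record lengths stands in for the sequences that readIllumina would return.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Hypothetical helper: fold one record length into (count, total).
-- The bang patterns force both components of the incoming accumulator,
-- so no chain of (+) thunks builds up inside the tuple.
step :: (Int, Integer) -> Int -> (Int, Integer)
step (!n, !total) len = (n + 1, total + toInteger len)

-- foldl' evaluates the accumulator to weak head normal form at each
-- step. For a tuple that is only the outer constructor, which is why
-- the bang patterns in step are needed as well.
summarise :: [Int] -> (Int, Integer)
summarise = foldl' step (0, 0)

-- Mean record length; returns 0 for empty input (an arbitrary choice
-- made for this sketch).
mean :: (Int, Integer) -> Double
mean (0, _)     = 0
mean (n, total) = fromIntegral total / fromIntegral n
```

With plain foldl and a lazy tuple, the additions are deferred until the very end and are then evaluated as one deeply nested expression, which is the usual source of the stack overflow on large inputs.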


On Wed, Jul 22, 2015 at 6:20 AM, Christian Höner zu Siederdissen <
choe...@bioinf.uni-leipzig.de> wrote:

> Hi Adam,
>
> welcome ;-)
>
> Ketil is the author of the bio package and might have more detailed
> comments. In principle, though, you seem to have averted the most typical
> problem people have when trying to write some kind of 'averaging
> program'.
>
> Try this:
> let xs = [1 .. 1000]
> avg xs = sum xs / fromIntegral (length xs)
>
> Now increase 1000 to something large and your program crashes. However,
> you have averted this by using a windowed @stats@ approach -- almost
> that is.
>
> What you still need to do is to make @stats@ and @foldl@ more strict.
> For this you should check out @foldl'@ (note the prime) and bang
> patterns for @stats@. I'm intentionally not giving the solution!
>
> Once you have done that the program should work with arbitrarily long
> inputs.
>
> Gruss,
> Christian
>
> * Adam Sjøgren <a...@koldfront.dk> [22.07.2015 12:56]:
> >   Hi,
> >
> >
> > (Quick background: I'm used to Perl and have been trying to start
> > learning Haskell a number of times, using every book I could lay my hand
> > on, over the years, but never succeeded, maybe because I needed
> > something "real" to use it for. The main problem for me is usually going
> > from the nice, abstract concepts and into "making it work" in real life.
> > I don't know the idioms of the language...)
> >
> > I just had a go at using (Bio)Haskell to beat a simple C-program that
> > reads through a fastq-file and simply counts the number of sequences,
> > adds up the total length, and outputs those figures along with the
> > average length.
> >
> > I cobbled together this program using Google, Hoogle, documentation and
> > guessing:
> >
> >   import System.Environment
> >   import Bio.Sequence.FastQ
> >   import Bio.Core.Sequence
> >
> >   main = do
> >     [f] <- getArgs
> >     putStrLn . output . average . foldl stats (0, 0) =<< readIllumina f
> >       where stats (count, totalLength) s = (count+1,
> >               totalLength+toInteger(seqlength s))
> >
> >   average (count, totallength) = (count, totallength, t/c)
> >     where t = fromIntegral totallength :: Float
> >           c = fromIntegral count :: Float
> >
> >   output (count, length, average) = "Count " ++ show count ++ "\n" ++
> >     "Total " ++ show count ++ " records " ++ show length ++ " length " ++
> >     show average ++ " average"
> >
> > Very simple, and surely not quite the way someone fluent in the language
> > would write it.
> >
> > On my test-example, which was a fastq file with ~200K sequences of 50 bp
> > length each, the Haskell program beat the C program by a factor of 2+.
> >
> > Nice! (I'm speculating that the penalty for explicit memory handling is
> > a part of the difference; k*200K malloc+free calls must take some
> > time...)
> >
> > I have two questions:
> >
> >  a) If anyone has time to suggest how this small program would be written
> >     by someone fluent in Haskell, I would love to read and learn.
> >
> >  b) If I run my program on a big file, then I get "Stack space overflow:
> >     current size 8388608 bytes. Use `+RTS -Ksize -RTS' to increase it."
> >
> >     So I guess I am doing something that means that the entire file gets
> >     read into memory at the same time - any pointers on reasoning about
> >     this and fixing this are very welcome as well.
> >
> >     Is this something like I am fold'ing in the wrong direction?
> >
> >
> > I hope that using Haskell for some "real life" stuff will make it easier
> > for me to get into it this time around.
> >
> >
> >   Best regards,
> >
> >     Adam
> >
> >
> > P.S. I tried sending this back in March, but it doesn't seem to have
> >      gotten through, and I got distracted. Apologies if you have seen it
> >      before.
> >
> > --
> >  "You've got to be excited about what you are doing."         Adam Sjøgren
> >                                                          a...@koldfront.dk
>
