Re: Dipping my toes in Haskell, via Bio.Sequence.Fastq

Christian Höner zu Siederdissen Wed, 22 Jul 2015 06:29:11 -0700

Hi Adam,

welcome ;-)


Ketil is the author of the bio package and might have more detailed
comments. In Principle though you seem to have averted the most typical
problem people have when trying to write some kind of 'averaging
program'.

Try this:
let xs = [1 .. 1000]
avg xs = sum xs / length xs

Now increase 1000 to something large and you programs crashes. However,
you have averted this by using a windowed @stats@ approach -- almost
that is.

What you still need to do is to make @stats@ and @foldl@ more strict.
For this you should check out @foldl'@ (note the prime) and bang
patterns for @stats@. I'm intentionally not giving the solution!

Once you have done that the program should work with arbitrarily long
inputs.

Gruss,
Christian

* Adam Sjøgren <a...@koldfront.dk> [22.07.2015 12:56]:
>   Hi,
> 
> 
> (Quick background: I'm used to Perl and have been trying to start
> learning Haskell a number of times, using every book I could lay my hand
> on, over the years, but never succeeded, maybe because I needed
> something "real" to use it for. The main problem for me is usually going
> from the nice, abstract concepts and into "making it work" in real life.
> I don't know the idioms of the language...)
> 
> I just had a go at using (Bio)Haskell to beat a simple C-program that
> reads through a fastq-file and simply counts the number of sequences,
> adds up the total length, and outputs those figures along with the
> average length.
> 
> I cobbled together this program using Google, Hoogle, documentation and
> guessing:
> 
>   import System.Environment
>   import Bio.Sequence.FastQ
>   import Bio.Core.Sequence
> 
>   main = do
>     [f] <- getArgs
>     putStrLn . output . average . foldl stats (0, 0) =<< readIllumina f
>       where stats (count, totalLength) s = (count+1, 
> totalLength+toInteger(seqlength s))
> 
>   average (count, totallength) = (count, totallength, t/c)
>     where t = fromIntegral totallength :: Float 
>           c = fromIntegral count :: Float
> 
>   output (count, length, average) = "Count " ++ show count ++ "\n" ++ "Total 
> " ++ show count ++ " records " ++ show length ++ " length " ++ show average 
> ++ " average"
> 
> Very simple, and surely not quite the way someone fluent in the language
> would write it.
> 
> On my test-example, which was a fastq file with ~200K sequences of 50 bp
> length each, the Haskell program beat the C program by a factor of 2+.
> 
> Nice! (I'm speculating that the penalty for explicit memory handling is
> a part of the difference, k*200K malloc+free calls must take some time...)
> 
> I have two questions:
> 
>  a) If anyone has time to suggest how this small program would be written
>     by someone fluent in Haskell, I would love to read and learn.
> 
>  b) If I run my program on a big file, the I get "Stack space overflow:
>     current size 8388608 bytes. Use `+RTS -Ksize -RTS' to increase it."
> 
>     So I guess I am doing something that means that the entire file gets
>     read into memory at the same time - any pointers on reasoning about
>     this and fixing this are very welcome as well.
> 
>     Is this something like I am fold'ing in the wrong direction?
> 
> 
> I hope that using Haskell for some "real life" stuff will make it easier
> for me to get into it this time around.
> 
> 
>   Best regards,
> 
>     Adam
> 
> 
> P.S. I tried sending this back in March, but it doesn't seem to have
>      gotten through, and I got distracted. Apologies if you have seen it
>      before.
> 
> -- 
>  "You've got to be excited about what you are doing."         Adam Sjøgren
>                                                          a...@koldfront.dk

Re: Dipping my toes in Haskell, via Bio.Sequence.Fastq

Reply via email to