Hi,

(Quick background: I'm used to Perl and have been trying to start
learning Haskell a number of times, using every book I could lay my hand
on, over the years, but never succeeded, maybe because I needed
something "real" to use it for. The main problem for me is usually going
from the nice, abstract concepts and into "making it work" in real life.
I don't know the idioms of the language...)

I just had a go at using (Bio)Haskell to beat a simple C-program that
reads through a fastq-file and simply counts the number of sequences,
adds up the total length, and outputs those figures along with the
average length.

I cobbled together this program using Google, Hoogle, documentation and
guessing:

  import System.Environment
  import Bio.Sequence.FastQ
  import Bio.Core.Sequence

  main = do
    [f] <- getArgs
    putStrLn . output . average . foldl stats (0, 0) =<< readIllumina f
      where stats (count, totalLength) s = (count+1, 
totalLength+toInteger(seqlength s))

  average (count, totallength) = (count, totallength, t/c)
    where t = fromIntegral totallength :: Float 
          c = fromIntegral count :: Float

  output (count, length, average) = "Count " ++ show count ++ "\n" ++ "Total " 
++ show count ++ " records " ++ show length ++ " length " ++ show average ++ " 
average"

Very simple, and surely not quite the way someone fluent in the language
would write it.

On my test-example, which was a fastq file with ~200K sequences of 50 bp
length each, the Haskell program beat the C program by a factor of 2+.

Nice! (I'm speculating that the penalty for explicit memory handling is
a part of the difference, k*200K malloc+free calls must take some time...)

I have two questions:

 a) If anyone has time to suggest how this small program would be written
    by someone fluent in Haskell, I would love to read and learn.

 b) If I run my program on a big file, the I get "Stack space overflow:
    current size 8388608 bytes. Use `+RTS -Ksize -RTS' to increase it."

    So I guess I am doing something that means that the entire file gets
    read into memory at the same time - any pointers on reasoning about
    this and fixing this are very welcome as well.

    Is this something like I am fold'ing in the wrong direction?


I hope that using Haskell for some "real life" stuff will make it easier
for me to get into it this time around.


  Best regards,

    Adam


P.S. I tried sending this back in March, but it doesn't seem to have
     gotten through, and I got distracted. Apologies if you have seen it
     before.

-- 
 "You've got to be excited about what you are doing."         Adam Sjøgren
                                                         a...@koldfront.dk

Reply via email to