Hi,
(Quick background: I'm used to Perl and have been trying to start
learning Haskell a number of times, using every book I could lay my hand
on, over the years, but never succeeded, maybe because I needed
something "real" to use it for. The main problem for me is usually going
from the nice, abstract concepts and into "making it work" in real life.
I don't know the idioms of the language...)
I just had a go at using (Bio)Haskell to beat a simple C-program that
reads through a fastq-file and simply counts the number of sequences,
adds up the total length, and outputs those figures along with the
average length.
I cobbled together this program using Google, Hoogle, documentation and
guessing:
import System.Environment
import Bio.Sequence.FastQ
import Bio.Core.Sequence
main = do
[f] <- getArgs
putStrLn . output . average . foldl stats (0, 0) =<< readIllumina f
where stats (count, totalLength) s = (count+1,
totalLength+toInteger(seqlength s))
average (count, totallength) = (count, totallength, t/c)
where t = fromIntegral totallength :: Float
c = fromIntegral count :: Float
output (count, length, average) = "Count " ++ show count ++ "\n" ++ "Total "
++ show count ++ " records " ++ show length ++ " length " ++ show average ++ "
average"
Very simple, and surely not quite the way someone fluent in the language
would write it.
On my test-example, which was a fastq file with ~200K sequences of 50 bp
length each, the Haskell program beat the C program by a factor of 2+.
Nice! (I'm speculating that the penalty for explicit memory handling is
a part of the difference, k*200K malloc+free calls must take some time...)
I have two questions:
a) If anyone has time to suggest how this small program would be written
by someone fluent in Haskell, I would love to read and learn.
b) If I run my program on a big file, the I get "Stack space overflow:
current size 8388608 bytes. Use `+RTS -Ksize -RTS' to increase it."
So I guess I am doing something that means that the entire file gets
read into memory at the same time - any pointers on reasoning about
this and fixing this are very welcome as well.
Is this something like I am fold'ing in the wrong direction?
I hope that using Haskell for some "real life" stuff will make it easier
for me to get into it this time around.
Best regards,
Adam
P.S. I tried sending this back in March, but it doesn't seem to have
gotten through, and I got distracted. Apologies if you have seen it
before.
--
"You've got to be excited about what you are doing." Adam Sjøgren
[email protected]