Here are some more tips on strictness and Haskell's evaluation model:

http://chimera.labs.oreilly.com/books/1230000000929/ch02.html#sec_par-eval-whnf
https://hackhands.com/lazy-evaluation-works-haskell/
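
For what it's worth, here is a minimal sketch of what the foldl' plus bang patterns hint quoted below might look like (spoiler warning if you want to work it out yourself first). It does not use the bio package: a plain list of Int lengths stands in for the records returned by readIllumina, and the name summarise is made up for this example, so treat it as an illustration under those assumptions rather than a drop-in fix.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import Data.List (foldl')

-- Accumulator step. The bang patterns force both counters of the
-- incoming accumulator, so no long chain of (+) thunks can build up
-- inside the tuple.
stats :: (Int, Integer) -> Int -> (Int, Integer)
stats (!count, !total) len = (count + 1, total + toInteger len)

-- Hypothetical helper: summarise a list of sequence lengths in one pass.
-- The [Int] argument is an assumption standing in for the sequence
-- records that readIllumina would deliver.
summarise :: [Int] -> (Int, Integer, Double)
summarise lens = (count, total, avg)
  where
    (count, total) = foldl' stats (0, 0) lens
    avg
      | count == 0 = 0
      | otherwise  = fromIntegral total / fromIntegral count

main :: IO ()
main = print (summarise (replicate 5000000 50))
```

foldl' evaluates the accumulator to weak head normal form at every step, which for a tuple only exposes the (,) constructor. The bangs in stats are what force the two numbers inside it. With the lazy foldl and no bangs, the same fold builds one large unevaluated expression that is only collapsed at the end, which is the usual cause of the "Stack space overflow" message.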

On Wed, Jul 22, 2015 at 6:20 AM, Christian Höner zu Siederdissen <choe...@bioinf.uni-leipzig.de> wrote:

> Hi Adam,
>
> welcome ;-)
>
> Ketil is the author of the bio package and might have more detailed
> comments. In principle, though, you seem to have averted the most typical
> problem people have when trying to write some kind of 'averaging
> program'.
>
> Try this:
>
>     let xs = [1 .. 1000]
>         avg xs = sum xs / fromIntegral (length xs)
>
> Now increase 1000 to something large and your program crashes. However,
> you have averted this by using a windowed @stats@ approach -- almost,
> that is.
>
> What you still need to do is to make @stats@ and @foldl@ more strict.
> For this you should check out @foldl'@ (note the prime) and bang
> patterns for @stats@. I'm intentionally not giving the solution!
>
> Once you have done that, the program should work with arbitrarily long
> inputs.
>
> Gruss,
> Christian
>
> * Adam Sjøgren <a...@koldfront.dk> [22.07.2015 12:56]:
> > Hi,
> >
> > (Quick background: I'm used to Perl and have been trying to start
> > learning Haskell a number of times, using every book I could lay my hands
> > on, over the years, but never succeeded, maybe because I needed
> > something "real" to use it for. The main problem for me is usually going
> > from the nice, abstract concepts and into "making it work" in real life.
> > I don't know the idioms of the language...)
> >
> > I just had a go at using (Bio)Haskell to beat a simple C program that
> > reads through a fastq file and simply counts the number of sequences,
> > adds up the total length, and outputs those figures along with the
> > average length.
> >
> > I cobbled together this program using Google, Hoogle, documentation and
> > guessing:
> >
> >   import System.Environment
> >   import Bio.Sequence.FastQ
> >   import Bio.Core.Sequence
> >
> >   main = do
> >     [f] <- getArgs
> >     putStrLn . output . average . foldl stats (0, 0) =<< readIllumina f
> >     where stats (count, totalLength) s =
> >             (count+1, totalLength+toInteger(seqlength s))
> >
> >   average (count, totallength) = (count, totallength, t/c)
> >     where t = fromIntegral totallength :: Float
> >           c = fromIntegral count :: Float
> >
> >   output (count, length, average) =
> >     "Count " ++ show count ++ "\n" ++
> >     "Total " ++ show count ++ " records " ++ show length ++
> >     " length " ++ show average ++ " average"
> >
> > Very simple, and surely not quite the way someone fluent in the language
> > would write it.
> >
> > On my test example, which was a fastq file with ~200K sequences of 50 bp
> > length each, the Haskell program beat the C program by a factor of 2+.
> >
> > Nice! (I'm speculating that the penalty for explicit memory handling is
> > a part of the difference, k*200K malloc+free calls must take some
> > time...)
> >
> > I have two questions:
> >
> > a) If anyone has time to suggest how this small program would be written
> >    by someone fluent in Haskell, I would love to read and learn.
> >
> > b) If I run my program on a big file, then I get "Stack space overflow:
> >    current size 8388608 bytes. Use `+RTS -Ksize -RTS' to increase it."
> >
> >    So I guess I am doing something that means that the entire file gets
> >    read into memory at the same time - any pointers on reasoning about
> >    this and fixing this are very welcome as well.
> >
> >    Is this something like I am fold'ing in the wrong direction?
> >
> > I hope that using Haskell for some "real life" stuff will make it easier
> > for me to get into it this time around.
> >
> > Best regards,
> >
> >   Adam
> >
> > P.S. I tried sending this back in March, but it doesn't seem to have
> > gotten through, and I got distracted. Apologies if you have seen it
> > before.
> >
> > --
> > "You've got to be excited about what you are doing."
> > Adam Sjøgren
> > a...@koldfront.dk