Dear Adam, your program looks almost exaclty how I'd write it, expect for the foldl' Christian mentioned. I also doubt that the Haskell program can really outperform a well-written C program on such a simple task. In my eyes, the strength of Haskell is hidden in the readIllumina function: Bioinformatics is 50% parsing and converting text formats. And Haskell excels at writing parsers and representing the parsed data in data structures. Another strength is very neatly displayed in your program: It is the modelling of data processing as a concatenation of functions. I frequently had a (pure) top-level function like this:
workflow :: String -> [String] workflow = (map someOutputFormatter).someProcessing.(filter someCriterion).(map parseMyFormat).lines and the main function was then just: main :: IO () main = (mapM_ putStrLn) . workflow =<< getContents Just think of how ugly this looks in a (fantasy) imperative language: lines[] = chunksOfInputFromSTDIN(); i = 0; o = 0; while(i < length(lines)) { parsed = parseMyFormat(lines[i]); if (someCriterion(parsed)) { kept[o] = parsed; o++; } i++; } intermediate[] = someProcessing(kept); i = 0; while(i < length(intermediate)) { outLine = someOutputFormatter(intermediate[i]); printf("%s\n",outLine); i++; } Lazy I/O can be a problem with large input files. You can use the strict-io package or, which works for many cases, write the program as a stream processor, which at the same time keeps memory requirement at a minimum. Haskell is lazy by default, meaning it applies a function only when demanded. One way do demand this is to print its result. Suppose you have a line-based input format (GFF, SAM, Fastq, you name it). Suppose you have some other data to compare each record against (e.g. annotation). Write your program so that it reads the annotation and puts it into some kind of lookup table format. Then, manufacture a function that reads a line, does the work on it and prints the result. Then mapM_ this function. This way, only the current record and the lookup table will be in memory and the printing will force the execution of the worker function, record after record. Cheers, Olaf