Re: Dipping my toes in Haskell, via Bio.Sequence.Fastq

Olaf Klinke Wed, 22 Jul 2015 14:25:38 -0700

Dear Adam, 

your program looks almost exactly how I'd write it, except for the foldl' 
Christian mentioned. I also doubt that the Haskell program can really 
outperform a well-written C program on such a simple task. 
In my eyes, the strength of Haskell is hidden in the readIllumina function: 
Bioinformatics is 50% parsing and converting text formats. And Haskell excels 
at writing parsers and representing the parsed data in data structures. Another 
strength is very neatly displayed in your program: It is the modelling of data 
processing as a concatenation of functions. I frequently had a (pure) top-level 
function like this:


workflow :: String -> [String]
workflow = map someOutputFormatter . someProcessing . filter someCriterion
         . map parseMyFormat . lines

and the main function was then just:

main :: IO ()
main = (mapM_ putStrLn) . workflow =<< getContents
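
To make the shape concrete, here is a minimal runnable sketch of such a pipeline. The record type Rec and the bodies of parseMyFormat, someCriterion, someProcessing and someOutputFormatter are placeholders invented for illustration (a name and an integer score per line); only the composition itself is the point.

```haskell
import Data.List (sortOn)
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Hypothetical record: a name and an integer score, one record per line.
data Rec = Rec { name :: String, score :: Int }

-- Placeholder parser: first word is the name, second the score.
-- Unparsable or missing scores default to 0.
parseMyFormat :: String -> Rec
parseMyFormat l = case words l of
  (n : s : _) -> Rec n (fromMaybe 0 (readMaybe s))
  _           -> Rec l 0

-- Placeholder filter: keep records with a score of at least 10.
someCriterion :: Rec -> Bool
someCriterion r = score r >= 10

-- Placeholder processing step: sort by name.
-- Note that sorting needs all kept records in memory at once.
someProcessing :: [Rec] -> [Rec]
someProcessing = sortOn name

-- Placeholder formatter: tab-separated name and score.
someOutputFormatter :: Rec -> String
someOutputFormatter r = name r ++ "\t" ++ show (score r)

workflow :: String -> [String]
workflow = map someOutputFormatter . someProcessing . filter someCriterion
         . map parseMyFormat . lines

main :: IO ()
main = mapM_ putStrLn . workflow =<< getContents
```

Each stage can be swapped or tested on its own, since workflow is a pure function from the input text to the output lines.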

Just think of how ugly this looks in a (fantasy) imperative language: 

lines[] = chunksOfInputFromSTDIN();
i = 0;
o = 0;
while(i < length(lines)) {
        parsed = parseMyFormat(lines[i]);
        if (someCriterion(parsed)) {
                kept[o] = parsed;
                o++;
                }
        i++;
        }
intermediate[] = someProcessing(kept);
i = 0;
while(i < length(intermediate)) {
        outLine = someOutputFormatter(intermediate[i]);
        printf("%s\n",outLine);
        i++;
        }


Lazy I/O can be a problem with large input files. You can use the strict-io 
package or, which works in many cases, write the program as a stream 
processor, which at the same time keeps memory requirements to a minimum. 
Haskell is lazy by default, meaning it evaluates a function application only 
when its result is demanded. One way to demand it is to print the result. 
Suppose you have a line-based input format (GFF, SAM, Fastq, you name it). 
Suppose you have some other data to compare each record against (e.g. 
annotation). Write your program so that it reads the annotation and puts it 
into some kind of lookup table format. Then, manufacture a function that reads 
a line, does the work on it and prints the result. Then mapM_ this function. 
This way, only the current record and the lookup table will be in memory and 
the printing will force the execution of the worker function, record after 
record. 
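
The recipe above could be sketched as follows. This is only an illustration under assumptions: the annotation file is taken to hold an identifier followed by a description on each line, records arrive on standard input with the identifier as their first word, and the names Annotation, loadAnnotation and annotate are made up for this example. It uses Data.Map.Strict from the containers package for the lookup table.

```haskell
import qualified Data.Map.Strict as Map
import System.Environment (getArgs)

-- Lookup table from record identifier to its annotation text.
type Annotation = Map.Map String String

-- Read the (assumed) annotation format: "id description..." per line.
-- Lines without any words are skipped.
loadAnnotation :: FilePath -> IO Annotation
loadAnnotation path = do
  txt <- readFile path
  let table = Map.fromList
        [ (k, unwords rest) | (k : rest) <- map words (lines txt) ]
  -- Force the map so the table is built before any record is processed.
  Map.size table `seq` return table

-- Pure worker: append the annotation for the record's first word,
-- or "NA" if the identifier is not in the table.
annotate :: Annotation -> String -> String
annotate table line = case words line of
  (k : _) -> line ++ "\t" ++ Map.findWithDefault "NA" k table
  []      -> line

main :: IO ()
main = do
  args <- getArgs
  case args of
    [annFile] -> do
      table <- loadAnnotation annFile
      -- Printing each result forces the worker, one record at a time.
      getContents >>= mapM_ (putStrLn . annotate table) . lines
    _ -> putStrLn "usage: annotate ANNOTATION_FILE < records"
```

Because each output line is printed before the next input line is needed, only the table and the current record have to be held in memory.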

Cheers, 
Olaf
