Let me be a bit more specific on why I proposed this: Three years ago I had a Bachelor student from Karlsruhe Institute of Technology. We worked on sequence alignment algorithms that can be parametrized by various kinds of uncertainty [1].
As you are probably aware, many protocols that utilize high-throughput sequencing alter the sequence prior to sequencing. For example, CLIP methods may or may not change C to T at cross-linking positions. Therefore the standard bowtie or bwa/bowtie algorithms don't work properly, as the assumption of sproradic and random mismatches is violated. To overcome this, people have come up with all kinds of scripts that build _around_ standard aligners, e.g. by aligning against a three-letter reference or by duplicating the reference with expected mutations pre-computed. Another use case is the ever growing list of known polymorphisms in the human and other reference genomes. Why should we first match against a single reference and later sort the mismatches into known polymorphisms and novel mutations, when one could incoprorate polymorphisms into the reference and have a one-pass algorithm? Even more so, we know that some polymorphisms are more abundant in certain regions of the world, so knowledge about the sample origin may have an effect on the alignment score. I claim that one could do a lot more resource-friendly and flexible alignments by tweaking the core alignment algorithm directly. In Haskell this can be done with relative ease (see below), but we lack a fast and compact full-text index. Instead of re-writing one, we should first try to use the existing, optimized, index structures. These come with efficient creation programs, so that we may focus on the alignment rather than the indexing. Hence reverse-engineering the FM-Index or Suffix Array may do the biohaskellers working with sequencing data a good deed. Apart from the raw representation, my student and I thought about a typeclass on top of which to formulate new alignment algorithms. We propose a class that essentially has the suffix trie as its free algebra: class Zero i where zero :: i a -> Bool -- needed to terminate recursion class (MonadPlus m, Zero i) => Index m i where positions :: i a -> m Integer destruct :: i a -> m (a, i a) {-- The free structure is FreeIndex m a = (m Integer,m (a,i a)) ~ m (Either Integer (a,i a)) Exercise: Write instance Index Maybe [] --} findMatches :: (Eq a, Index m i) => [a] -> i a -> m (i a) findMathces [] i = return i findMatches (x:xs) i = if zero i then mzero else do (y,i') <- destruct i if x==y then findSeq xs i' else mzero matchPositions xs = positions <=< (findMatches xs) For most existing index stuctures the monad m would be []. For an FM-index, the type i would be an interval in the Burrows-Wheeler transform, with 'zero' telling whether the interval is empty. Destructing such an interval produces the subintervals corresponding to single-letter extensions. Whatever the monad, if yor query is also uncertain in another monad t (e.g. probability distributions), you can still do alignments as long as t and m have a distributive law. Thus the proposed task is comprised of: 1. Provide a Data.Binary instance for one of the index structures produced by bwa, bowtie or STAR. 2. Write an Index instance for the Haskell representation. Olaf [1] http://pp.ipd.kit.edu/publication.php?id=kuhnle13bachelorarbeit&lang=en