Let me be a bit more specific on why I proposed this:
Three years ago I had a Bachelor student from Karlsruhe Institute of
Technology. We worked on sequence alignment algorithms that can be parametrized
by various kinds of uncertainty [1].
As you are probably aware, many protocols that utilize high-throughput
sequencing alter the sequence prior to sequencing. For example, CLIP methods
may or may not change C to T at cross-linking positions. Therefore the standard
bowtie or bwa/bowtie algorithms don't work properly, as the assumption of
sproradic and random mismatches is violated. To overcome this, people have come
up with all kinds of scripts that build _around_ standard aligners, e.g. by
aligning against a three-letter reference or by duplicating the reference with
expected mutations pre-computed.
Another use case is the ever growing list of known polymorphisms in the human
and other reference genomes. Why should we first match against a single
reference and later sort the mismatches into known polymorphisms and novel
mutations, when one could incoprorate polymorphisms into the reference and have
a one-pass algorithm? Even more so, we know that some polymorphisms are more
abundant in certain regions of the world, so knowledge about the sample origin
may have an effect on the alignment score.
I claim that one could do a lot more resource-friendly and flexible alignments
by tweaking the core alignment algorithm directly. In Haskell this can be done
with relative ease (see below), but we lack a fast and compact full-text index.
Instead of re-writing one, we should first try to use the existing, optimized,
index structures. These come with efficient creation programs, so that we may
focus on the alignment rather than the indexing.
Hence reverse-engineering the FM-Index or Suffix Array may do the biohaskellers
working with sequencing data a good deed. Apart from the raw representation, my
student and I thought about a typeclass on top of which to formulate new
alignment algorithms. We propose a class that essentially has the suffix trie
as its free algebra:
class Zero i where
zero :: i a -> Bool -- needed to terminate recursion
class (MonadPlus m, Zero i) => Index m i where
positions :: i a -> m Integer
destruct :: i a -> m (a, i a)
{-- The free structure is
FreeIndex m a = (m Integer,m (a,i a))
~ m (Either Integer (a,i a))
Exercise: Write
instance Index Maybe []
--}
findMatches :: (Eq a, Index m i) => [a] -> i a -> m (i a)
findMathces [] i = return i
findMatches (x:xs) i = if zero i
then mzero
else do
(y,i') <- destruct i
if x==y then findSeq xs i' else mzero
matchPositions xs = positions <=< (findMatches xs)
For most existing index stuctures the monad m would be []. For an FM-index, the
type i would be an interval in the Burrows-Wheeler transform, with 'zero'
telling whether the interval is empty. Destructing such an interval produces
the subintervals corresponding to single-letter extensions.
Whatever the monad, if yor query is also uncertain in another monad t (e.g.
probability distributions), you can still do alignments as long as t and m have
a distributive law.
Thus the proposed task is comprised of:
1. Provide a Data.Binary instance for one of the index structures produced by
bwa, bowtie or STAR.
2. Write an Index instance for the Haskell representation.
Olaf
[1] http://pp.ipd.kit.edu/publication.php?id=kuhnle13bachelorarbeit&lang=en