Let me be a bit more specific on why I proposed this: 

Three years ago I had a Bachelor student from Karlsruhe Institute of 
Technology. We worked on sequence alignment algorithms that can be parametrized 
by various kinds of uncertainty [1]. 

As you are probably aware, many protocols that utilize high-throughput 
sequencing alter the sequence prior to sequencing. For example, CLIP methods 
may or may not change C to T at cross-linking positions. Therefore the standard 
bowtie or bwa/bowtie algorithms don't work properly, as the assumption of 
sproradic and random mismatches is violated. To overcome this, people have come 
up with all kinds of scripts that build _around_ standard aligners, e.g. by 
aligning against a three-letter reference or by duplicating the reference with 
expected mutations pre-computed. 

Another use case is the ever growing list of known polymorphisms in the human 
and other reference genomes. Why should we first match against a single 
reference and later sort the mismatches into known polymorphisms and novel 
mutations, when one could incoprorate polymorphisms into the reference and have 
a one-pass algorithm? Even more so, we know that some polymorphisms are more 
abundant in certain regions of the world, so knowledge about the sample origin 
may have an effect on the alignment score. 

I claim that one could do a lot more resource-friendly and flexible alignments 
by tweaking the core alignment algorithm directly. In Haskell this can be done 
with relative ease (see below), but we lack a fast and compact full-text index. 
Instead of re-writing one, we should first try to use the existing, optimized, 
index structures. These come with efficient creation programs, so that we may 
focus on the alignment rather than the indexing. 

Hence reverse-engineering the FM-Index or Suffix Array may do the biohaskellers 
working with sequencing data a good deed. Apart from the raw representation, my 
student and I thought about a typeclass on top of which to formulate new 
alignment algorithms. We propose a class that essentially has the suffix trie 
as its free algebra: 

class Zero i where
  zero :: i a -> Bool -- needed to terminate recursion

class (MonadPlus m, Zero i) => Index m i where
  positions :: i a -> m Integer
  destruct  :: i a -> m (a, i a)

{-- The free structure is
FreeIndex m a =  (m Integer,m (a,i a))
~ m (Either Integer (a,i a))

Exercise: Write 
instance Index Maybe []
 --}

findMatches :: (Eq a, Index m i) => [a] -> i a -> m (i a)
findMathces [] i = return i
findMatches (x:xs) i = if zero i
  then mzero
  else do
    (y,i') <- destruct i
    if x==y then findSeq xs i' else mzero

matchPositions xs = positions <=< (findMatches xs)

For most existing index stuctures the monad m would be []. For an FM-index, the 
type i would be an interval in the Burrows-Wheeler transform, with 'zero' 
telling whether the interval is empty. Destructing such an interval produces 
the subintervals corresponding to single-letter extensions. 
Whatever the monad, if yor query is also uncertain in another monad t (e.g. 
probability distributions), you can still do alignments as long as t and m have 
a distributive law. 

Thus the proposed task is comprised of: 

1. Provide a Data.Binary instance for one of the index structures produced by 
bwa, bowtie or STAR. 
2. Write an Index instance for the Haskell representation. 

Olaf

[1] http://pp.ipd.kit.edu/publication.php?id=kuhnle13bachelorarbeit&lang=en

Reply via email to