Hi all, I have another suggestion for new or enhanced EMBOSS applications, again related to the existing pairwise sequence alignment tools needle and water.
The FASTQ file format (among others) stores quality scores, often PHRED scores, representing the probability of an error in the associated nucleotide. Solexa/Illumina machines also provide another file with a more precise breakdown of the likelihood of each of the four bases. In some cases both sequences could have probability scores (e.g. when trying to align the ends of contigs to each other), but often one sequence will be taken as fact (e.g. when mapping reads onto a reference).

It is possible to take these probabilities into account when scoring the matches in needle (or water) by using a probabilistic version of the Needleman-Wunsch sequence alignment algorithm (or a probabilistic Smith-Waterman). As an example of this idea, did you (Peter R) see the GNUMAP talk/poster at ISMB 2009? See http://dna.cs.byu.edu/gnumap/

I am aware of people using EMBOSS tools (I assume water) to identify (known) adaptor sequences in raw Solexa/Illumina data. I considered doing something similar myself when trying to remove primer sequences from 454 data. Such a pipeline using the current EMBOSS water would do this matching purely on the called nucleotides (ignoring the qualities), which isn't ideal. Upgrading to a probabilistic version of water should be an improvement.

Peter C.
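
P.S. To make the idea concrete, here is a rough Python sketch of quality-aware scoring. It is not GNUMAP's method and not anything that exists in EMBOSS. The function names (phred_to_error_prob, expected_score, local_align_score), the match/mismatch/gap values, the linear gap penalty, and the assumption that a miscalled base is equally likely to be any of the other three are all hypothetical, chosen only for illustration. It assumes true PHRED scores (P = 10^(-Q/10)), treats the reference as fact, and weights each match/mismatch score by the probability that the read base really is the reference base.

```python
def phred_to_error_prob(q):
    """Convert a PHRED quality score to the probability the base call is wrong."""
    return 10.0 ** (-q / 10.0)


def expected_score(read_base, qual, ref_base, match=5.0, mismatch=-4.0):
    """Expected substitution score for one read base against a trusted reference base.

    match and mismatch are illustrative values, not taken from any EMBOSS matrix.
    """
    p_err = phred_to_error_prob(qual)
    # Assumption: a miscalled base is equally likely to be any of the other three.
    if read_base == ref_base:
        p_same = 1.0 - p_err   # the call is right, so the true base matches
    else:
        p_same = p_err / 3.0   # the call is wrong AND the true base is ref_base
    return p_same * match + (1.0 - p_same) * mismatch


def local_align_score(read, quals, ref, gap=-6.0):
    """Best Smith-Waterman style local score, using a simple linear gap penalty."""
    prev = [0.0] * (len(ref) + 1)
    best = 0.0
    for i in range(1, len(read) + 1):
        curr = [0.0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            diag = prev[j - 1] + expected_score(read[i - 1], quals[i - 1], ref[j - 1])
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0.0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Same read and reference; only the quality of the mismatching base differs.
    print(local_align_score("ACGT", [40, 40, 40, 40], "ACTT"))
    print(local_align_score("ACGT", [40, 40, 2, 40], "ACTT"))
```

With the read ACGT against the reference ACTT, giving the mismatching G a quality of 2 rather than 40 yields a higher local score, because a low-confidence mismatch is penalised less. A fixed-nucleotide comparison would score both cases identically.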
