Peter C. wrote:
> I have another suggestion for new or enhanced EMBOSS applications,
> again related to the existing pairwise sequence alignment tools needle
> and water.
>
> The FASTQ file format (or others) contains quality scores (often PHRED
> scores) representing the probability of an error in the associated
> nucleotide. Solexa/Illumina machines also provide another file with a
> more precise breakdown of the likelihood of each of the four bases.
>
> In some cases both sequences could have probability scores (e.g.
> trying to align the ends of contigs to each other), but often one
> sequence will be taken as fact (e.g. mapping reads onto a reference).
>
> It is possible to take these probabilities into account when
> considering the matches in needle (or water) by using a probabilistic
> version of the Needleman-Wunsch sequence alignment algorithm (or a
> probabilistic Smith-Waterman).
>
> As an example of this idea, did you (Peter R) see the GNUMAP
> talk/poster at ISMB 2009? See http://dna.cs.byu.edu/gnumap/

I saw the talk, and was wondering about their algorithm. They did not
have a separate treatment for gaps in the reads and the consensus, which
seemed like an obvious extension.

> I am aware of people using EMBOSS tools (I assume water) to identify
> (known) adaptor sequences in raw Solexa/Illumina data. I considered
> doing something similar myself when trying to remove primer sequences
> from 454 data. Such a pipeline using the current EMBOSS water would be
> doing this matching at a purely fixed nucleotide level (ignoring the
> qualities), which isn't ideal. Upgrading to a probabilistic version of
> water should be an improvement.

Would be interesting. Where can I look up adaptor calling methods?

Peter Rice

_______________________________________________
EMBOSS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/emboss
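
To make the idea in the quoted suggestion concrete, here is a minimal
Python sketch of a quality-aware Needleman-Wunsch score for the case
where the reference is taken as fact and only the read has PHRED
qualities. It is not the GNUMAP algorithm and not how needle or water
work today. All function names, the match/mismatch/gap values, the
linear gap penalty, and the assumption that a miscalled base is equally
likely to be any of the other three are illustrative choices.

```python
def phred_to_error(q):
    """Convert a PHRED score Q into an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_phred33(qual):
    """Decode a Sanger-style (ASCII offset 33) FASTQ quality string."""
    return [ord(c) - 33 for c in qual]


def expected_score(read_base, q, ref_base, match=1.0, mismatch=-1.0):
    """Expected substitution score for one read base against a fixed reference base.

    Assumption: if the call is wrong, the true base is one of the other
    three with equal probability (p/3 each).
    """
    p = phred_to_error(q)
    # Probability that the true read base equals the reference base.
    p_same = (1 - p) if read_base == ref_base else p / 3
    return p_same * match + (1 - p_same) * mismatch


def quality_aware_global_score(read, quals, ref, gap=-2.0):
    """Global alignment score using expected substitution scores.

    Uses a linear gap penalty (hypothetical value) and keeps only two
    rows of the dynamic programming matrix, so it returns the score
    but not the alignment itself.
    """
    m = len(ref)
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(read) + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            sub = expected_score(read[i - 1], quals[i - 1], ref[j - 1])
            cur[j] = max(prev[j - 1] + sub,  # align read base to reference base
                         prev[j] + gap,      # gap in the reference
                         cur[j - 1] + gap)   # gap in the read
        prev = cur
    return prev[m]
```

With this scoring, a mismatch at a low-quality read position costs less
than one at a high-quality position, which is the behaviour the quoted
suggestion asks for. A local (Smith-Waterman style) variant for adaptor
matching would additionally clamp each cell at zero and track the
maximum cell.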
