Hi Wolfgang, If I understand correctly you want to compare a collection of "sample" sequences against a "reference" sequence.
There are many ways to "compare" 2 sequences and to get a score from this comparison: Do you want to allow indels or just substitutions (aka replacements)? Do you want the alignment to be global or local? Do you want the score to be determined by a custom substitution matrix or just to reflect the "edit distance" between the 2 sequences (i.e. the minimum number of replacements + insertions + deletions required to transform sequence 1 into sequence 2). If you are trying to get the edit distance between the sample sequences and the reference sequence then you might want to look at this post: https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2008-July/000067.html Note that the 'type' argument is omitted in this post so pairwiseAlignment() is used to perform "global" alignments, that is, the score that is returned indicates the min nb of edit operations needed to transform the *complete* sequences in 'pattern' (your "sample" sequences) into the *complete* 'subject' (your "reference" sequence). One nice property of "global" alignment is that it is a symetric operation on the 2 sequences to align. You can take advantage of this by inverting the roles of the "sample" and the "reference" sequences so 'samp' can be the pattern and 'ref' the subject. Then you can benefit from the fact that pairwiseAlignment() is vectorized on the pattern. But then you say that you would primarily be interested in finding where a given nucleotide differs from the query. This suggests that you are in fact looking for a much simpler form of alignment where only mismatches (i.e. replacements) are allowed but no indels. Also you say that your sample-sequences may start or end slightly later/earlier. This means that you can use the matchPattern() function for your problem. > m2 <- matchPattern(samp[[2]], ref, max.mismatch=3) > m2 Views on a 19-letter DNAString subject subject: ACTTCACCAGCTCCCTGGC views: start end width [1] 1 18 18 [ACTTCACCAGCTCCCTGG] Then use the mismatch() function to get the position(s) of the mismatche(s): > mismatch(samp[[2]], m2)[[1]] [1] 6 13 Cheers, H. Wolfgang Raffelsberger wrote: > Dear list, > > I would like to count the occurrence of (mostly single) nucleotide > polymorphisms from nucleotide sequences. > I got across the Biostrings package and pairwiseAlignment() that allows > me to get closer to what I want but > 1) I noticed that the score produced from pairwiseAlignment() is quite > different to other implementations of the Needlaman-Wunsch alogorithm > (eg in EMBOSS) > 2) the score is not directly the information I 'm looking for since it's > a mixture of the gaps & mismatches (and I don't see if/how one could > modify that). > > However, I would primarily be interested in finding where a given > nucleotide differs from the query (from a pairwise alignment) to some > statistics on them, ie at which position I get which other element > instead. Note, that my sample-sequences may start or end slightly > later/earlier. > Any suggestions ? > > Sample code might look like (of course, my real sequences are longer ...): > > ref <- DNAString("ACTTCACCAGCTCCCTGGC") > samp <- > DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG")) > # the 3rd one has no mutations, it's simply shorter ... > pairwiseAlignment(ref, samp[[1]], substitutionMatrix = mat, gapOpening > = -5, gapExtension = -2) > alignScores <- numeric() > for(i in 1:3) alignScores[i] <- pairwiseAlignment(ref, samp[[i]], > substitutionMatrix = mat, gapOpening = -5, gapExtension = -2, scoreOnly=T) > alignScores # the 3rd sequence without mismatches gets worst score > > > (Based on a previous post on BioC) I just subscribed to > [email protected], but I don't know if I don't mange to > search the previous mail archives (on http://search.gmane.org/) since I > keep getting (general) Bioconductor messages. > > Thank's in advance, > Wolfgang > > > By the way, if that matters, I'm (still) running R-2.7.2 >> sessionInfo() > R version 2.7.2 (2008-08-25) > i386-pc-mingw32 > > locale: > LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252 > > > attached base packages: > [1] stats graphics grDevices datasets tcltk utils > methods base > other attached packages: > [1] Biostrings_2.8.18 svSocket_0.9-5 svIO_0.9-5 > R2HTML_1.59 svMisc_0.9-5 svIDE_0.9-5 > loaded via a namespace (and not attached): > [1] tools_2.7.2 > > > . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . > Wolfgang Raffelsberger, PhD > Laboratoire de BioInformatique et Génomique Intégratives > CNRS UMR7104, IGBMC 1 rue Laurent Fries, 67404 Illkirch Strasbourg, > France > Tel (+33) 388 65 3300 Fax (+33) 388 65 3276 > wolfgang.raffelsberger (at) igbmc.fr > > _______________________________________________ > Bioc-sig-sequencing mailing list > [email protected] > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
