Wolfgang,
I'm glad to hear you are finding the Biostrings package to be useful.
The pairwiseAlignment function has some helper functions that would be
useful to a SNP analysis. In particular, you may want to take a look at
the mismatchTable, mismatchSummary, summary, and consensusMatrix
functions to see if they provide the summaries you are looking for. I
also recommend you leverage the vectorized capabilities of the
pairwiseAlignment function since it will be easier to maintain your
code, provide faster execution time, and make it easier to summarize
your results. I have modified your code slightly to show you what you
can get "out of the box" from the Biostrings package without writing too
much code:
library(Biostrings)
# the substitution matrix
mat <- matrix(-3, nrow = 5, ncol = 5)
diag(mat) <- 1
mat[,5] <- 0; mat[5,] <- 0
rownames(mat) <- colnames(mat) <- DNA_ALPHABET[c(1:4,15)]
# the consensus reference sequence
ref <- DNAString("ACTTCACCAGCTCCCTGGC")
# the sample sequences; the 3rd one has no mutations, it's simply
shorter ...
samp <-
DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG"))
# vectorized alignments using the ref as the "subject"
alignments <-
pairwiseAlignment(samp, ref, substitutionMatrix=mat,
gapOpening= -5, gapExtension= -2, scoreOnly=FALSE,
type="global")
# summaries useful for SNP analysis
summary(alignments)
mismatchSummary(alignments)
mismatchTable(alignments)
consensusMatrix(alignments, baseOnly = TRUE)
Patrick
Wolfgang Raffelsberger wrote:
Hi Herve,
thank you for your helpful answer.
Initially I was wondering if there are already some functions/packages
directly for my tasks available, as Bioconducter and CRAN are very
strong resources.
Since your Biostrings package provides all the basic functionalities,
I've built a few (additional) functions (on top of Biostrings) doing the
job.
To be more precise, yes, I'm interested in detecting all alterations
(mutations+ insertions + deletions). However, within the project's
biological context mutations are the prime focus. Right now I'm rather
in an exploratory/pilot phase for the nucleotide-alteration/SNP
counting, so I'll probably improve & expand some of my quickly written
code.
For the sequence alignments I prefer global, since I'd rather want the
'best' match possible (ie the default setting with pairwiseAlignment()
). Concerning the choice of (custom) substitution matrices I'm not very
decided. Right now I'm using very generic ones (like in the Biostrings
vignette). Recently I've learned that there are several 'versions' of
the Needleman-Wunsch algorithm (ie various improvements to the original
one), so I'm not any more surprised of seeing different scores from
different implementations...
Anyway, once it's clear which sample/target (ie sequencer output) should
be compared with which reference sequence (ie the consenus genomic
sequence of the very region to be tested) I don't use the score any more
and luckily my test-data contain few completely unexpected 'alien'
sequences. For this initial step now I use matchPattern() - thank's for
your hint- where I primarily need to test if sequences have to be read
as inverse/complement.
Again, after a bit more digging I found the corresponding functions and
I'm quite happy with the performance !
At the core of SNP/alterations counting I dissect the pairwise alignment
and record what kind of changes have happened (at this level I noted
some changes to the new BioC 2.3 implementation ...see also code below).
Here I use (so far) strsplit() to make two vectors with individual
nucleotides (and gaps) to extract the positions&changes from each of the
lines of the pairwise alignment (see code below). I guess there may be
other more elegant/faster ways to do this...
As I'm primarily interested in 'simple mutations', so far I took a
shortcut consisting in simply looking at the first of inserted
nucleotide in cases of insertions (some additional lines not part of the
code below). Maybe later I'll get back to expand this ... After all I
end up with a simple matrix where each row corresponds to a position of
the reference sequence and the columns contain counts of the
events/alterations observed (where at least I'm able to tell if any
insertions have happened at all) where I can run classical statistical
tests.
Cheers,
Wolfgang
# here the way I 'parse' pairwise alignments :
library(Biostrings)
mat <- matrix(-3, nrow = 5, ncol = 5)
diag(mat) <- 1
mat[,5] <- 0; mat[5,] <- 0
rownames(mat) <- colnames(mat) <- DNA_ALPHABET[c(1:4,15)]
ref <- DNAString("ACTTCACCAGCTCCCTGGC") # the consensus
reference sequence
samp <-
DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG"))
# the 3rd one has no mutations, it's simply shorter ...
# here (simplified) I already know that my sequence is oriented correctly
alignRes <- matrix(nr=nchar(ref),nc=length(samp))
for(i in 1:length(samp)) {
x <- pairwiseAlignment(as.character(ref), samp[[i]],
substitutionMatrix= mat, gapOpening= -5, gapExtension= -2,scoreOnly=F,
type="global")
cutsamp <- unlist(strsplit(as.character([EMAIL PROTECTED]),""))
cutref <- unlist(strsplit(as.character([EMAIL PROTECTED]),""))
if(nchar(ref) != length(cutref)) {
origRefNt <- unlist(strsplit(as.character(ref),""))
offS <- 1
while(origRefNt[1+offS] != cutref[1]) offS <- offS+1
}
out <- rep("Z",nchar(ref)) # use
"Z" for (default) constant
if( !identical(cutsamp,cutref)) {
mutPos <- !(cutref== cutsamp)
out[mutPos+offS] <- cutsamp[mutPos+offS]
if(sum(cutref=="-")>0) out <- out[-which(cutref=="-")] # remove
gaps in reference
# gaps could/can be retained at this stage only to a limited degree
by additional subsitutions
}
alignRes[,i] <- out
}
sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32
locale:
LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.10.0 IRanges_1.0.1
loaded via a namespace (and not attached):
[1] grid_2.8.0 lattice_0.17-15 Matrix_0.999375-16
Herve Pages a écrit :
Hi Wolfgang,
If I understand correctly you want to compare a collection of "sample"
sequences against a "reference" sequence.
There are many ways to "compare" 2 sequences and to get a score from
this
comparison: Do you want to allow indels or just substitutions (aka
replacements)?
Do you want the alignment to be global or local? Do you want the
score to be
determined by a custom substitution matrix or just to reflect the "edit
distance" between the 2 sequences (i.e. the minimum number of
replacements
+ insertions + deletions required to transform sequence 1 into
sequence 2).
If you are trying to get the edit distance between the sample sequences
and the reference sequence then you might want to look at this post:
https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2008-July/000067.html
Note that the 'type' argument is omitted in this post so
pairwiseAlignment()
is used to perform "global" alignments, that is, the score that is
returned
indicates the min nb of edit operations needed to transform the
*complete*
sequences in 'pattern' (your "sample" sequences) into the *complete*
'subject'
(your "reference" sequence).
One nice property of "global" alignment is that it is a symetric
operation on
the 2 sequences to align. You can take advantage of this by inverting
the roles
of the "sample" and the "reference" sequences so 'samp' can be the
pattern
and 'ref' the subject. Then you can benefit from the fact that
pairwiseAlignment() is vectorized on the pattern.
But then you say that you would primarily be interested in finding where
a given nucleotide differs from the query. This suggests that you are
in fact
looking for a much simpler form of alignment where only mismatches (i.e.
replacements) are allowed but no indels. Also you say that your
sample-sequences may start or end slightly later/earlier.
This means that you can use the matchPattern() function for your
problem.
> m2 <- matchPattern(samp[[2]], ref, max.mismatch=3)
> m2
Views on a 19-letter DNAString subject
subject: ACTTCACCAGCTCCCTGGC
views:
start end width
[1] 1 18 18 [ACTTCACCAGCTCCCTGG]
Then use the mismatch() function to get the position(s) of the
mismatche(s):
> mismatch(samp[[2]], m2)[[1]]
[1] 6 13
Cheers,
H.
Wolfgang Raffelsberger wrote:
Dear list,
I would like to count the occurrence of (mostly single) nucleotide
polymorphisms from nucleotide sequences.
I got across the Biostrings package and pairwiseAlignment() that allows
me to get closer to what I want but
1) I noticed that the score produced from pairwiseAlignment() is quite
different to other implementations of the Needlaman-Wunsch alogorithm
(eg in EMBOSS)
2) the score is not directly the information I 'm looking for since
it's
a mixture of the gaps & mismatches (and I don't see if/how one could
modify that).
However, I would primarily be interested in finding where a given
nucleotide differs from the query (from a pairwise alignment) to some
statistics on them, ie at which position I get which other element
instead. Note, that my sample-sequences may start or end slightly
later/earlier.
Any suggestions ?
Sample code might look like (of course, my real sequences are longer
...):
ref <- DNAString("ACTTCACCAGCTCCCTGGC")
samp <-
DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG"))
# the 3rd one has no mutations, it's simply shorter ...
pairwiseAlignment(ref, samp[[1]], substitutionMatrix = mat, gapOpening
= -5, gapExtension = -2)
alignScores <- numeric()
for(i in 1:3) alignScores[i] <- pairwiseAlignment(ref, samp[[i]],
substitutionMatrix = mat, gapOpening = -5, gapExtension = -2,
scoreOnly=T)
alignScores # the 3rd sequence without mismatches gets worst score
(Based on a previous post on BioC) I just subscribed to
[email protected], but I don't know if I don't mange to
search the previous mail archives (on http://search.gmane.org/) since I
keep getting (general) Bioconductor messages.
Thank's in advance,
Wolfgang
By the way, if that matters, I'm (still) running R-2.7.2
sessionInfo()
R version 2.7.2 (2008-08-25)
i386-pc-mingw32
locale:
LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices datasets tcltk utils
methods base other attached packages:
[1] Biostrings_2.8.18 svSocket_0.9-5 svIO_0.9-5
R2HTML_1.59 svMisc_0.9-5 svIDE_0.9-5 loaded via a
namespace (and not attached):
[1] tools_2.7.2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Wolfgang Raffelsberger, PhD
Laboratoire de BioInformatique et Génomique Intégratives
CNRS UMR7104, IGBMC
1 rue Laurent Fries, 67404 Illkirch Strasbourg, France
Tel (+33) 388 65 3300 Fax (+33) 388 65 3276
[EMAIL PROTECTED]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing