Dear Patrick,

thank's for your response. After looking in the nice functions that you pointed out now I have three new questions:

1) Using commands like mismatchTable or consensusMatrix, is there a way to I tell/recall which of my sample-sequences has given alterations in the pairwise alignments. This might correspond to a list for each line of mismatchSummary(alignments)$subject tells which of the sample-sequence contributed to the effect counted in mismatchSummary(alignments)$subject$Count . Or are there some other ways/commands allowing to do so ?

2) I noticed that I get the same paire-wise alignment results (even same score) when using +5 insted of -5 for gapOpening etc. Why bother about writing values as negative if they're taken as abs() ?

3) I don't see insertions displayed with consensusMatrix(), ie the "+" lines stays empty while deletions ("-") are counted correctly (and I didn't get insertion() to work). My example sample-sequences no 5 & 6 have insertions.

Thank's in advance,
Wolfgang


library(Biostrings)

mat <- matrix(-3,nc=5,nr=5)
diag(mat) <- 1
mat[,5] <- 0
mat[5,] <- 0
rownames(mat) <- colnames(mat) <- c("A","C","G","T","N")

ref <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAA")
samp <- DNAStringSet(c("CTTCtCCAGCTCCCTGGCGGTAAGTTGA","ACTTCtCCAGCTagCTGnnGGTAAGTTGA",
"TTCACCAGCTCCCTGCGGGTAAGTT","TTCACGCTcCTGGGTAAGGATCA",

"cACTTCACCAGCTCCCTGGCGGTAAGT","nnnnTTCtCCAGCTCCCccTGGCGatngGTAAGTTGATCtcgt")) # samp: no1: single mut, no2: mult mut, no3: just shorter, no4: deletions, no5: insertion, no6: mult insertions

alignments <- pairwiseAlignment(samp, ref, substitutionMatrix=mat,
 gapOpening= -5, gapExtension= -2, scoreOnly=FALSE, type="global")

# test if abs() values for gapOpening change anything
alignments2 <- pairwiseAlignment(samp, ref, substitutionMatrix=mat,
 gapOpening= 5, gapExtension= 2, scoreOnly=FALSE, type="global")
identical(alignments,alignments2)     # is TRUE

# check for deletions
consensusMatrix(alignments, baseOnly = F)[c(1:4,15:17),]
# the last line "+" doesn't show/count any insertions while deletions "-" are counted OK




> sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32

locale:
LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.10.0 IRanges_1.0.1 loaded via a namespace (and not attached):
[1] grid_2.8.0         lattice_0.17-15    Matrix_0.999375-16




Patrick Aboyoun a écrit :
Wolfgang,
I'm glad to hear you are finding the Biostrings package to be useful. The pairwiseAlignment function has some helper functions that would be useful to a SNP analysis. In particular, you may want to take a look at the mismatchTable, mismatchSummary, summary, and consensusMatrix functions to see if they provide the summaries you are looking for. I also recommend you leverage the vectorized capabilities of the pairwiseAlignment function since it will be easier to maintain your code, provide faster execution time, and make it easier to summarize your results. I have modified your code slightly to show you what you can get "out of the box" from the Biostrings package without writing too much code:


library(Biostrings)
# the substitution matrix
mat <- matrix(-3, nrow = 5, ncol = 5)
diag(mat) <- 1
mat[,5] <- 0; mat[5,] <- 0
rownames(mat) <- colnames(mat) <- DNA_ALPHABET[c(1:4,15)]

# the consensus reference sequence
ref <- DNAString("ACTTCACCAGCTCCCTGGC")
# the sample sequences; the 3rd one has no mutations, it's simply shorter ...
samp <-
DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG"))

# vectorized alignments using the ref as the "subject"
alignments <-
 pairwiseAlignment(samp, ref, substitutionMatrix=mat,
   gapOpening= -5, gapExtension= -2, scoreOnly=FALSE,
   type="global")

# summaries useful for SNP analysis
summary(alignments)
mismatchSummary(alignments)
mismatchTable(alignments)
consensusMatrix(alignments, baseOnly = TRUE)



Patrick


Wolfgang Raffelsberger wrote:
Hi Herve,

thank you for your helpful answer.
Initially I was wondering if there are already some functions/packages
directly for my tasks available, as Bioconducter and CRAN are very
strong resources.

Since your Biostrings package provides all the basic functionalities,
I've built a few (additional) functions (on top of Biostrings) doing the
job.
To be more precise, yes, I'm interested in detecting all alterations
(mutations+ insertions + deletions). However, within the project's
biological context mutations are the prime focus. Right now I'm rather
in an exploratory/pilot phase for the nucleotide-alteration/SNP
counting, so I'll probably improve & expand some of my quickly written code.

For the sequence alignments I prefer global, since I'd rather want the
'best' match possible (ie the default setting with pairwiseAlignment()
).  Concerning the choice of (custom) substitution matrices I'm not very
decided.  Right now I'm using very generic ones (like in the Biostrings
vignette). Recently I've learned that there are several 'versions' of
the Needleman-Wunsch algorithm (ie various improvements to the original
one), so I'm not any more surprised of seeing different scores from
different implementations...
Anyway, once it's clear which sample/target (ie sequencer output) should
be compared with which reference sequence (ie the consenus genomic
sequence of the very region to be tested) I don't use the score any more
and luckily my test-data contain few completely unexpected 'alien'
sequences.  For this initial step now I use matchPattern() - thank's for
your hint- where I primarily need to test if sequences have to be read
as inverse/complement.
Again, after a bit more digging I found the corresponding functions and
I'm quite happy with the performance !

At the core of SNP/alterations counting I dissect the pairwise alignment
and record what kind of changes have happened (at this level I noted
some changes to the new BioC 2.3 implementation ...see also code below).
Here I use (so far) strsplit() to make two vectors with individual
nucleotides (and gaps) to extract the positions&changes from each of the
lines of the pairwise alignment (see code below). I guess there may be
other more elegant/faster ways to do this...
As I'm primarily interested in 'simple mutations', so far I took a
shortcut consisting in simply looking at the first of inserted
nucleotide in cases of insertions (some additional lines not part of the
code below). Maybe later I'll get back to expand this ... After all I
end up with a simple matrix where each row corresponds to a position of
the reference sequence and the columns contain counts of the
events/alterations observed (where at least I'm able to tell if any
insertions have happened at all) where I can run classical statistical
tests.

Cheers,
Wolfgang

# here the way I 'parse' pairwise alignments :

library(Biostrings)
mat <- matrix(-3, nrow = 5, ncol = 5)
diag(mat) <- 1
mat[,5] <- 0; mat[5,] <- 0
rownames(mat) <- colnames(mat) <- DNA_ALPHABET[c(1:4,15)]

ref <- DNAString("ACTTCACCAGCTCCCTGGC")                # the consensus
reference sequence
samp <-
DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG"))
# the 3rd one has no mutations, it's simply shorter ...
# here (simplified) I already know that my sequence is oriented correctly
alignRes <- matrix(nr=nchar(ref),nc=length(samp))
for(i in 1:length(samp)) {
  x <- pairwiseAlignment(as.character(ref), samp[[i]],
substitutionMatrix= mat, gapOpening= -5, gapExtension= -2,scoreOnly=F,
type="global")
  cutsamp <- unlist(strsplit(as.character([EMAIL PROTECTED]),""))
  cutref <- unlist(strsplit(as.character([EMAIL PROTECTED]),""))
  if(nchar(ref) != length(cutref)) {
    origRefNt <- unlist(strsplit(as.character(ref),""))
    offS <- 1
    while(origRefNt[1+offS] != cutref[1]) offS <- offS+1
  }
  out <- rep("Z",nchar(ref))                                    # use
"Z" for (default) constant
  if( !identical(cutsamp,cutref)) {
    mutPos <- !(cutref== cutsamp)
    out[mutPos+offS] <- cutsamp[mutPos+offS]
    if(sum(cutref=="-")>0) out <- out[-which(cutref=="-")]    # remove
gaps in reference
    # gaps could/can be retained at this stage only to a limited degree
by additional subsitutions
  }
 alignRes[,i] <- out
}

sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32

locale:
LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.10.0 IRanges_1.0.1

loaded via a namespace (and not attached):
[1] grid_2.8.0         lattice_0.17-15    Matrix_0.999375-16



Herve Pages a écrit :
Hi Wolfgang,

If I understand correctly you want to compare a collection of "sample"
sequences against a "reference" sequence.

There are many ways to "compare" 2 sequences and to get a score from this comparison: Do you want to allow indels or just substitutions (aka replacements)? Do you want the alignment to be global or local? Do you want the score to be
determined by a custom substitution matrix or just to reflect the "edit
distance" between the 2 sequences (i.e. the minimum number of replacements + insertions + deletions required to transform sequence 1 into sequence 2).

If you are trying to get the edit distance between the sample sequences
and the reference sequence then you might want to look at this post:

https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2008-July/000067.html

Note that the 'type' argument is omitted in this post so pairwiseAlignment() is used to perform "global" alignments, that is, the score that is returned indicates the min nb of edit operations needed to transform the *complete* sequences in 'pattern' (your "sample" sequences) into the *complete* 'subject'
(your "reference" sequence).
One nice property of "global" alignment is that it is a symetric operation on the 2 sequences to align. You can take advantage of this by inverting the roles of the "sample" and the "reference" sequences so 'samp' can be the pattern
and 'ref' the subject. Then you can benefit from the fact that
pairwiseAlignment() is vectorized on the pattern.

But then you say that you would primarily be interested in finding where a given nucleotide differs from the query. This suggests that you are in fact looking for a much simpler form of alignment where only mismatches (i.e.
replacements) are allowed but no indels. Also you say that your
sample-sequences may start or end slightly later/earlier.
This means that you can use the matchPattern() function for your problem.

  > m2 <- matchPattern(samp[[2]], ref, max.mismatch=3)
  > m2
    Views on a 19-letter DNAString subject
  subject: ACTTCACCAGCTCCCTGGC
  views:
      start end width
  [1]     1  18    18 [ACTTCACCAGCTCCCTGG]

Then use the mismatch() function to get the position(s) of the mismatche(s):

  > mismatch(samp[[2]], m2)[[1]]
  [1]  6 13

Cheers,
H.


Wolfgang Raffelsberger wrote:
Dear list,

I would like to count the occurrence of (mostly single) nucleotide
polymorphisms from nucleotide sequences.
I got across the Biostrings package and pairwiseAlignment() that allows
me to get closer to what I want but
1) I noticed that the score produced from pairwiseAlignment() is quite
different to other implementations of the Needlaman-Wunsch alogorithm
(eg in EMBOSS)
2) the score is not directly the information I 'm looking for since it's
a mixture of the gaps & mismatches (and I don't see if/how one could
modify that).

However, I would primarily be interested in finding where a given
nucleotide differs from the query (from a pairwise alignment) to some
statistics on them, ie at which position I get which other element
instead. Note, that my sample-sequences may start or end slightly
later/earlier.
Any suggestions ?

Sample code might look like (of course, my real sequences are longer ...):

ref <- DNAString("ACTTCACCAGCTCCCTGGC")
samp <-
DNAStringSet(c("CTTCTCCAGCTCCCTGG","ACTTCTCCAGCTACCTGG","TTCACCAGCTCCCTG")) # the 3rd one has no mutations, it's simply shorter ... pairwiseAlignment(ref, samp[[1]], substitutionMatrix = mat, gapOpening
= -5, gapExtension = -2)
alignScores <- numeric()
for(i in 1:3) alignScores[i] <- pairwiseAlignment(ref, samp[[i]],
substitutionMatrix = mat, gapOpening = -5, gapExtension = -2, scoreOnly=T)
alignScores     # the 3rd sequence without mismatches gets worst score


(Based on a previous post on BioC) I just subscribed to
[email protected], but I don't know if I don't mange to search the previous mail archives (on http://search.gmane.org/) since I
keep getting (general) Bioconductor messages.

Thank's in advance,
Wolfgang


By the way, if that matters, I'm (still) running R-2.7.2
sessionInfo()
R version 2.7.2 (2008-08-25)
i386-pc-mingw32

locale:
LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252


attached base packages:
[1] stats graphics grDevices datasets tcltk utils methods base other attached packages: [1] Biostrings_2.8.18 svSocket_0.9-5 svIO_0.9-5 R2HTML_1.59 svMisc_0.9-5 svIDE_0.9-5 loaded via a namespace (and not attached):
[1] tools_2.7.2



. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Wolfgang Raffelsberger, PhD
Laboratoire de BioInformatique et Génomique Intégratives
CNRS UMR7104, IGBMC 1 rue Laurent Fries, 67404 Illkirch Strasbourg, France
Tel (+33) 388 65 3300         Fax (+33) 388 65 3276
[EMAIL PROTECTED]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to