Hi Marcus,

On 08/09/2010 09:14 PM, Marcus Davy wrote:
Hi,

I am using vmatchPattern, to match a pattern against many subjects, I would
like to count the actual numbers
of mismatched bases for each successful match, where max.mismatch is set to
be>0.
I have found the functions 'nmatch', and 'nmismatch' which work with
matchPattern,

e.g.
pattern<- "AATTCGGCACGAGG"
subject<- DNAString("TTCGGCACGAGGTT")
tmp<- matchPattern(pattern, subject, max.mismatch=2)
## Two mismatches
print(nmismatch(toString(pattern), tmp))

I have had a go at extending this for unique matches with vmatchPattern,
storing the mismatches in a vector,
but a vector does not capture multiple matches between a pattern and a
subject, the output needs to be a list.

pattern<- DNAStringSet("AATTCGGCACGAGG")

Betters to make pattern a DNAString object:

  pattern <- DNAString("AATTCGGCACGAGG")

subjects<- DNAStringSet(c(
                              "match-1" = "TTCGGCACGAGGTT",
                              "match 0" = "ATTCGGCACGAGGT",
                              "match+1" = "AATTCGGCACGAGG",
                              "match+2" = "TAATTCGGCACGAG",
                              "match+3" = "TTAATTCGGCACGA",
                              "match+1 and +15" =
"AATTCGGCACGAGGAATTCGGCACGAGG"
                              )
                            )
print(subjects)

tmp<- vmatchPattern(toString(pattern), subjects, max.mismatch=2)

toString() not needed anymore because pattern is a DNAString.

print(startIndex(tmp))
ind<- which(countIndex(tmp)==1)
mismatches<- rep(NA, length(subjects))

for(i in seq(subjects)) {
   viewsList<- extractAllMatches(subjects[[i]], tmp)[i]
   mismatches[i]<- nmismatch(toString(pattern), viewsList)
}

It looks to me that

   viewsList<- extractAllMatches(subjects[[i]], tmp)[i]

won't do what you want in the general case. This call to
extractAllMatches():

  extractAllMatches(subjects[[i]], tmp)

converts *all* the matches stored in 'tmp' into views on
subjects[[i]], even the matches that don't hit 'subjects[[i]]'.
In fact, extractAllMatches() is not intended to be used on
an MIndex object returned by vmatchPattern() but only on
an MIndex object returned by matchPDict() (here all the matches
belong to the same subject).

Because of this, when you extract the i-th view from this Views
object with

  extractAllMatches(subjects[[i]], tmp)[i]

you are not guaranteed to get a valid view (i.e. a view that
corresponds to a match that actually hits the i-th subject).
It works only by chance in your case because all the subjects
before the i-th subject have exactly 1 hit.

A better way to generate the Views object for the i-th subject
is:

  Views(subjects[[i]], tmp[[i]])

So:

  for(i in seq(subjects)) {
     viewsList<- Views(subjects[[i]], tmp[[i]])
     mismatches[i] <- nmismatch(pattern, viewsList)
  }


## Second match at +15 in subjects[[6]] is never taken into account
print(mismatches)

'mismatches' needs to be a list:

  mismatches <- vector("list", length(subjects))
  for(i in seq(subjects)) {
     viewsList<- Views(subjects[[i]], tmp[[i]])
     mismatches[[i]] <- nmismatch(pattern, viewsList)
  }

Note the use of [[ instead of [ in mismatches[[i]]
Then:

  > mismatches
  [[1]]
  [1] 2

  [[2]]
  [1] 1

  [[3]]
  [1] 0

  [[4]]
  [1] 1

  [[5]]
  [1] 2

  [[6]]
  [1] 0 0


Incidentally this sapply equivalent was failing for me, I haven't yet solved
why.
## sapply fails...
sapply(seq(subjects), function(y){ nmismatch(toString(pattern),
extractAllMatches(subjects[[y]], tmp)[y])})

Better to use lapply() because you want the result in a list, not in
an integer vector. But it seems that nmismatch() can't be used in that
context:

  > lapply(seq(subjects),
           function(i)
             nmismatch(pattern, Views(subjects[[i]], tmp[[i]])))
  Error in checkAndTranslateDbleBracketSubscript(x, i) :
    object 'i' not found

This is clearly a bug that has to do with how nmismatch() is
implemented:

  > selectMethod("nmismatch", c("ANY", "XStringViews"))
  Method Definition:

  function (pattern, x, fixed = TRUE)
  {
    funCall <- match.call()
    funCall[[1]] <- as.name("mismatch")
    mismatches <- eval(funCall, sys.parent())
    elementLengths(mismatches)
  }

I'll fix it.


Is there a obvious way to extend these functions to vmatchPattern output of
class MIndex, ideally a list in the
same structural form as startIndex/endIndex? Is this an additional measure
that vmatchPattern could also return?

Yes, that is a fair request. I'll add something like this. Maybe an
nmismatchIndex() function that would return a list with the same
shape as the list returned by startIndex/endIndex. I can't really
add a "nmismatch" method for the c(pattern"ANY", x="MIndex") because
the reference sequences need to be supplied but they are not stored
in the MIndex object so an extra argument would be needed.

I'm open to suggestions.

Cheers,
H.



cheers,


Marcus


sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-apple-darwin9.8.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.16.9 IRanges_1.6.11    NGS_0.9.1

loaded via a namespace (and not attached):
[1] Biobase_2.8.0 tools_2.11.1

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to