Re: [Bioc-sig-seq] Counting mismatches with vmatchPattern

Marcus Davy Mon, 16 Aug 2010 01:40:13 -0700

Hi Hervé,

thanks for your comprehensive reply, that is really helpful. Yes, I think a


nmismatchIndex() function that returns a list of actual mismatches would be
very useful.


cheers,



Marcus



==working example below==


pattern    <- DNAString("AATTCGGCACGAGG")

subjects   <- DNAStringSet(c(

                             "match-1" = "TTCGGCACGAGGTT",

                             "match 0" = "ATTCGGCACGAGGT",

                             "match+1" = "AATTCGGCACGAGG",

                             "match+2" = "TAATTCGGCACGAG",

                             "match+3" = "TTAATTCGGCACGA",

                             "match+1 and +15" =
"AATTCGGCACGAGGAATTCGGCACGAGG"

                             )

                           )

print(subjects)


tmp        <- vmatchPattern(pattern, subjects, max.mismatch=2)

print(startIndex(tmp))


mismatches <- vector("list", length(subjects))


for(i in seq(subjects)) {

  viewsList<- Views(subjects[[i]], tmp[[i]])

  mismatches[[i]] <- nmismatch(pattern, viewsList)

}

print(mismatches)


2010/8/13 Hervé Pagès <[email protected]>

> Hi Marcus,
>
>
> On 08/09/2010 09:14 PM, Marcus Davy wrote:
>
>> Hi,
>>
>> I am using vmatchPattern, to match a pattern against many subjects, I
>> would
>> like to count the actual numbers
>> of mismatched bases for each successful match, where max.mismatch is set
>> to
>> be>0.
>> I have found the functions 'nmatch', and 'nmismatch' which work with
>> matchPattern,
>>
>> e.g.
>> pattern<- "AATTCGGCACGAGG"
>> subject<- DNAString("TTCGGCACGAGGTT")
>> tmp<- matchPattern(pattern, subject, max.mismatch=2)
>> ## Two mismatches
>> print(nmismatch(toString(pattern), tmp))
>>
>> I have had a go at extending this for unique matches with vmatchPattern,
>> storing the mismatches in a vector,
>> but a vector does not capture multiple matches between a pattern and a
>> subject, the output needs to be a list.
>>
>> pattern<- DNAStringSet("AATTCGGCACGAGG")
>>
>
> Betters to make pattern a DNAString object:
>
>  pattern <- DNAString("AATTCGGCACGAGG")
>
>
>  subjects<- DNAStringSet(c(
>>                              "match-1" = "TTCGGCACGAGGTT",
>>                              "match 0" = "ATTCGGCACGAGGT",
>>                              "match+1" = "AATTCGGCACGAGG",
>>                              "match+2" = "TAATTCGGCACGAG",
>>                              "match+3" = "TTAATTCGGCACGA",
>>                              "match+1 and +15" =
>> "AATTCGGCACGAGGAATTCGGCACGAGG"
>>                              )
>>                            )
>> print(subjects)
>>
>> tmp<- vmatchPattern(toString(pattern), subjects, max.mismatch=2)
>>
>
> toString() not needed anymore because pattern is a DNAString.
>
>
>  print(startIndex(tmp))
>> ind<- which(countIndex(tmp)==1)
>> mismatches<- rep(NA, length(subjects))
>>
>> for(i in seq(subjects)) {
>>   viewsList<- extractAllMatches(subjects[[i]], tmp)[i]
>>   mismatches[i]<- nmismatch(toString(pattern), viewsList)
>> }
>>
>
> It looks to me that
>
>
>   viewsList<- extractAllMatches(subjects[[i]], tmp)[i]
>
> won't do what you want in the general case. This call to
> extractAllMatches():
>
>  extractAllMatches(subjects[[i]], tmp)
>
> converts *all* the matches stored in 'tmp' into views on
> subjects[[i]], even the matches that don't hit 'subjects[[i]]'.
> In fact, extractAllMatches() is not intended to be used on
> an MIndex object returned by vmatchPattern() but only on
> an MIndex object returned by matchPDict() (here all the matches
> belong to the same subject).
>
> Because of this, when you extract the i-th view from this Views
> object with
>
>
>  extractAllMatches(subjects[[i]], tmp)[i]
>
> you are not guaranteed to get a valid view (i.e. a view that
> corresponds to a match that actually hits the i-th subject).
> It works only by chance in your case because all the subjects
> before the i-th subject have exactly 1 hit.
>
> A better way to generate the Views object for the i-th subject
> is:
>
>  Views(subjects[[i]], tmp[[i]])
>
> So:
>
>
>  for(i in seq(subjects)) {
>     viewsList<- Views(subjects[[i]], tmp[[i]])
>     mismatches[i] <- nmismatch(pattern, viewsList)
>
>  }
>
>
>> ## Second match at +15 in subjects[[6]] is never taken into account
>> print(mismatches)
>>
>
> 'mismatches' needs to be a list:
>
>  mismatches <- vector("list", length(subjects))
>
>  for(i in seq(subjects)) {
>     viewsList<- Views(subjects[[i]], tmp[[i]])
>     mismatches[[i]] <- nmismatch(pattern, viewsList)
>  }
>
> Note the use of [[ instead of [ in mismatches[[i]]
> Then:
>
>  > mismatches
>  [[1]]
>  [1] 2
>
>  [[2]]
>  [1] 1
>
>  [[3]]
>  [1] 0
>
>  [[4]]
>  [1] 1
>
>  [[5]]
>  [1] 2
>
>  [[6]]
>  [1] 0 0
>
>
>
>> Incidentally this sapply equivalent was failing for me, I haven't yet
>> solved
>> why.
>> ## sapply fails...
>> sapply(seq(subjects), function(y){ nmismatch(toString(pattern),
>> extractAllMatches(subjects[[y]], tmp)[y])})
>>
>
> Better to use lapply() because you want the result in a list, not in
> an integer vector. But it seems that nmismatch() can't be used in that
> context:
>
>  > lapply(seq(subjects),
>           function(i)
>             nmismatch(pattern, Views(subjects[[i]], tmp[[i]])))
>  Error in checkAndTranslateDbleBracketSubscript(x, i) :
>    object 'i' not found
>
> This is clearly a bug that has to do with how nmismatch() is
> implemented:
>
>  > selectMethod("nmismatch", c("ANY", "XStringViews"))
>  Method Definition:
>
>  function (pattern, x, fixed = TRUE)
>  {
>    funCall <- match.call()
>    funCall[[1]] <- as.name("mismatch")
>    mismatches <- eval(funCall, sys.parent())
>    elementLengths(mismatches)
>  }
>
> I'll fix it.
>
>
>
>> Is there a obvious way to extend these functions to vmatchPattern output
>> of
>> class MIndex, ideally a list in the
>> same structural form as startIndex/endIndex? Is this an additional measure
>> that vmatchPattern could also return?
>>
>
> Yes, that is a fair request. I'll add something like this. Maybe an
> nmismatchIndex() function that would return a list with the same
> shape as the list returned by startIndex/endIndex. I can't really
> add a "nmismatch" method for the c(pattern"ANY", x="MIndex") because
> the reference sequences need to be supplied but they are not stored
> in the MIndex object so an extra argument would be needed.
>
> I'm open to suggestions.
>
> Cheers,
> H.
>
>
>>
>> cheers,
>>
>>
>> Marcus
>>
>>
>> sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> x86_64-apple-darwin9.8.0
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] Biostrings_2.16.9 IRanges_1.6.11    NGS_0.9.1
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0 tools_2.11.1
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: [email protected]
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Counting mismatches with vmatchPattern

Reply via email to