Hi Hervé,
thanks for your comprehensive reply, that is really helpful. Yes, I think a
nmismatchIndex() function that returns a list of actual mismatches would be
very useful.
cheers,
Marcus
==working example below==
pattern <- DNAString("AATTCGGCACGAGG")
subjects <- DNAStringSet(c(
"match-1" = "TTCGGCACGAGGTT",
"match 0" = "ATTCGGCACGAGGT",
"match+1" = "AATTCGGCACGAGG",
"match+2" = "TAATTCGGCACGAG",
"match+3" = "TTAATTCGGCACGA",
"match+1 and +15" =
"AATTCGGCACGAGGAATTCGGCACGAGG"
)
)
print(subjects)
tmp <- vmatchPattern(pattern, subjects, max.mismatch=2)
print(startIndex(tmp))
mismatches <- vector("list", length(subjects))
for(i in seq(subjects)) {
viewsList<- Views(subjects[[i]], tmp[[i]])
mismatches[[i]] <- nmismatch(pattern, viewsList)
}
print(mismatches)
2010/8/13 Hervé Pagès <[email protected]>
> Hi Marcus,
>
>
> On 08/09/2010 09:14 PM, Marcus Davy wrote:
>
>> Hi,
>>
>> I am using vmatchPattern, to match a pattern against many subjects, I
>> would
>> like to count the actual numbers
>> of mismatched bases for each successful match, where max.mismatch is set
>> to
>> be>0.
>> I have found the functions 'nmatch', and 'nmismatch' which work with
>> matchPattern,
>>
>> e.g.
>> pattern<- "AATTCGGCACGAGG"
>> subject<- DNAString("TTCGGCACGAGGTT")
>> tmp<- matchPattern(pattern, subject, max.mismatch=2)
>> ## Two mismatches
>> print(nmismatch(toString(pattern), tmp))
>>
>> I have had a go at extending this for unique matches with vmatchPattern,
>> storing the mismatches in a vector,
>> but a vector does not capture multiple matches between a pattern and a
>> subject, the output needs to be a list.
>>
>> pattern<- DNAStringSet("AATTCGGCACGAGG")
>>
>
> Betters to make pattern a DNAString object:
>
> pattern <- DNAString("AATTCGGCACGAGG")
>
>
> subjects<- DNAStringSet(c(
>> "match-1" = "TTCGGCACGAGGTT",
>> "match 0" = "ATTCGGCACGAGGT",
>> "match+1" = "AATTCGGCACGAGG",
>> "match+2" = "TAATTCGGCACGAG",
>> "match+3" = "TTAATTCGGCACGA",
>> "match+1 and +15" =
>> "AATTCGGCACGAGGAATTCGGCACGAGG"
>> )
>> )
>> print(subjects)
>>
>> tmp<- vmatchPattern(toString(pattern), subjects, max.mismatch=2)
>>
>
> toString() not needed anymore because pattern is a DNAString.
>
>
> print(startIndex(tmp))
>> ind<- which(countIndex(tmp)==1)
>> mismatches<- rep(NA, length(subjects))
>>
>> for(i in seq(subjects)) {
>> viewsList<- extractAllMatches(subjects[[i]], tmp)[i]
>> mismatches[i]<- nmismatch(toString(pattern), viewsList)
>> }
>>
>
> It looks to me that
>
>
> viewsList<- extractAllMatches(subjects[[i]], tmp)[i]
>
> won't do what you want in the general case. This call to
> extractAllMatches():
>
> extractAllMatches(subjects[[i]], tmp)
>
> converts *all* the matches stored in 'tmp' into views on
> subjects[[i]], even the matches that don't hit 'subjects[[i]]'.
> In fact, extractAllMatches() is not intended to be used on
> an MIndex object returned by vmatchPattern() but only on
> an MIndex object returned by matchPDict() (here all the matches
> belong to the same subject).
>
> Because of this, when you extract the i-th view from this Views
> object with
>
>
> extractAllMatches(subjects[[i]], tmp)[i]
>
> you are not guaranteed to get a valid view (i.e. a view that
> corresponds to a match that actually hits the i-th subject).
> It works only by chance in your case because all the subjects
> before the i-th subject have exactly 1 hit.
>
> A better way to generate the Views object for the i-th subject
> is:
>
> Views(subjects[[i]], tmp[[i]])
>
> So:
>
>
> for(i in seq(subjects)) {
> viewsList<- Views(subjects[[i]], tmp[[i]])
> mismatches[i] <- nmismatch(pattern, viewsList)
>
> }
>
>
>> ## Second match at +15 in subjects[[6]] is never taken into account
>> print(mismatches)
>>
>
> 'mismatches' needs to be a list:
>
> mismatches <- vector("list", length(subjects))
>
> for(i in seq(subjects)) {
> viewsList<- Views(subjects[[i]], tmp[[i]])
> mismatches[[i]] <- nmismatch(pattern, viewsList)
> }
>
> Note the use of [[ instead of [ in mismatches[[i]]
> Then:
>
> > mismatches
> [[1]]
> [1] 2
>
> [[2]]
> [1] 1
>
> [[3]]
> [1] 0
>
> [[4]]
> [1] 1
>
> [[5]]
> [1] 2
>
> [[6]]
> [1] 0 0
>
>
>
>> Incidentally this sapply equivalent was failing for me, I haven't yet
>> solved
>> why.
>> ## sapply fails...
>> sapply(seq(subjects), function(y){ nmismatch(toString(pattern),
>> extractAllMatches(subjects[[y]], tmp)[y])})
>>
>
> Better to use lapply() because you want the result in a list, not in
> an integer vector. But it seems that nmismatch() can't be used in that
> context:
>
> > lapply(seq(subjects),
> function(i)
> nmismatch(pattern, Views(subjects[[i]], tmp[[i]])))
> Error in checkAndTranslateDbleBracketSubscript(x, i) :
> object 'i' not found
>
> This is clearly a bug that has to do with how nmismatch() is
> implemented:
>
> > selectMethod("nmismatch", c("ANY", "XStringViews"))
> Method Definition:
>
> function (pattern, x, fixed = TRUE)
> {
> funCall <- match.call()
> funCall[[1]] <- as.name("mismatch")
> mismatches <- eval(funCall, sys.parent())
> elementLengths(mismatches)
> }
>
> I'll fix it.
>
>
>
>> Is there a obvious way to extend these functions to vmatchPattern output
>> of
>> class MIndex, ideally a list in the
>> same structural form as startIndex/endIndex? Is this an additional measure
>> that vmatchPattern could also return?
>>
>
> Yes, that is a fair request. I'll add something like this. Maybe an
> nmismatchIndex() function that would return a list with the same
> shape as the list returned by startIndex/endIndex. I can't really
> add a "nmismatch" method for the c(pattern"ANY", x="MIndex") because
> the reference sequences need to be supplied but they are not stored
> in the MIndex object so an extra argument would be needed.
>
> I'm open to suggestions.
>
> Cheers,
> H.
>
>
>>
>> cheers,
>>
>>
>> Marcus
>>
>>
>> sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> x86_64-apple-darwin9.8.0
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Biostrings_2.16.9 IRanges_1.6.11 NGS_0.9.1
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0 tools_2.11.1
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: [email protected]
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing