Re: [Bioc-sig-seq] trimLRPatterns vs vmatchPattern

Harris A. Jaffee Tue, 26 Jan 2010 06:56:28 -0800

You can get the with.Lindels=TRUE behavior you want by reversing your data and calling trimLRPatterns with with.Rindels=TRUE and an Rpattern equal to the reverse of your Lpattern, and then reversing the results.
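
The reversing trick can be illustrated in plain base R, without Biostrings. This is only a sketch under stated assumptions: rev_string(), trim_right_exact() and trim_left_via_reverse() are hypothetical helpers, and the right-end trimmer handles exact matches only, standing in for a trimLRPatterns call with with.Rindels=TRUE. Note that the data are reversed, not reverse-complemented.

```r
# Reverse each string letter by letter (no complementing).
rev_string <- function(x) {
  vapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

# Hypothetical stand-in for a trimmer that only works at the right end:
# drop rpattern when the subject ends with it exactly.
trim_right_exact <- function(subject, rpattern) {
  hit <- endsWith(subject, rpattern)
  ifelse(hit,
         substr(subject, 1, nchar(subject) - nchar(rpattern)),
         subject)
}

# Left-end trimming obtained purely from the right-end trimmer:
# reverse the data and the pattern, trim on the right, reverse back.
trim_left_via_reverse <- function(subject, lpattern) {
  rev_string(trim_right_exact(rev_string(subject), rev_string(lpattern)))
}

result <- trim_left_via_reverse(c("ACGTTAC", "GGGG"), "ACG")
print(result)  # "TTAC" "GGGG"
```

With Biostrings objects the same three steps would presumably use reverse() on the subject, on the pattern, and on the trimmed result.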

But the behavior you like with indels might be considered a bug, or at least something that trimLRPatterns shouldn't do, in the sense that the match you like isn't flanking, so might be artificial. Also, setting the max.mismatch values is difficult because one can't know in advance how many letters of the current pattern will extend off the edge of the subject, which depends on the match, i.e. how many inserts into the pattern will be required. Worse, "the" match-with-indels is not at all unique. In my mind, it also isn't clear that Herve's 'best local match' (shortest one with the minimal edit distance) is what trimLRPatterns wants. It might want the longest match with the minimal edit distance, the longest match with maximal allowed edit distance, etc.

I have a version that demands that matches-with-indels start at the beginning of the subject, for Lpattern, and end at the end of the subject, for Rpattern. [The ending part would also not be implemented yet, if I needed Proffset, but I do it by the reversing trick.] I trim by the length of the current pattern rather than the length of the match. This is a one-size-fits-all solution, since the match could be shorter or longer. Maybe I should trim by the length of the pattern plus the number of allowed edits. The max.mismatch limits are less circular, so easier to set.
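
To make the anchored idea concrete, here is a rough base-R sketch. It is not the patch mentioned above: anchored_left_trim() is a hypothetical helper, it uses utils::adist() (unit-cost edit distance) in place of the Biostrings matching code, and it assumes a single edit budget for the full-length pattern. A subject "matches" if some prefix of it is within max_edits of the pattern, and the trim is always nchar(lpattern) letters.

```r
# Hypothetical helper: match anchored at the start of the subject,
# indels allowed, trimming by pattern length rather than match length.
anchored_left_trim <- function(subject, lpattern, max_edits) {
  n <- nchar(lpattern)
  # A prefix within max_edits of the pattern has length n +/- max_edits.
  lens <- (n - max_edits):(n + max_edits)
  lens <- lens[lens >= 0 & lens <= nchar(subject)]
  if (length(lens) == 0) return(list(edits = integer(0), trimmed = subject))
  prefixes <- substring(subject, 1, lens)
  edits <- as.vector(adist(lpattern, prefixes))  # edit distance per prefix
  names(edits) <- lens
  hit <- min(edits) <= max_edits
  list(edits = edits,
       trimmed = if (hit) substr(subject, n + 1, nchar(subject)) else subject)
}

pattern  <- "AAGCAGTGGTATCAACGCAGAGT"
tail_g   <- strrep("G", 20)
subjects <- c(paste0(substring(pattern, 2), tail_g),  # primer minus first base
              paste0(pattern, tail_g),                # exact primer
              paste0("A", pattern, tail_g))           # primer after an extra A

res <- lapply(subjects, anchored_left_trim, lpattern = pattern, max_edits = 1)
widths <- vapply(res, function(r) nchar(r$trimmed), integer(1))
print(widths)  # 19 20 21

# Number of distinct prefixes of the exact-primer subject within 1 edit.
n_candidates <- sum(res[[2]]$edits <= 1)
print(n_candidates)  # 3
```

All three subjects match, but trimming by pattern length over-trims the first by one base and under-trims the third by one base. For the exact-primer subject, three different prefixes lie within one edit of the pattern, which is the non-uniqueness problem in miniature.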

I was about to submit a patch, but now it's open to debate.

On Jan 25, 2010, at 9:54 PM, Marcus Davy wrote:

Hi,
I have some 454 data which is expected to contain a primer at the beginning.
I noticed that using trimLRPatterns to trim an Lpattern at the 5' start of a sequence with 1 mismatch does not allow matches to the subject sequence up to the number of mismatches directly to the right of the sequence start (and also to the left of an Rpattern at the 3' end).
So if my example primer Lpattern is 23 bases in length, trimLRPatterns with 1 mismatch will match all sequences at positions 0 to 22 and 1 to 23, but not 2 to 24 whereas if you use vmatchPattern, it will find all three of these positions.
My question is should an option/arg in trimLRPatterns be made available that allows matches to the right of the Lpattern up to the number of mismatches, and to the left of the Rpattern up to the number of mismatches?
I think the arg e.g. 'with.Lindels=TRUE' (which appears to have not been enabled yet in Biostrings_2.14.8) may partially resolve this if the insertion is somewhere at the beginning of a subject sequence (within the number of mismatches allowed) which can then be matched by the Lpattern.
library(Biostrings)
library(ShortRead)   # provides polyn()

mismatches <- 1
pattern    <- "AAGCAGTGGTATCAACGCAGAGT"
w          <- width(pattern)
mm         <- rep(mismatches, w)
base       <- "G"
n          <- 20

subjectList <- list(
    "Primer0+poly(T)" = paste(substring(pattern, 2, w), polyn(base, n), sep=""),
    "Primer1+poly(T)" = paste(pattern, polyn(base, n), sep=""),
    "Primer2+poly(T)" = paste("A", substring(pattern, 1, w), polyn(base, n), sep="")
)
subjectSet <- DNAStringSet(unlist(subjectList))

print(subjectSet)
  A DNAStringSet instance of length 3
    width seq                                          names
[1]    42 AGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG   Primer0+poly(T)
[2]    43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG  Primer1+poly(T)
[3]    44 AAAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer2+poly(T)
cat("Primer:  ", pattern, "\n")
Primer:  AAGCAGTGGTATCAACGCAGAGT
## trimLRPatterns - Lpattern
LRcoords <- trimLRPatterns(Lpattern = pattern, subject = subjectSet,
                           max.Lmismatch = mm, ranges = TRUE,
                           with.Lindels = FALSE)
## Fails to match subjectSet[3] - trimming with mismatches is to the
## left of Lpattern (and to the right of Rpattern)
DNAStringSet(subjectSet, start(LRcoords), end(LRcoords))
  A DNAStringSet instance of length 3
    width seq                                         names
[1]    20 GGGGGGGGGGGGGGGGGGGG                        Primer0+poly(T)
[2]    20 GGGGGGGGGGGGGGGGGGGG                        Primer1+poly(T)
[3]    43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer2+poly(T)
## vmatchPattern
tmp <- vmatchPattern(pattern, subjectSet, max.mismatch=mismatches)
matchIndex <- which(as.logical(countIndex(tmp)))
## only one match per sequence so indices preserved with unlist
VMcoords <- unlist(tmp)
## Matches to subjectSet[3]
DNAStringSet(subjectSet[matchIndex], end(VMcoords) + 1,
             width(subjectSet)[matchIndex])
  A DNAStringSet instance of length 3
    width seq                  names
[1]    20 GGGGGGGGGGGGGGGGGGGG Primer0+poly(T)
[2]    20 GGGGGGGGGGGGGGGGGGGG Primer1+poly(T)
[3]    20 GGGGGGGGGGGGGGGGGGGG Primer2+poly(T)
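
The widths in the trimLRPatterns output above (20, 20, 43) can be reproduced with a small base-R sketch of the flanking rule. This is an illustration, not the Biostrings implementation: hamming() and trim_left_flank() are hypothetical helpers, and the sketch assumes that max_mm[i] is the mismatch budget for comparing the length-i suffix of the pattern with the first i letters of the subject, trying the longest suffix first.

```r
# Number of mismatching positions between two equal-length strings.
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Compare successively shorter suffixes of lpattern with the subject's
# prefix of the same length; trim at the first one within its budget.
trim_left_flank <- function(subject, lpattern, max_mm) {
  n <- nchar(lpattern)
  for (i in rev(seq_len(min(n, nchar(subject))))) {
    suffix <- substr(lpattern, n - i + 1, n)
    prefix <- substr(subject, 1, i)
    if (hamming(suffix, prefix) <= max_mm[i]) {
      return(substr(subject, i + 1, nchar(subject)))
    }
  }
  subject  # no flanking match: leave untrimmed
}

pattern  <- "AAGCAGTGGTATCAACGCAGAGT"
tail_g   <- strrep("G", 20)
subjects <- c(paste0(substring(pattern, 2), tail_g),
              paste0(pattern, tail_g),
              paste0("A", pattern, tail_g))
mm <- rep(1, nchar(pattern))  # one mismatch allowed at every suffix length

trimmed <- vapply(subjects, trim_left_flank, character(1),
                  lpattern = pattern, max_mm = mm, USE.NAMES = FALSE)
print(nchar(trimmed))  # 20 20 43
```

Under this reading, the third subject loses exactly one base because a budget of one mismatch at every length lets even the one-letter suffix "T" match the leading "A". The shifted primer itself is never found, since no suffix of the pattern lines up with the start of that subject within one mismatch.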
cheers,


 Marcus
sessionInfo()
R version 2.10.0 (2009-10-26)
powerpc-apple-darwin8.11.1

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu95av2probe_2.5.0 AnnotationDbi_1.8.1 Biobase_2.6.1
[4] ShortRead_1.4.0     lattice_0.17-26     BSgenome_1.14.2
[7] Biostrings_2.14.8   IRanges_1.4.9

loaded via a namespace (and not attached):
[1] DBI_0.2-4     RSQLite_0.7-3 grid_2.10.0   hwriter_1.1   tools_2.10.0

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

