You can get the with.Lindels=TRUE behavior you want by reversing your data and calling trimLRPatterns with with.Rindels=TRUE and an Rpattern equal to
the reverse of your Lpattern, and then reversing the results.

But the behavior you like with indels might be considered a bug, or at least something that trimLRPatterns shouldn't do, in the sense that the match you like isn't flanking, so might be artificial. Also, setting the max.mismatch values is difficult because one can't know in advance how many letters of the current pattern will extend off the edge of the subject, which depends on the match, i.e. how many inserts into the pattern will be required. Worse, "the" match-with-indels is not at all unique. In my mind, it also isn't clear that Herve's 'best local match' (shortest one with the minimal edit distance) is what trimLRPatterns wants. It might want the longest match with the minimal edit distance, the longest match with maximal allowed edit distance, etc.

I have a version that demands that matches-with-indels start at the beginning of the subject, for Lpattern, and end at the end of the subject, for Rpattern. [The ending part would also not be implemented yet, if I needed Proffset, but I do it by the reversing trick.] I trim by the length of the current pattern rather than the length of the match. This is a one-size-fits-all solution, since the match could be shorter or longer. Maybe I should trim by the length of the pattern plus the number of allowed edits. The max.mismatch limits are
less circular, so easier to set.

I was about to submit a patch, but now it's open to debate.

On Jan 25, 2010, at 9:54 PM, Marcus Davy wrote:

Hi,
I have some 454 data which is expected to contain a primer at the beginning.

I noticed that using trimLRPatterns to trim an Lpattern at the 5' start of a sequence with 1 mismatch does not allow matches to the subject sequence up to the number of mismatches directly to the right of the sequence start (and
also to the left of an Rpattern at the 3' end).

So if my example primer Lpattern is 23 bases in length, trimLRPatterns with 1 mismatch will match all sequences at positions 0 to 22 and 1 to 23, but not 2 to 24 whereas if you use vmatchPattern, it will find all three of
these positions.

My question is should an option/arg in trimLRPatterns be made available that allows matches to the right of the Lpattern up to the number of mismatches,
and to the left of the Rpattern up to the number of mismatches?

I think the arg e.g. 'with.Lindels=TRUE' (which appears to have not been
enabled yet in Biostrings_2.14.8) may partially resolve this if the
insertion is somewhere at the beginning of a subject sequence (within the number of mismatches allowed) which can then be matched by the Lpattern.


mismatches <- 1
pattern    <- "AAGCAGTGGTATCAACGCAGAGT"
w          <- width(pattern)
mm         <- rep(mismatches, w)
base       <- "G"
n          <- 20

subjectList <- list(
+ "Primer0+poly(T)" = paste(substring(pattern, 2,w),
polyn(base, n), sep=""),
+ "Primer1+poly(T)" = paste(pattern, polyn (base, n),
sep=""),
+ "Primer2+poly(T)" = paste("A", substring (pattern,1,w),
polyn(base,n), sep="")
+                     )

subjectSet <- DNAStringSet(unlist(subjectList))

print(subjectSet)
  A DNAStringSet instance of length 3
    width seq                                               names

[1] 42 AGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer0 +poly(T) [2] 43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer1 +poly(T) [3] 44 AAAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer2 +poly(T)
cat("Primer:  ", pattern, "\n")
Primer:  AAGCAGTGGTATCAACGCAGAGT

## trimLRPatterns -LPattern
LRcoords <- trimLRPatterns(Lpattern = pattern, subject = subjectSet,
 max.Lmismatch=mm, ranges=TRUE, with.Lindels=FALSE)
## Fails to match subjectSet[3:4] - trimming with mismatches is to the
left of Lpattern (and to the right of Rpattern)
DNAStringSet(subjectSet,start(LRcoords), end(LRcoords))
  A DNAStringSet instance of length 3
    width seq                                               names

[1] 20 GGGGGGGGGGGGGGGGGGGG Primer0 +poly(T) [2] 20 GGGGGGGGGGGGGGGGGGGG Primer1 +poly(T) [3] 43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer2 +poly(T)

## vmatchPattern
tmp <- vmatchPattern(pattern, subjectSet, max.mismatch=mismatches)
matchIndex <- which(as.logical(countIndex(tmp)))
## only one match per sequence so indices preserved with unlist
VMcoords <- unlist(tmp)
## Matches to subjectSet[3]
DNAStringSet(subjectSet[matchIndex],end(VMcoords)+1,
width(subjectSet)[matchIndex])
  A DNAStringSet instance of length 3
    width seq                                               names

[1] 20 GGGGGGGGGGGGGGGGGGGG Primer0 +poly(T) [2] 20 GGGGGGGGGGGGGGGGGGGG Primer1 +poly(T) [3] 20 GGGGGGGGGGGGGGGGGGGG Primer2 +poly(T)

cheers,


 Marcus


sessionInfo()
R version 2.10.0 (2009-10-26)
powerpc-apple-darwin8.11.1

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu95av2probe_2.5.0 AnnotationDbi_1.8.1 Biobase_2.6.1
[4] ShortRead_1.4.0     lattice_0.17-26     BSgenome_1.14.2
[7] Biostrings_2.14.8   IRanges_1.4.9

loaded via a namespace (and not attached):
[1] DBI_0.2-4 RSQLite_0.7-3 grid_2.10.0 hwriter_1.1 tools_2.10.0

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to