Hi,
I have some 454 data which is expected to contain a primer at the beginning.

I noticed that using trimLRPatterns to trim an Lpattern at the 5' start of a
sequence with 1 mismatch does not allow matches to the subject sequence up
to the number of mismatches directly to the right of the sequence start (and
also to the left of an Rpattern at the 3' end).

So if my example primer Lpattern is 23 bases in length, trimLRPatterns with
1 mismatch will match all sequences at positions 0 to 22 and 1 to 23, but
not 2 to 24 whereas if you use vmatchPattern, it will find all three of
these positions.

My question is should an option/arg in trimLRPatterns be made available that
allows matches to the right of the Lpattern up to the number of mismatches,
and to the left of the Rpattern up to the number of mismatches?

I think the arg e.g. 'with.Lindels=TRUE' (which appears to have not been
enabled yet in Biostrings_2.14.8) may partially resolve this if the
insertion is somewhere at the beginning of a subject sequence  (within the
number of mismatches allowed) which can then be matched by the Lpattern.


> mismatches <- 1
> pattern    <- "AAGCAGTGGTATCAACGCAGAGT"
> w          <- width(pattern)
> mm         <- rep(mismatches, w)
> base       <- "G"
> n          <- 20
>
> subjectList <- list(
+                     "Primer0+poly(T)" = paste(substring(pattern,2,w),
polyn(base, n), sep=""),
+                     "Primer1+poly(T)"  = paste(pattern, polyn(base, n),
sep=""),
+                     "Primer2+poly(T)" = paste("A", substring(pattern,1,w),
polyn(base,n), sep="")
+                     )
>
> subjectSet <- DNAStringSet(unlist(subjectList))
>
> print(subjectSet)
  A DNAStringSet instance of length 3
    width seq                                               names

[1]    42 AGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG        Primer0+poly(T)
[2]    43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG       Primer1+poly(T)
[3]    44 AAAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG      Primer2+poly(T)
> cat("Primer:  ", pattern, "\n")
Primer:  AAGCAGTGGTATCAACGCAGAGT
>
> ## trimLRPatterns -LPattern
> LRcoords     <- trimLRPatterns(Lpattern = pattern, subject = subjectSet,
 max.Lmismatch=mm, ranges=TRUE, with.Lindels=FALSE)
> ## Fails to match subjectSet[3:4] - trimming with mismatches is to the
left of Lpattern (and to the right of Rpattern)
> DNAStringSet(subjectSet,start(LRcoords), end(LRcoords))
  A DNAStringSet instance of length 3
    width seq                                               names

[1]    20 GGGGGGGGGGGGGGGGGGGG                              Primer0+poly(T)
[2]    20 GGGGGGGGGGGGGGGGGGGG                              Primer1+poly(T)
[3]    43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG       Primer2+poly(T)
>
> ## vmatchPattern
> tmp <- vmatchPattern(pattern, subjectSet, max.mismatch=mismatches)
> matchIndex <- which(as.logical(countIndex(tmp)))
> ## only one match per sequence so indices preserved with unlist
> VMcoords <- unlist(tmp)
> ## Matches to subjectSet[3]
> DNAStringSet(subjectSet[matchIndex],end(VMcoords)+1,
width(subjectSet)[matchIndex])
  A DNAStringSet instance of length 3
    width seq                                               names

[1]    20 GGGGGGGGGGGGGGGGGGGG                              Primer0+poly(T)
[2]    20 GGGGGGGGGGGGGGGGGGGG                              Primer1+poly(T)
[3]    20 GGGGGGGGGGGGGGGGGGGG                              Primer2+poly(T)

cheers,


 Marcus


> sessionInfo()
R version 2.10.0 (2009-10-26)
powerpc-apple-darwin8.11.1

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu95av2probe_2.5.0 AnnotationDbi_1.8.1 Biobase_2.6.1
[4] ShortRead_1.4.0     lattice_0.17-26     BSgenome_1.14.2
[7] Biostrings_2.14.8   IRanges_1.4.9

loaded via a namespace (and not attached):
[1] DBI_0.2-4     RSQLite_0.7-3 grid_2.10.0   hwriter_1.1   tools_2.10.0

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to