Thanks for your comments,
just a note for others the arg 'with.Rindels' functionality in
trimLRPatterns is not in the current stable release of Bioconductor 2.5, you
need to install the development version
>
trimLRPatterns(Rpattern=reverse(pattern),subject=reverse(subjectSet),max.Rmismatch=rep(1,nchar(pattern)),
with.Rindels=TRUE)
Error in .matchPatternAt(pattern, subject, ending.at, 1L, max.mismatch, :
_nedit_for_Proffset() is not ready yet, sorry!
>
The primer Lpatterns shifted slightly to the right of some sequences appear
to be real in the 454 data I have.
I have 832 matches at start position 2;
> startPos
0 1 2 ...
7733 99790 832 ...
Pattern: AAGCAGTGGTATCAACGCAGAGT
Subject:AAGGCAGTGGTATCAACGCAGAGTACATGGG...
Pattern: AAGCAGTGGTATCAACGCAGAGT
Subject:AAGGCAGTGGTATCAACGCAGAGTACTTTN...
Pattern: AAGCAGTGGTATCAACGCAGAGT
Subject:AAGGCAGTGGTATCAACGCAGAGTACATGG...
So trimLRPatterns (without indels) in a simple Lpattern example with a
Lpattern of length 3 and 1 mismatch works like this;
Pos 01234
Pattern:XXX
Pattern: XXX
Subject: ---
Whereas in a trimmed start sequence of length 4, vmatchPattern would match
this;
Pos 01234
Pattern:XXX
Pattern: XXX
Pattern: XXX
Subject: ---
where any match can be a perfect match to all three bases in the subject, or
2 of the three in the subject
Marcus
On Wed, Jan 27, 2010 at 3:55 AM, Harris A. Jaffee <[email protected]> wrote:
> You can get the with.Lindels=TRUE behavior you want by reversing your data
> and calling trimLRPatterns with with.Rindels=TRUE and an Rpattern equal to
> the reverse of your Lpattern, and then reversing the results.
>
> But the behavior you like with indels might be considered a bug, or at
> least
> something that trimLRPatterns shouldn't do, in the sense that the match you
> like isn't flanking, so might be artificial. Also, setting the
> max.mismatch
> values is difficult because one can't know in advance how many letters of
> the
> current pattern will extend off the edge of the subject, which depends on
> the
> match, i.e. how many inserts into the pattern will be required. Worse,
> "the"
> match-with-indels is not at all unique. In my mind, it also isn't clear
> that
> Herve's 'best local match' (shortest one with the minimal edit distance) is
> what trimLRPatterns wants. It might want the longest match with the
> minimal
> edit distance, the longest match with maximal allowed edit distance, etc.
>
> I have a version that demands that matches-with-indels start at the
> beginning
> of the subject, for Lpattern, and end at the end of the subject, for
> Rpattern.
> [The ending part would also not be implemented yet, if I needed Proffset,
> but
> I do it by the reversing trick.] I trim by the length of the current
> pattern
> rather than the length of the match. This is a one-size-fits-all solution,
> since the match could be shorter or longer. Maybe I should trim by the
> length
> of the pattern plus the number of allowed edits. The max.mismatch limits
> are
> less circular, so easier to set.
>
> I was about to submit a patch, but now it's open to debate.
>
>
> On Jan 25, 2010, at 9:54 PM, Marcus Davy wrote:
>
> Hi,
>> I have some 454 data which is expected to contain a primer at the
>> beginning.
>>
>> I noticed that using trimLRPatterns to trim an Lpattern at the 5' start of
>> a
>> sequence with 1 mismatch does not allow matches to the subject sequence up
>> to the number of mismatches directly to the right of the sequence start
>> (and
>> also to the left of an Rpattern at the 3' end).
>>
>> So if my example primer Lpattern is 23 bases in length, trimLRPatterns
>> with
>> 1 mismatch will match all sequences at positions 0 to 22 and 1 to 23, but
>> not 2 to 24 whereas if you use vmatchPattern, it will find all three of
>> these positions.
>>
>> My question is should an option/arg in trimLRPatterns be made available
>> that
>> allows matches to the right of the Lpattern up to the number of
>> mismatches,
>> and to the left of the Rpattern up to the number of mismatches?
>>
>> I think the arg e.g. 'with.Lindels=TRUE' (which appears to have not been
>> enabled yet in Biostrings_2.14.8) may partially resolve this if the
>> insertion is somewhere at the beginning of a subject sequence (within the
>> number of mismatches allowed) which can then be matched by the Lpattern.
>>
>>
>> mismatches <- 1
>>> pattern <- "AAGCAGTGGTATCAACGCAGAGT"
>>> w <- width(pattern)
>>> mm <- rep(mismatches, w)
>>> base <- "G"
>>> n <- 20
>>>
>>> subjectList <- list(
>>>
>> + "Primer0+poly(T)" = paste(substring(pattern,2,w),
>> polyn(base, n), sep=""),
>> + "Primer1+poly(T)" = paste(pattern, polyn(base, n),
>> sep=""),
>> + "Primer2+poly(T)" = paste("A",
>> substring(pattern,1,w),
>> polyn(base,n), sep="")
>> + )
>>
>>>
>>> subjectSet <- DNAStringSet(unlist(subjectList))
>>>
>>> print(subjectSet)
>>>
>> A DNAStringSet instance of length 3
>> width seq names
>>
>> [1] 42 AGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG
>> Primer0+poly(T)
>> [2] 43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG
>> Primer1+poly(T)
>> [3] 44 AAAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG
>> Primer2+poly(T)
>>
>>> cat("Primer: ", pattern, "\n")
>>>
>> Primer: AAGCAGTGGTATCAACGCAGAGT
>>
>>>
>>> ## trimLRPatterns -LPattern
>>> LRcoords <- trimLRPatterns(Lpattern = pattern, subject = subjectSet,
>>>
>> max.Lmismatch=mm, ranges=TRUE, with.Lindels=FALSE)
>>
>>> ## Fails to match subjectSet[3:4] - trimming with mismatches is to the
>>>
>> left of Lpattern (and to the right of Rpattern)
>>
>>> DNAStringSet(subjectSet,start(LRcoords), end(LRcoords))
>>>
>> A DNAStringSet instance of length 3
>> width seq names
>>
>> [1] 20 GGGGGGGGGGGGGGGGGGGG
>> Primer0+poly(T)
>> [2] 20 GGGGGGGGGGGGGGGGGGGG
>> Primer1+poly(T)
>> [3] 43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG
>> Primer2+poly(T)
>>
>>>
>>> ## vmatchPattern
>>> tmp <- vmatchPattern(pattern, subjectSet, max.mismatch=mismatches)
>>> matchIndex <- which(as.logical(countIndex(tmp)))
>>> ## only one match per sequence so indices preserved with unlist
>>> VMcoords <- unlist(tmp)
>>> ## Matches to subjectSet[3]
>>> DNAStringSet(subjectSet[matchIndex],end(VMcoords)+1,
>>>
>> width(subjectSet)[matchIndex])
>> A DNAStringSet instance of length 3
>> width seq names
>>
>> [1] 20 GGGGGGGGGGGGGGGGGGGG
>> Primer0+poly(T)
>> [2] 20 GGGGGGGGGGGGGGGGGGGG
>> Primer1+poly(T)
>> [3] 20 GGGGGGGGGGGGGGGGGGGG
>> Primer2+poly(T)
>>
>> cheers,
>>
>>
>> Marcus
>>
>>
>> sessionInfo()
>>>
>> R version 2.10.0 (2009-10-26)
>> powerpc-apple-darwin8.11.1
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] hgu95av2probe_2.5.0 AnnotationDbi_1.8.1 Biobase_2.6.1
>> [4] ShortRead_1.4.0 lattice_0.17-26 BSgenome_1.14.2
>> [7] Biostrings_2.14.8 IRanges_1.4.9
>>
>> loaded via a namespace (and not attached):
>> [1] DBI_0.2-4 RSQLite_0.7-3 grid_2.10.0 hwriter_1.1 tools_2.10.0
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
>
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing