Re: [Bioc-sig-seq] trimLRPatterns vs vmatchPattern

Marcus Davy Wed, 27 Jan 2010 13:27:54 -0800

Thanks for your comments,

just a note for others the arg 'with.Rindels' functionality in
trimLRPatterns is not in the current stable release of Bioconductor 2.5, you
need to install the development version


>
trimLRPatterns(Rpattern=reverse(pattern),subject=reverse(subjectSet),max.Rmismatch=rep(1,nchar(pattern)),
with.Rindels=TRUE)
Error in .matchPatternAt(pattern, subject, ending.at, 1L, max.mismatch,  :
  _nedit_for_Proffset() is not ready yet, sorry!
>

The primer Lpatterns shifted slightly to the right of some sequences appear
to be real in the 454 data I have.
I have 832 matches at start position 2;

> startPos
    0     1     2    ...
 7733 99790   832    ...


Pattern: AAGCAGTGGTATCAACGCAGAGT
Subject:AAGGCAGTGGTATCAACGCAGAGTACATGGG...
Pattern: AAGCAGTGGTATCAACGCAGAGT
Subject:AAGGCAGTGGTATCAACGCAGAGTACTTTN...
Pattern: AAGCAGTGGTATCAACGCAGAGT
Subject:AAGGCAGTGGTATCAACGCAGAGTACATGG...


So trimLRPatterns (without indels) in a simple Lpattern example with a
Lpattern of length 3 and 1 mismatch works like this;


Pos     01234
Pattern:XXX
Pattern: XXX
Subject: ---

Whereas in a trimmed start sequence of length 4, vmatchPattern would match
this;

Pos     01234
Pattern:XXX
Pattern: XXX
Pattern:  XXX
Subject: ---

where any match can be a perfect match to all three bases in the subject, or
2 of the three in the subject


Marcus


On Wed, Jan 27, 2010 at 3:55 AM, Harris A. Jaffee <[email protected]> wrote:

> You can get the with.Lindels=TRUE behavior you want by reversing your data
> and calling trimLRPatterns with with.Rindels=TRUE and an Rpattern equal to
> the reverse of your Lpattern, and then reversing the results.
>
> But the behavior you like with indels might be considered a bug, or at
> least
> something that trimLRPatterns shouldn't do, in the sense that the match you
> like isn't flanking, so might be artificial.  Also, setting the
> max.mismatch
> values is difficult because one can't know in advance how many letters of
> the
> current pattern will extend off the edge of the subject, which depends on
> the
> match, i.e. how many inserts into the pattern will be required.  Worse,
> "the"
> match-with-indels is not at all unique.  In my mind, it also isn't clear
> that
> Herve's 'best local match' (shortest one with the minimal edit distance) is
> what trimLRPatterns wants.  It might want the longest match with the
> minimal
> edit distance, the longest match with maximal allowed edit distance, etc.
>
> I have a version that demands that matches-with-indels start at the
> beginning
> of the subject, for Lpattern, and end at the end of the subject, for
> Rpattern.
> [The ending part would also not be implemented yet, if I needed Proffset,
> but
> I do it by the reversing trick.]  I trim by the length of the current
> pattern
> rather than the length of the match.  This is a one-size-fits-all solution,
> since the match could be shorter or longer.  Maybe I should trim by the
> length
> of the pattern plus the number of allowed edits.  The max.mismatch limits
> are
> less circular, so easier to set.
>
> I was about to submit a patch, but now it's open to debate.
>
>
> On Jan 25, 2010, at 9:54 PM, Marcus Davy wrote:
>
>  Hi,
>> I have some 454 data which is expected to contain a primer at the
>> beginning.
>>
>> I noticed that using trimLRPatterns to trim an Lpattern at the 5' start of
>> a
>> sequence with 1 mismatch does not allow matches to the subject sequence up
>> to the number of mismatches directly to the right of the sequence start
>> (and
>> also to the left of an Rpattern at the 3' end).
>>
>> So if my example primer Lpattern is 23 bases in length, trimLRPatterns
>> with
>> 1 mismatch will match all sequences at positions 0 to 22 and 1 to 23, but
>> not 2 to 24 whereas if you use vmatchPattern, it will find all three of
>> these positions.
>>
>> My question is should an option/arg in trimLRPatterns be made available
>> that
>> allows matches to the right of the Lpattern up to the number of
>> mismatches,
>> and to the left of the Rpattern up to the number of mismatches?
>>
>> I think the arg e.g. 'with.Lindels=TRUE' (which appears to have not been
>> enabled yet in Biostrings_2.14.8) may partially resolve this if the
>> insertion is somewhere at the beginning of a subject sequence  (within the
>> number of mismatches allowed) which can then be matched by the Lpattern.
>>
>>
>>  mismatches <- 1
>>> pattern    <- "AAGCAGTGGTATCAACGCAGAGT"
>>> w          <- width(pattern)
>>> mm         <- rep(mismatches, w)
>>> base       <- "G"
>>> n          <- 20
>>>
>>> subjectList <- list(
>>>
>> +                     "Primer0+poly(T)" = paste(substring(pattern,2,w),
>> polyn(base, n), sep=""),
>> +                     "Primer1+poly(T)"  = paste(pattern, polyn(base, n),
>> sep=""),
>> +                     "Primer2+poly(T)" = paste("A",
>> substring(pattern,1,w),
>> polyn(base,n), sep="")
>> +                     )
>>
>>>
>>> subjectSet <- DNAStringSet(unlist(subjectList))
>>>
>>> print(subjectSet)
>>>
>>  A DNAStringSet instance of length 3
>>    width seq                                               names
>>
>> [1]    42 AGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG
>>  Primer0+poly(T)
>> [2]    43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG
>> Primer1+poly(T)
>> [3]    44 AAAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG
>>  Primer2+poly(T)
>>
>>> cat("Primer:  ", pattern, "\n")
>>>
>> Primer:  AAGCAGTGGTATCAACGCAGAGT
>>
>>>
>>> ## trimLRPatterns -LPattern
>>> LRcoords     <- trimLRPatterns(Lpattern = pattern, subject = subjectSet,
>>>
>>  max.Lmismatch=mm, ranges=TRUE, with.Lindels=FALSE)
>>
>>> ## Fails to match subjectSet[3:4] - trimming with mismatches is to the
>>>
>> left of Lpattern (and to the right of Rpattern)
>>
>>> DNAStringSet(subjectSet,start(LRcoords), end(LRcoords))
>>>
>>  A DNAStringSet instance of length 3
>>    width seq                                               names
>>
>> [1]    20 GGGGGGGGGGGGGGGGGGGG
>>  Primer0+poly(T)
>> [2]    20 GGGGGGGGGGGGGGGGGGGG
>>  Primer1+poly(T)
>> [3]    43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG
>> Primer2+poly(T)
>>
>>>
>>> ## vmatchPattern
>>> tmp <- vmatchPattern(pattern, subjectSet, max.mismatch=mismatches)
>>> matchIndex <- which(as.logical(countIndex(tmp)))
>>> ## only one match per sequence so indices preserved with unlist
>>> VMcoords <- unlist(tmp)
>>> ## Matches to subjectSet[3]
>>> DNAStringSet(subjectSet[matchIndex],end(VMcoords)+1,
>>>
>> width(subjectSet)[matchIndex])
>>  A DNAStringSet instance of length 3
>>    width seq                                               names
>>
>> [1]    20 GGGGGGGGGGGGGGGGGGGG
>>  Primer0+poly(T)
>> [2]    20 GGGGGGGGGGGGGGGGGGGG
>>  Primer1+poly(T)
>> [3]    20 GGGGGGGGGGGGGGGGGGGG
>>  Primer2+poly(T)
>>
>> cheers,
>>
>>
>>  Marcus
>>
>>
>>  sessionInfo()
>>>
>> R version 2.10.0 (2009-10-26)
>> powerpc-apple-darwin8.11.1
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] hgu95av2probe_2.5.0 AnnotationDbi_1.8.1 Biobase_2.6.1
>> [4] ShortRead_1.4.0     lattice_0.17-26     BSgenome_1.14.2
>> [7] Biostrings_2.14.8   IRanges_1.4.9
>>
>> loaded via a namespace (and not attached):
>> [1] DBI_0.2-4     RSQLite_0.7-3 grid_2.10.0   hwriter_1.1   tools_2.10.0
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] trimLRPatterns vs vmatchPattern

Reply via email to