On Oct 30, 2010, at 12:46 AM, Kunbin Qu wrote:
I tried that, it did not work (and I do not quite understand why
you suggest Rpattern="TGGT" or "TGAC"):
t1<-DNAString("AAAATGGTAC")
t2<-DNAString("TGGT")
t3<-trimLRPatterns(Rpattern=t2,subject=t1)
t3
10-letter "DNAString" instance
seq: AAAATGGTAC
t3<-trimLRPatterns(Rpattern=t2,subject=t1, max.Rmismatch=rep(0,4))
t3
10-letter "DNAString" instance
seq: AAAATGGTAC
t2<-DNAString("TGAC")
trimLRPatterns(Rpattern=t2,subject=t1, max.Rmismatch=rep(0,4))
10-letter "DNAString" instance
seq: AAAATGGTAC
I got the impression that you wanted your pattern to start with TG
but wanted any
trimming confined to letters 7-10 of your t1 (especially leaving the
TG at 5-6).
Either of "TGGT" or "TGAC" might suffice depending on the mismatch
limits.
For example, TGAC becomes TG|GT|AC by inserting GT, and it also
becomes AC after
deleting TG, so the following call determines that there is a
permissible fuzzy
match for TGAC, ending at the end of the subject, and it trims 4
letters:
> trimLRPatterns(Rpattern="TGAC", subject="AAAATGGTAC",
with.Rindels=T, max.Rmismatch=2)
[1] "AAAATG"
The supporting C code finds the AC match, but doesn't return that
information. We
currently don't use the heavier pairwiseAlignment machinery, which
could return all
possible lengths of matches, especially the maximum (6: TGGTAC). So
trimLRPatterns
trims 4 letters as an approximation for the range [2, 6] of possible
match lengths.
It would seem contrary to your wishes, but I have lobbied for using
the maximum of
that range, so TGGTAC would get trimmed, or maybe as a compromise,
half-way toward
that maximum, so trim GGTAC. I have experimented with a new option
of this sort:
0 <= "R.extra.trim" <= max.Rmismatch
which can be implemented by post-processing (after using ranges=TRUE
internally).
We should talk privately if I still don't understand your goal.
If you mean to allow trimming a pattern like "TG.." in letters 7-10
or less of your subject, even
if TG occurs in letters 5-6, as in your t1, then you might want
Rpattern="TGGT", and max.Rmismatch
should have length 4 (or less, if you want to prevent trimming less
than 4 letters).
If I didn't get the point, a longer answer:
Usually, I over-trimmed when I set the mismatch limits too high
(especially with indels allowed),
probably when I used the "rate" feature for the mismatch parameter,
and that resulted in certain
mismatch elements being larger than I wanted. But you have zero
mismatches allowed, and your whole
Rpattern occurs exactly, as the final 6 letters of your subject. So
your example is a good example
of what the function does (except that you don't need to allow
indels, and you could get your t4 as
the value of trimLRPatterns by using ranges=F, the default). You can
put -1 in the mismatch vector,
for example
M = some integer
trimLRPatterns(Rpattern=t2, subject=t1, max.Rmismatch = c(rep(-1,5),
M))
This allows the 6-letter suffix of your subject to be trimmed if it
matches your Rpattern with M
or less errors, but it prevents trimming by anything shorter.
Actually, your rep(0,4) is the same
as c(rep(-1,2), rep(0,4)). That is, trimming by 6 or 5 letters is
allowed, but not by less. Thus,
trimLRPatterns(Rpattern="TGGTAC", subject="AAAATGGTAC",
max.Rmismatch=0)
[1] "AAAA"
trimLRPatterns(Rpattern="TGGTAC", subject="AAAATGGTAC",
max.Rmismatch=c(rep(-1,5), 0))
[1] "AAAA"
You can even put -1 as the 6th element of max.Rmismatch, but then you
might as well start with
Rpattern="TGGTA"; or -1 in positions 5 and 6, so, Rpattern="TGGT".
______________________________________________________________________
The contents of this electronic message, including any attachments,
are intended only for the use of the individual or entity to which
they are addressed and may contain confidential information. If you
are not the intended recipient, you are hereby notified that any
use, dissemination, distribution, or copying of this message or any
attachment is strictly prohibited. If you have received this
transmission in error, please send an e-mail to
[email protected] and delete this message, along with
any attachments, from your computer.
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing