You hint that your trimLen is > 50, yet when you show your
max.Rmismatch, it's clearly of length 47, composed
of rep(1,12) and rep(2,N) with N=trimLen-15. So N=35 and
trimLen=50. A max.Rmismatch of length 47 (= 50-3)
will be augmented by rep(-1,3) at the bottom, preventing any matches
of an adapter prefix of length 3 or less
with a read suffix. This is an effect that you want, but to allow
more mismatch tolerance for longer adapter
prefixes, against read suffixes, your max.Rmismatch vector must be
monotone increasing, even if not strictly,
rather than decreasing. So your max.Rmismatch is at least reversed.
Its max (last value) might also be > 2.
Perhaps like this:
> ceiling(1:47 * 1/12)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
3 3 3 3 4 4
[39] 4 4 4 4 4 4 4 4 4
This will be better than any max.Rmismatch=e, with e in (0,1).
-Harris
On Feb 22, 2011, at 11:10 PM, Kunbin Qu wrote:
Hi,
I have a 50 cycle RNA-seq run, with variable length of adapters in
some reads. The length of the adapter is longer than 50 bp, and the
portion in each read could be as high as the full length of 50
cycles, as well as zero bp. When I trim the adapter, I do not want
over trim, so the max.Rmismatch parameter I used was shorter than
(eg, 3 bp shorter) than the full length, also with more mismatch
tolerance for longer portion, less mismatch for short segment. So I
wrote like the following:
trimLen<-length(adapter)
trimCoords<-trimLRPatterns(Rpattern=adapter, subject=seqs,
max.Rmismatch = c(rep(2,trimLen-15), rep(1, 12)), ranges=T,
with.Rindels=TRUE)
the value for the max.Rmismatch is basically:
c(rep(2,trimLen-15), rep(1, 12))
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 1 1 1
[39] 1 1 1 1 1 1 1 1 1
So I want to leave the last 3 bp alone (to avoid over-trimming by
chance), then followed by 12 bp with 1 mismatch, then 35 bp with 2
mismatch. Somehow when I tried this to the read, it does not seem
work, the trimmed reads really do not seem what they should be.
Anyone can help me on this? Thanks
-Kunbin
sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.6.2 Rsamtools_1.0.1 lattice_0.19-11
[4] Biostrings_2.16.7 GenomicRanges_1.0.1 IRanges_1.6.8
loaded via a namespace (and not attached):
[1] Biobase_2.8.0 grid_2.11.0 hwriter_1.3 tools_2.11.0
______________________________________________________________________
The contents of this electronic message, including any attachments,
are intended only for the use of the individual or entity to which
they are addressed and may contain confidential information. If you
are not the intended recipient, you are hereby notified that any
use, dissemination, distribution, or copying of this message or any
attachment is strictly prohibited. If you have received this
transmission in error, please send an e-mail to
[email protected] and delete this message, along with
any attachments, from your computer.
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing