Lana,
Given that the primer is smaller than the read and that the primer, if
present, is expected to reside on the 5' side of the read, I recommend
you use the pairwiseAlignment function directly to generate overlap
alignment scores, rather than use the global pairwise alignment scores
generated by the srdistance function:
cleanedReads <- clean(fq4)
primer <- DNAString("TCGTATGCCGTCTTCTGCTTG")
submat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1,
baseOnly = TRUE)
dist1 <- - pairwiseAlignment(cleanedReads, primer, type = "overlap",
substitutionMatrix = submat,
gapOpening = 0, gapExtension = -1,
scoreOnly = TRUE)
f <- cleanedReads[dist1 < 5]
Patrick
Martin Morgan wrote:
"Lana Schaffer" <[email protected]> writes:
Hi,
I have read Feb 2009 archives and have been trying to
filter alot of primer reads to see what I short reads
remaining.
The small RNA primer (TCGTATGCCGTCTTCTGCTTG) attached to
a series of A's is most contamination of the reads that
I would like to filter.
-------------------------------------------------------
dist1 <- srdistance(clean(fq4), "TCGTATGCCGTCTTCTGCTTGAAAAAAAAAA")
table(dist1[[1]])
4 5 6 7 8 9 10 11 12 13 14 15 16 17
18 19
9338 789 406 121 2094 240 184 55 332 78 90 25 68 16
62 31
20 21 22 23 24 25 26 28 29
166 550 623 640 318 65 6 1 4
f <- fq4[dist1[[1]] <5]
clean(fq4) != fq4, so if this is your code you're subsetting the wrong
object.
Martin
[1] 35 NTAGTACTCTGCGTTGTGGCCGCAGCCACCTCGGT
[2] 35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[3] 35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[4] 35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[5] 35 NCTGGACTTGGAGTCAGAAGATCTCGTATGCCGTC
[6] 35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[7] 35 GGTATGATTCTCGCATCTCGTATGCCGTCTTCTGC
[8] 35 GGTATGATTCTCGCATCTCGTATGCCGTCTCCTGC
[9] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
... ... ...
[9363] 35 TCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA
[9364] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[9365] 35 TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAA
[9366] 35 ATATAATACAACCTGCTAAGTGATCTCGTATGCCG
[9367] 35 ATCTCGTATGCCGTCTTCTGCTTGACAAAAAAAAA
[9368] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAACAACAA
[9369] 35 ATCTCGTATGCCGTCTTCTGCTTGAACCACACAAA
[9370] 35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAAAACCA
[9371] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
f <- fq4[dist1[[1]] >28]
[1] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[2] 35 CGATCATCTCGTATGCCGTCTTCTGCTTGAAAAAA
[3] 35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAACAACC
[4] 35 CAGCAATCTCGTATGCCGTCTTCTGCTTGAAAAAA
---------------------------------------------------------
You can see that I am not doing a good filtering job.
d<5 is showing some sequences free of primer that I would
want to save.
I have tried the polyn function, but that does not work for me
when I use a series of 10-20 A's (<35).
Would someone be able to give me some suggestions?
sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-12 r47905)
i386-pc-mingw32
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.1.50 lattice_0.17-20 BSgenome_1.11.9
Biostrings_2.11.42
[5] IRanges_1.1.54
loaded via a namespace (and not attached):
[1] Biobase_2.3.11 grid_2.9.0 hwriter_1.1
Matrix_0.999375-20
Lana Schaffer
Biostatistics/Informatics
The Scripps Research Institute
DNA Array Core Facility
La Jolla, CA 92037
(858) 784-2263
(858) 784-2994
[email protected]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing