Hi,

I have read the Feb 2009 archives and have been trying to filter out a lot of primer reads to see what short reads remain. The small RNA primer (TCGTATGCCGTCTTCTGCTTG) attached to a series of A's accounts for most of the contamination in the reads that I would like to filter.

-------------------------------------------------------
dist1 <- srdistance(clean(fq4), "TCGTATGCCGTCTTCTGCTTGAAAAAAAAAA")
table(dist1[[1]])

   4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
9338  789  406  121 2094  240  184   55  332   78   90   25   68   16   62   31
  20   21   22   23   24   25   26   28   29
 166  550  623  640  318   65    6    1    4

f <- fq4[dist1[[1]] <5]
   [1] 35 NTAGTACTCTGCGTTGTGGCCGCAGCCACCTCGGT
   [2] 35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   [3] 35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   [4] 35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   [5] 35 NCTGGACTTGGAGTCAGAAGATCTCGTATGCCGTC
   [6] 35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   [7] 35 GGTATGATTCTCGCATCTCGTATGCCGTCTTCTGC
   [8] 35 GGTATGATTCTCGCATCTCGTATGCCGTCTCCTGC
   [9] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   ... ... ...
[9363] 35 TCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA
[9364] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[9365] 35 TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAA
[9366] 35 ATATAATACAACCTGCTAAGTGATCTCGTATGCCG
[9367] 35 ATCTCGTATGCCGTCTTCTGCTTGACAAAAAAAAA
[9368] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAACAACAA
[9369] 35 ATCTCGTATGCCGTCTTCTGCTTGAACCACACAAA
[9370] 35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAAAACCA
[9371] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA

f <- fq4[dist1[[1]] >28]
[1] 35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[2] 35 CGATCATCTCGTATGCCGTCTTCTGCTTGAAAAAA
[3] 35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAACAACC
[4] 35 CAGCAATCTCGTATGCCGTCTTCTGCTTGAAAAAA
---------------------------------------------------------

You can see that I am not doing a good job of filtering. The d < 5 subset includes some sequences free of primer that I would want to keep, while the d > 28 subset still contains reads that clearly carry the primer. I have tried the polyn function, but it does not work for me when I use a series of 10-20 A's (fewer than the 35-base read length).

Would someone be able to give me some suggestions?
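One possible cause worth checking (an assumption, not something confirmed in this post): the distances are computed on clean(fq4), but the logical index is then applied to fq4 itself. If clean() has dropped the reads containing N, the two objects no longer line up, and a shorter logical index may be recycled. The d < 5 subset printed above has 9371 reads, while the table reports 9338 reads at distance 4, which would be consistent with such a mismatch. The sketch below reproduces the effect with base R only: plain character vectors stand in for the ShortReadQ object, grepl() stands in for clean(), adist() stands in for srdistance(), and the third read is made up for illustration.

```r
# Toy stand-ins (hypothetical data), not real ShortRead objects.
adapter <- "TCGTATGCCGTCTTCTGCTTGAAAAAAAAAA"
reads <- c(
  "NTAGTACTCTGCGTTGTGGCCGCAGCCACCTCGGT",  # contains N, no adapter
  "ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA",  # adapter read
  "GGTTCAGAGTTCTACAGTCCGACGATCATTGGCAT",  # made-up read, no N, no adapter
  "TCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA"   # adapter read
)

# Stand-in for clean(): drop reads that contain an ambiguous base.
cleaned <- reads[!grepl("N", reads, fixed = TRUE)]

# Stand-in for srdistance(): whole-string edit distance to the adapter.
d <- as.vector(adist(cleaned, adapter))
hits <- d < 5

# Misaligned: index computed on 'cleaned' (length 3) applied to 'reads' (length 4).
# R recycles the shorter logical vector, so unrelated reads are selected.
misaligned <- reads[hits]

# Aligned: subset the same object the distances were computed on.
aligned <- cleaned[hits]

print(misaligned)
print(aligned)
```

With the real data the equivalent change would be to store the cleaned object once (for example fq4c <- clean(fq4), a hypothetical name) and subset fq4c rather than fq4. This only addresses the alignment of the index; reads where the primer is shifted or truncated at the 3' end would still have a large whole-read distance.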
sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-12 r47905)
i386-pc-mingw32

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.1.50   lattice_0.17-20    BSgenome_1.11.9    Biostrings_2.11.42
[5] IRanges_1.1.54

loaded via a namespace (and not attached):
[1] Biobase_2.3.11     grid_2.9.0         hwriter_1.1        Matrix_0.999375-20

Lana Schaffer
Biostatistics/Informatics
The Scripps Research Institute
DNA Array Core Facility
La Jolla, CA 92037
(858) 784-2263
(858) 784-2994
