Hi,
I have read Feb 2009 archives and have been trying to
filter alot of primer reads to see what I short reads
remaining.
The small RNA primer (TCGTATGCCGTCTTCTGCTTG) attached to
a series of A's is most contamination of the reads that
I would like to filter.
-------------------------------------------------------
dist1 <- srdistance(clean(fq4), "TCGTATGCCGTCTTCTGCTTGAAAAAAAAAA")
table(dist1[[1]])
   4    5    6    7    8    9   10   11   12   13   14   15   16   17
18   19 
9338  789  406  121 2094  240  184   55  332   78   90   25   68   16
62   31 
  20   21   22   23   24   25   26   28   29 
 166  550  623  640  318   65    6    1    4 

f <- fq4[dist1[[1]] <5]
   [1]    35 NTAGTACTCTGCGTTGTGGCCGCAGCCACCTCGGT
   [2]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   [3]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   [4]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   [5]    35 NCTGGACTTGGAGTCAGAAGATCTCGTATGCCGTC
   [6]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   [7]    35 GGTATGATTCTCGCATCTCGTATGCCGTCTTCTGC
   [8]    35 GGTATGATTCTCGCATCTCGTATGCCGTCTCCTGC
   [9]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
   ...   ... ...
[9363]    35 TCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA
[9364]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[9365]    35 TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAA
[9366]    35 ATATAATACAACCTGCTAAGTGATCTCGTATGCCG
[9367]    35 ATCTCGTATGCCGTCTTCTGCTTGACAAAAAAAAA
[9368]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAACAACAA
[9369]    35 ATCTCGTATGCCGTCTTCTGCTTGAACCACACAAA
[9370]    35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAAAACCA
[9371]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA

f <- fq4[dist1[[1]] >28]
[1]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
[2]    35 CGATCATCTCGTATGCCGTCTTCTGCTTGAAAAAA
[3]    35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAACAACC
[4]    35 CAGCAATCTCGTATGCCGTCTTCTGCTTGAAAAAA
---------------------------------------------------------
You can see that I am not doing a good filtering job.
d<5 is showing some sequences free of primer that I would
want to save. 
I have tried the polyn function, but that does not work for me
when I use a series of 10-20 A's (<35).  

Would someone be able to give me some suggestions?


sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-12 r47905) 
i386-pc-mingw32 
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.1.50   lattice_0.17-20    BSgenome_1.11.9
Biostrings_2.11.42
[5] IRanges_1.1.54    
loaded via a namespace (and not attached):
[1] Biobase_2.3.11     grid_2.9.0         hwriter_1.1
Matrix_0.999375-20



Lana Schaffer
Biostatistics/Informatics
The Scripps Research Institute
DNA Array Core Facility
La Jolla, CA 92037
(858) 784-2263
(858) 784-2994
[email protected] 

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to