It works. It produces and answer in under 2 minutes. I will flesh it out a bit for posterity.
Some slight modifications must be applied. First, mySeq cannot be of class character. So, it works if I do countPDict(PDict(sread(A)), as(mySeq, "DNAString")) Now, that function outputs how many times a tag is found in mySeq. To compute how many tags match mySeq once or more, I have to do sum(countPDict(PDict(sread(A)), as(mySeq, "DNAString"))!=0) By the way, this could have been done with perl or python or any other tools. However, it helps to learn how to do it efficiently from within the Bioconductor. Thank you. Ivan On Fri, Jul 23, 2010 at 10:56 AM, Patrick Aboyoun <[email protected]> wrote: > Ivan, > How about > > countPDict(PDict(sread(A)), mySeq) > > > Patrick > > > On 7/23/10 7:45 AM, Ivan Gregoretti wrote: >> >> Hello Patrick, >> >> The idea of vcountPattern is good but it does not quite work for two >> reasons >> >> 1) mySeq is ~40kb. That is quite big and vcountPattern() throws the error >> >> >>> >>> vcountPattern(mySeq, sread(A)) >>> >> >> Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels, : >> patterns with more than 20000 letters are not supported >> >> 2) vcountPattern is designed to find a motif (small) contained in a >> genome (large), like this >> vcountPattern("GCCACCAGGGGGCGC", Mmusculus) >> >> In my case, I have millions of motifs (the 36 bp tags) that I have to >> find if they are contained in my single ~40kb. Its like a reverse >> scenario. So, if I try reversing the arguments, I also get an error: >> >> >>> >>> vcountPattern(sread(A), mySeq) >>> >> >> Error in normargPattern(pattern, subject) : >> 'pattern' must be a single string or an XString object >> >> Any more suggestions? >> >> Thank you, >> >> Ivan >> >> >>> >>> sessionInfo() >>> >> >> R version 2.12.0 Under development (unstable) (2010-03-25 r51410) >> x86_64-unknown-linux-gnu >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> LC_TIME=en_US.UTF-8 >> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=C >> LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 >> LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] annotate_1.27.1 AnnotationDbi_1.11.4 Biobase_2.9.0 >> ShortRead_1.7.9 >> [5] Rsamtools_1.1.8 lattice_0.18-8 Biostrings_2.17.24 >> GenomicRanges_1.1.17 >> [9] IRanges_1.7.12 >> >> loaded via a namespace (and not attached): >> [1] DBI_0.2-5 grid_2.12.0 hwriter_1.2 RSQLite_0.9-1 xtable_1.5-6 >> an >> > > _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
