Indeed, its slightly faster: 103 seconds. countPDcit() was 117 seconds. Excellent. Thank you again.
Ivan On Fri, Jul 23, 2010 at 11:52 AM, Patrick Aboyoun <[email protected]> wrote: > Ivan, > Sorry for the evolving answer, but this may prove to be faster > > length(whichPDict(PDict(sread(A)), as(mySeq, "DNAString"))) > > > Patrick > > > On 7/23/10 8:28 AM, Ivan Gregoretti wrote: >> >> It works. It produces and answer in under 2 minutes. I will flesh it >> out a bit for posterity. >> >> Some slight modifications must be applied. First, mySeq cannot be of >> class character. So, it works if I do >> >> countPDict(PDict(sread(A)), as(mySeq, "DNAString")) >> >> Now, that function outputs how many times a tag is found in mySeq. To >> compute how many tags match mySeq once or more, I have to do >> >> sum(countPDict(PDict(sread(A)), as(mySeq, "DNAString"))!=0) >> >> >> By the way, this could have been done with perl or python or any other >> tools. However, it helps to learn how to do it efficiently from within >> the Bioconductor. >> >> Thank you. >> >> Ivan >> >> >> >> On Fri, Jul 23, 2010 at 10:56 AM, Patrick Aboyoun<[email protected]> >> wrote: >> >>> >>> Ivan, >>> How about >>> >>> countPDict(PDict(sread(A)), mySeq) >>> >>> >>> Patrick >>> >>> >>> On 7/23/10 7:45 AM, Ivan Gregoretti wrote: >>> >>>> >>>> Hello Patrick, >>>> >>>> The idea of vcountPattern is good but it does not quite work for two >>>> reasons >>>> >>>> 1) mySeq is ~40kb. That is quite big and vcountPattern() throws the >>>> error >>>> >>>> >>>> >>>>> >>>>> vcountPattern(mySeq, sread(A)) >>>>> >>>>> >>>> >>>> Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels, >>>> : >>>> patterns with more than 20000 letters are not supported >>>> >>>> 2) vcountPattern is designed to find a motif (small) contained in a >>>> genome (large), like this >>>> vcountPattern("GCCACCAGGGGGCGC", Mmusculus) >>>> >>>> In my case, I have millions of motifs (the 36 bp tags) that I have to >>>> find if they are contained in my single ~40kb. Its like a reverse >>>> scenario. So, if I try reversing the arguments, I also get an error: >>>> >>>> >>>> >>>>> >>>>> vcountPattern(sread(A), mySeq) >>>>> >>>>> >>>> >>>> Error in normargPattern(pattern, subject) : >>>> 'pattern' must be a single string or an XString object >>>> >>>> Any more suggestions? >>>> >>>> Thank you, >>>> >>>> Ivan >>>> >>>> >>>> >>>>> >>>>> sessionInfo() >>>>> >>>>> >>>> >>>> R version 2.12.0 Under development (unstable) (2010-03-25 r51410) >>>> x86_64-unknown-linux-gnu >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> LC_TIME=en_US.UTF-8 >>>> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=C >>>> LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>> LC_ADDRESS=C >>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 >>>> LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] annotate_1.27.1 AnnotationDbi_1.11.4 Biobase_2.9.0 >>>> ShortRead_1.7.9 >>>> [5] Rsamtools_1.1.8 lattice_0.18-8 Biostrings_2.17.24 >>>> GenomicRanges_1.1.17 >>>> [9] IRanges_1.7.12 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] DBI_0.2-5 grid_2.12.0 hwriter_1.2 RSQLite_0.9-1 xtable_1.5-6 >>>> an >>>> >>>> >>> >>> > > _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
