Ivan,
Sorry for the evolving answer, but this may prove to be faster:
length(whichPDict(PDict(sread(A)), as(mySeq, "DNAString")))
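A minimal sketch of the same idea on toy data, in case it helps. The objects tags and subject are hypothetical stand-ins for sread(A) and mySeq; whichPDict() returns the indices of the tags found at least once, so its length should agree with the count-based answer.

```r
library(Biostrings)

# Hypothetical toy data: 'tags' stands in for sread(A), 'subject' for mySeq
tags <- DNAStringSet(c("ACGT", "TTTT", "CGTA", "GGGG"))
subject <- DNAString("ACGTACGTAA")

# Preprocess the constant-width tags once
pdict <- PDict(tags)

# Indices of the tags that occur at least once in the subject
matched <- whichPDict(pdict, subject)
n_which <- length(matched)

# Same number derived from the per-tag hit counts
n_count <- sum(countPDict(pdict, subject) != 0)
```

Here only the first and third toy tags occur in subject, so both n_which and n_count come out as 2.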
Patrick
On 7/23/10 8:28 AM, Ivan Gregoretti wrote:
It works. It produces an answer in under 2 minutes. I will flesh it
out a bit for posterity.
Some slight modifications must be applied. First, mySeq cannot be of
class character. So, it works if I do
countPDict(PDict(sread(A)), as(mySeq, "DNAString"))
Now, that function outputs how many times a tag is found in mySeq. To
compute how many tags match mySeq once or more, I have to do
sum(countPDict(PDict(sread(A)), as(mySeq, "DNAString"))!=0)
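For posterity, here is a small sketch of those two steps on made-up data. The tag set and the sequence string are hypothetical and only stand in for sread(A) and mySeq; the tags must all have the same width, as the 36 bp reads do.

```r
library(Biostrings)

# Hypothetical stand-ins for sread(A) and mySeq
tags <- DNAStringSet(c("ACGT", "TTTT", "CGTA", "GGGG"))
mySeq <- "ACGTACGTAA"                     # plain character string

# Coerce the character string to a DNAString before matching
subject <- as(mySeq, "DNAString")

# One count per tag: how many times each tag occurs in the subject
hits <- countPDict(PDict(tags), subject)

# Number of tags that match once or more
n_matching_tags <- sum(hits != 0)
```

With this toy input, hits is 2, 0, 2, 0 and n_matching_tags is 2.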
By the way, this could have been done with perl or python or any other
tool. However, it helps to learn how to do it efficiently from within
Bioconductor.
Thank you.
Ivan
On Fri, Jul 23, 2010 at 10:56 AM, Patrick Aboyoun<[email protected]> wrote:
Ivan,
How about
countPDict(PDict(sread(A)), mySeq)
Patrick
On 7/23/10 7:45 AM, Ivan Gregoretti wrote:
Hello Patrick,
The idea of vcountPattern is good, but it does not quite work, for two
reasons:
1) mySeq is ~40kb. That is quite big, and vcountPattern() throws the error
vcountPattern(mySeq, sread(A))
Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels, :
patterns with more than 20000 letters are not supported
2) vcountPattern is designed to find a motif (small) contained in a
genome (large), like this
vcountPattern("GCCACCAGGGGGCGC", Mmusculus)
In my case, I have millions of motifs (the 36 bp tags), and I have to
find out whether each one is contained in my single ~40kb sequence. It's
the reverse scenario. So, if I try reversing the arguments, I also get an error:
vcountPattern(sread(A), mySeq)
Error in normargPattern(pattern, subject) :
'pattern' must be a single string or an XString object
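To make the argument roles concrete, here is a small sketch on made-up data (tags and subject are hypothetical stand-ins for sread(A) and mySeq). vcountPattern() takes one pattern and a set of subjects, whereas countPDict(), which is suggested elsewhere in this thread, takes a set of patterns and one subject.

```r
library(Biostrings)

# Hypothetical toy data
tags <- DNAStringSet(c("ACGT", "TTTT", "CGTA"))
subject <- DNAString("ACGTACGTAA")

# vcountPattern(): ONE pattern searched in MANY subjects
# (one count per element of 'tags')
per_subject <- vcountPattern("ACGT", tags)

# countPDict(): MANY patterns searched in ONE subject
# (one count per tag)
per_tag <- countPDict(PDict(tags), subject)
```

Both calls return one integer per element of tags, but they answer opposite questions: per_subject says how often "ACGT" occurs inside each tag, while per_tag says how often each tag occurs inside subject.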
Any more suggestions?
Thank you,
Ivan
sessionInfo()
R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] annotate_1.27.1 AnnotationDbi_1.11.4 Biobase_2.9.0 ShortRead_1.7.9
[5] Rsamtools_1.1.8 lattice_0.18-8 Biostrings_2.17.24 GenomicRanges_1.1.17
[9] IRanges_1.7.12
loaded via a namespace (and not attached):
[1] DBI_0.2-5 grid_2.12.0 hwriter_1.2 RSQLite_0.9-1 xtable_1.5-6
an
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing