Ivan,
Sorry for the evolving answer, but this may prove to be faster:

length(whichPDict(PDict(sread(A)), as(mySeq, "DNAString")))
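
For readers following along, here is a minimal sketch on toy data of why this form should give the same number as summing the non-zero counts. The objects `tags` and the short `mySeq` below are made up for illustration; they stand in for sread(A) and the real ~40 kb sequence, and the tags are assumed to all have the same width, as the default PDict() requires.

```r
library(Biostrings)

# Hypothetical stand-ins: 'tags' plays the role of sread(A),
# 'mySeq' the role of the ~40 kb sequence from the thread.
tags  <- DNAStringSet(c("ACGT", "TTTT", "CGTA", "GGGG"))
mySeq <- DNAString("ACGTACGTTTTT")

# Preprocess the constant-width tags once, then reuse the dictionary.
pdict <- PDict(tags)

# countPDict() returns one match count per tag ...
hits <- countPDict(pdict, mySeq)
nTagsFromCounts <- sum(hits != 0)

# ... while whichPDict() returns only the indices of tags with a match,
# so its length is the number of tags found at least once.
nTagsFromWhich <- length(whichPDict(pdict, mySeq))

stopifnot(nTagsFromCounts == nTagsFromWhich)
```

Whether whichPDict() is actually faster on millions of 36 bp tags is not something this toy example can show; it only checks that the two expressions agree.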


Patrick


On 7/23/10 8:28 AM, Ivan Gregoretti wrote:
It works. It produces an answer in under 2 minutes. I will flesh it
out a bit for posterity.

Some slight modifications must be applied. First, mySeq cannot be of
class character. So, it works if I do

countPDict(PDict(sread(A)), as(mySeq, "DNAString"))

Now, that function outputs how many times a tag is found in mySeq. To
compute how many tags match mySeq once or more, I have to do

sum(countPDict(PDict(sread(A)), as(mySeq, "DNAString"))!=0)


By the way, this could have been done with Perl or Python or any other
tool. However, it helps to learn how to do it efficiently from within
Bioconductor.

Thank you.

Ivan



On Fri, Jul 23, 2010 at 10:56 AM, Patrick Aboyoun <[email protected]> wrote:
Ivan,
How about

countPDict(PDict(sread(A)), mySeq)


Patrick


On 7/23/10 7:45 AM, Ivan Gregoretti wrote:
Hello Patrick,

The idea of vcountPattern is good, but it does not quite work, for two
reasons:

1) mySeq is ~40 kb. That is too long to be used as a pattern, and
vcountPattern() throws the error


vcountPattern(mySeq, sread(A))

Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels,  :
   patterns with more than 20000 letters are not supported

2) vcountPattern is designed to find a motif (small) contained in a
genome (large), like this
vcountPattern("GCCACCAGGGGGCGC", Mmusculus)

In my case, I have millions of motifs (the 36 bp tags), and I have to
find out whether they are contained in my single ~40 kb sequence. It is
like a reverse scenario. So, if I try reversing the arguments, I also
get an error:


vcountPattern(sread(A), mySeq)

Error in normargPattern(pattern, subject) :
   'pattern' must be a single string or an XString object
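
The two argument roles can be seen side by side on a toy example. This is only a sketch under assumptions: `tags` and the short `mySeq` are made-up stand-ins for sread(A) and the real sequence, and the many-patterns case uses the dictionary functions PDict() and countPDict() from Biostrings, which expect tags of constant width by default.

```r
library(Biostrings)

# Hypothetical toy data standing in for the reads and the long sequence.
tags  <- DNAStringSet(c("ACGT", "TTTT", "GGGG"))
mySeq <- DNAString("ACGTACGTTTTT")

# vcountPattern(): ONE short pattern, counted in each of MANY subjects.
perSubject <- vcountPattern("CG", tags)

# countPDict(): MANY patterns (preprocessed into a dictionary),
# counted in ONE subject. This is the "reverse" direction.
perTag <- countPDict(PDict(tags), mySeq)
```

Both results have one entry per tag, but they answer different questions: the first asks how often "CG" occurs inside each tag, the second how often each tag occurs inside mySeq.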

Any more suggestions?

Thank you,

Ivan


sessionInfo()

R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] annotate_1.27.1      AnnotationDbi_1.11.4 Biobase_2.9.0        ShortRead_1.7.9
[5] Rsamtools_1.1.8      lattice_0.18-8       Biostrings_2.17.24   GenomicRanges_1.1.17
[9] IRanges_1.7.12

loaded via a namespace (and not attached):
[1] DBI_0.2-5     grid_2.12.0   hwriter_1.2   RSQLite_0.9-1 xtable_1.5-6




_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing