Re: [Bioc-sig-seq] Applying grep to a large number of tags. (looking for speed)

Ivan Gregoretti Fri, 23 Jul 2010 08:29:24 -0700

It works. It produces and answer in under 2 minutes. I will flesh it
out a bit for posterity.


Some slight modifications must be applied. First, mySeq cannot be of
class character. So, it works if I do

countPDict(PDict(sread(A)), as(mySeq, "DNAString"))

Now, that function outputs how many times a tag is found in mySeq. To
compute how many tags match mySeq once or more, I have to do

sum(countPDict(PDict(sread(A)), as(mySeq, "DNAString"))!=0)


By the way, this could have been done with perl or python or any other
tools. However, it helps to learn how to do it efficiently from within
the Bioconductor.

Thank you.

Ivan



On Fri, Jul 23, 2010 at 10:56 AM, Patrick Aboyoun <[email protected]> wrote:
> Ivan,
> How about
>
> countPDict(PDict(sread(A)), mySeq)
>
>
> Patrick
>
>
> On 7/23/10 7:45 AM, Ivan Gregoretti wrote:
>>
>> Hello Patrick,
>>
>> The idea of vcountPattern is good but it does not quite work for two
>> reasons
>>
>> 1) mySeq is ~40kb. That is quite big and vcountPattern() throws the error
>>
>>
>>>
>>> vcountPattern(mySeq, sread(A))
>>>
>>
>> Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels,  :
>>   patterns with more than 20000 letters are not supported
>>
>> 2) vcountPattern is designed to find a motif (small) contained in a
>> genome (large), like this
>> vcountPattern("GCCACCAGGGGGCGC", Mmusculus)
>>
>> In my case, I have millions of motifs (the 36 bp tags) that I have to
>> find if they are contained in my single ~40kb. Its like a reverse
>> scenario. So, if I try reversing the arguments, I also get an error:
>>
>>
>>>
>>> vcountPattern(sread(A), mySeq)
>>>
>>
>> Error in normargPattern(pattern, subject) :
>>   'pattern' must be a single string or an XString object
>>
>> Any more suggestions?
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>>
>>> sessionInfo()
>>>
>>
>> R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> LC_TIME=en_US.UTF-8
>>  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
>> LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> LC_ADDRESS=C
>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] annotate_1.27.1      AnnotationDbi_1.11.4 Biobase_2.9.0
>> ShortRead_1.7.9
>> [5] Rsamtools_1.1.8      lattice_0.18-8       Biostrings_2.17.24
>> GenomicRanges_1.1.17
>> [9] IRanges_1.7.12
>>
>> loaded via a namespace (and not attached):
>> [1] DBI_0.2-5     grid_2.12.0   hwriter_1.2   RSQLite_0.9-1 xtable_1.5-6
>> an
>>
>
>

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Applying grep to a large number of tags. (looking for speed)

Reply via email to