Hi everybody,
The function matchPWM() that is invoked like this
matchPWM(pwm, subject, min.score="80%", ...)
provides a very fast way of identifying the positions in 'subject' that
match a Position Weight Matrix with a minimum score of 80 percent (relative
to the maximum score).
The result looks like this
matchPWM(pwmMa,Mmusculus[['chr19']],min.score="90%")
Views on a 61342430-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNN...TATCTCCGAGATGCACCCGCCTGCTTGC
views:
start end width
[1] 3499390 3499407 18 [TTTTGGTCATGGTGCTTC]
[2] 3847991 3848008 18 [TTGAATTCCTGATCCTTC]
[3] 4433615 4433632 18 [TTTGATTCCTGATTCTTC]
[4] 5200625 5200642 18 [TCTTAGTCATGGTGCTTT]
[5] 6475713 6475730 18 [TCTTGTTTCTGCTGCTTC]
[6] 8899734 8899751 18 [TTTCATTCCTGTTTCTTT]
[7] 9435046 9435063 18 [TTGTGTTCCTGCTGCTTC]
[8] 9490297 9490314 18 [TTTTATTCATGCTGTTTC]
[9] 10148754 10148771 18 [TCATGGTCATGGTGCTTC]
... ... ... ... ...
[103] 53870063 53870080 18 [TCAAAGTCATGCTCCTTC]
[104] 55886021 55886038 18 [TCTTTTTCCTGGTACTTC]
[105] 56158096 56158113 18 [GCTGATTCTTGCTGCTTC]
[106] 56585172 56585189 18 [TTTCATTCCAGTTGCTTA]
[107] 56905278 56905295 18 [TTCTGTTCCTGGTACTTC]
[108] 57831404 57831421 18 [TTTTATTTATGTTACTTC]
[109] 58794481 58794498 18 [TCGAATTCCTGCTGCTTT]
[110] 59452136 59452153 18 [TTGCATTTGTGTTACTTC]
[111] 59597303 59597320 18 [TTTTAGTCATGGTGTTTC]
My question:
How do you get matchPWM to report the score?
Why that is important:
The function PWMscoreStartingAt() was created to answer that question. Now,
for some PWMs, running matchPWM with a permissive min.score like "80%"
results in half a million matches. The problem is that PWMscoreStartingAt()
can compute the scores one match at a time. In other words, it takes a
DNAString but not a DNAStringSet.
Rather than hoping for a vercotised version of PWMscoreStartingAt(), I guess
that it is more practical to try to retrieve the score computed by
matchPWM().
Any suggestions?
Thank you,
Ivan
> sessionInfo()
R version 2.13.0 Under development (unstable) (2010-12-02 r53747)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C LC_TIME=en_US.utf8
[4] LC_COLLATE=en_US.utf8 LC_MONETARY=C
LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
base
other attached packages:
[1] multicore_0.1-4 lattice_0.19-17
[3] BSgenome.Mmusculus.UCSC.mm9_1.3.17 BSgenome_1.19.3
[5] cosmo_1.17.0 seqLogo_1.17.0
[7] GenomicRanges_1.3.18 Biostrings_2.19.9
[9] IRanges_1.9.22
loaded via a namespace (and not attached):
[1] Biobase_2.11.8 tools_2.13.0
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing