Re: [Bioc-sig-seq] matchPDict, fixed=FALSE; "walk_tb_nonfixed_subject(): implement me"

Hervé Pagès Mon, 19 Jul 2010 12:15:32 -0700

On 07/19/2010 12:10 PM, Hervé Pagès wrote:

Hi Ludo,


matchPDict() and family now support fixed=FALSE by default (i.e.
when the PDict object was used with algo="ACtree2"):

                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                        was preprocessed with algo="ACtree2", which
is the default

Cheers,
H.

 > library(Biostrings)
 > dict <- DNAStringSet(c("TTC", "CTT"))
 > pdict <- PDict(dict)
 > subject <- DNAString("CYTCACTTC")
 > as.list(matchPDict(pdict, subject, fixed="pattern"))
[[1]]
IRanges of length 2
start end width
[1] 2 4 3
[2] 7 9 3

[[2]]
IRanges of length 2
start end width
[1] 1 3 3
[2] 6 8 3

Matching the probes of an Affy chip to the Human chr1 sequence
previously altered by injecting all known SNPs from dbSNP:

library(hgu95av2probe)
probes <- DNAStringSet(hgu95av2probe)
pdict <- PDict(probes)

library(BSgenome.Hsapiens.UCSC.hg19)
library(SNPlocs.Hsapiens.dbSNP.20100427)
SNP_Hsapiens <- injectSNPs(Hsapiens, "SNPlocs.Hsapiens.dbSNP.20100427")
SNP_subject <- SNP_Hsapiens$chr1

This takes about 2m40s on my laptop (and uses about 1.5GB of memory):

mi1 <- matchPDict(pdict, SNP_subject, fixed="pattern")

Note that the fixed="pattern" feature only works if the density of
ambiguity letters in the subject is not too high. For your
particular situation where it contains a stretch of 20 N's it might
not work (hard to say because the limit also depends on the number
of nodes in the Aho-Corasick tree, which itself depends on the number
and lengths of your reads).

Create a 550bp subject with a 20N stretch in the middle:
subject <- DNAString(paste(c(sample(DNA_BASES, 265, replace=TRUE),
rep.int("N", 20),
sample(DNA_BASES, 265, replace=TRUE)),
collapse=""))

Create 201800 80bp reads:
reads <- narrow(start=11,
xscat(probes, reverseComplement(probes),
rev(probes), rev(reverseComplement(probes))),
width=80)

Try to align all the reads at once:

 > pdict <- PDict(reads)
 > mi2 <- matchPDict(pdict, subject, fixed="pattern")
Error in .match.PDict3Parts.XString(pd...@threeparts, subject,
max.mismatch, :
too many IUPAC ambiguity letters in 'subject'

But with a smaller dictionary:

 > pdict <- PDict(reads[1:100000])
 > mi3 <- matchPDict(pdict, subject, fixed="pattern")

...it works (and takes about 7s on my laptop).

Unfortunately, while working on this, I discovered that the
'fixed=FALSE' feature for matchPDict() and family was broken so
far (i.e. for Biostrings < 2.16.8): it would miss some matches
under certain circumstances.
I spent some good amount of time testing the new implementation
(my test is still running, should take about 5 days to complete)
by comparing the matches obtained with matchPDict() with those
obtained with calling matchPattern() within a loop (for the
latter the shift-or algo is used which is a completely different
approach/implementation, it's also much slower). The dict used
for this test was hgu95av2probe and the subject was Human chr1
(hg19), like in the 1st example above. The 2 methods are producing
exactly the same set of matches so far (after checking the first
126000 probes in hgu95av2probe).

This is available in the latest release version of Biostrings
(2.16.8). Please update your packages as explained here:

http://bioconductor.org/docs/install/

Let me know if you have any questions.

Cheers,
H.

On 06/25/2010 11:03 AM, Hervé Pagès wrote:

Hi Ludo,

Yes matchPDict() used to support fixed=FALSE. It still does, but only
when the PDict object is made using the old implementation of the
Aho-Corasick algo ('algo="ACtree"'):

> pdict <- PDict(c("ACCT", "GACC", "CCCT", "CCCA"), algo="ACtree")
> matchPDict(pdict, DNAString("GNCCT"), fixed="pattern")[[3]]
IRanges of length 1
start end width
[1] 2 5 4

The "ACtree" algo has been superseded by the "ACtree2" algo, a faster
and more memory efficient implementation of the same algo that uses a
different internal representation than "ACtree" for the Aho-Corasick
tree.

The 'fixed=TRUE' (or 'fixed="pattern"') option is not yet supported
for PDict objects built with the new algo. I'll add this ASAP. Thanks
for the reminder!

Cheers,
H.

On 06/25/2010 03:46 AM, Ludo Pagie wrote:


hi all,

I'm trying to match 80bp reads to a construct, a sequence of +/-
550bp. The construct contains a strecth of N's, representing a
stretch of 20 random nucleotides.

I constructed a pdict from the reads, and a DNAString from the
construct. When I run matchPDict with fixed=TRUE, all goes fine
and I get 1.2M matches.

construct_mindex<- matchPDict(pdict, DNAString(construct),
max.mismatch=3)
sum(countIndex(construct_mindex))

[1] 1280283


With fixed=FALSE I get the following error:

construct_mindex<- matchPDict(pdict, DNAString(construct),
max.mismatch=3, fixed=FALSE)

Error in .match.PDict3Parts.XString(pd...@threeparts, subject,
max.mismatch, :
walk_tb_nonfixed_subject(): implement me

Is there a way around this non-implemented function? Or any
chance it will be implemented soon? Or am I missing something.

If you need more background let me know.

Ludo

sessionInfo()

R version 2.12.0 Under development (unstable) (2010-06-17
r52313)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets
methods base

other attached packages:
[1] ShortRead_1.7.7 Rsamtools_1.1.7
lattice_0.18-8
[4] GenomicRanges_1.1.12 Biostrings_2.17.7 IRanges_1.7.7

loaded via a namespace (and not attached):
[1] Biobase_2.9.0 grid_2.12.0 hwriter_1.2 tools_2.12.0

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] matchPDict, fixed=FALSE; "walk_tb_nonfixed_subject(): implement me"

Reply via email to