Hi Herve, this is great. I am going to take your code and benchmark on my system. In the meantime, I wrote a python version, which did outperform my R versions, but now we might much more competitive.
thanks again, jim On Mar 20, 2010, at 4:14 PM, Hervé Pagès wrote: > I need to correct this a little bit. Yes sample() works on DNAString > and DNAStringSet objects (and on any object with a "length" and a "[" > method). So randomReads() is better implemented like this: > > randomReads <- function(nread, readlength) > { > totalsize <- nread * readlength > bsize <- 500000L > blocks <- breakInBlocks(totalsize, bsize) > tmp <- lapply(width(blocks), > function(bs) > sample(DNAString(paste(DNA_BASES, collapse="")), > bs, replace=TRUE)) > ans_subject <- do.call(c, tmp) > successiveViews(ans_subject, rep.int(readlength, nread)) > } > > It's about 3x faster than my previous version. > > Cheers, > H. > > > Hervé Pagès wrote: >> Hi Jim, >> This is an interesting use case and it reveals some of the limitations >> of the current Biostrings infrastructure. For example some operations >> like sample() don't work natively on DNAString or DNAStringSet objects. >> Having this capability would certainly help here but there are other >> bits that would need to be added to make things even easier, faster, >> and more memory efficient than the workaround I propose below. >> Writing the code below (and get it right, I think) took me a fair >> amount of time but was a really helpful exercise because it helped >> me identify/understand the bottlenecks. It still does too many >> coercions between the XString, character, and raw representations, >> and some of them could (and should) definitely be avoided if we had >> those missing bits in Biostrings. >> So for now one option is to do the sampling in the character space >> which requires that you start converting your big DNAStringSet object >> back to a standard character vector. Then I see that you use >> strsplit( , NULL) on this character vector which generates a huge >> object (8G for your 14 millions 75-mers on a 64 bit machine, because >> each 1-letter string here will take 8 bytes in memory). >> Using a "break-process-recombine" approach i.e. breaking the big >> object into smaller pieces, processing them separately, and then >> recombining the results can help reduce the memory footprint >> dramatically without impacting too much the speed. In the end, I'm >> able to mutate your 14 million 75-mers with a substitution matrix >> that is cycle dependent, in about 3 minutes and with about 3.8Gb >> of RAM (I did this on my laptop). >> But first I'll start with an illustration of the >> "break-process-recombine" approach for the reads generation. >> library(Biostrings) >> breakInBlocks <- function(totalsize, bsize) >> { >> quot <- totalsize %/% bsize >> ans_width <- rep.int(bsize, quot) >> rem <- totalsize %% bsize >> if (rem > 0L) >> ans_width <- c(ans_width, rem) >> IRanges(end=cumsum(ans_width), width=ans_width) >> } >> randomReads <- function(nread, readlength) >> { >> totalsize <- nread * readlength >> bsize <- 500000L >> blocks <- breakInBlocks(totalsize, bsize) >> tmp <- lapply(width(blocks), >> function(bs) >> DNAString(paste(sample(DNA_BASES, replace=TRUE, size=bs), >> collapse=""))) >> ans_subject <- do.call(c, tmp) >> successiveViews(ans_subject, rep.int(readlength, nread)) >> } >> Then generating 2 million 75-mers with 'randomReads(2000000L, 75L)' >> takes 37 sec. and requires 343Mb of memory on my system whereas >> without the "break-process-recombine" approach it requires 10x >> more memory! >> Generate the 14 million 75-mers: >> > system.time(reads <- randomReads(14000000L, 75L)) >> user system elapsed >> 208.817 2.356 211.534 >> > gc() >> used (Mb) gc trigger (Mb) max used (Mb) >> Ncells 924564 49.4 1590760 85.0 1590760 85.0 >> Vcells 145696189 1111.6 353180113 2694.6 332717020 2538.5 >> The same approach can be used to mutate the reads. In a first pass >> I address the simpler case where the substitution matrix is cycle >> independent, which allows for even more optimization. One additional >> optimization here is to use a raw vector instead of a vector of 1-letter >> character strings as the input and output of sample(): >> ## Here 's' must be a single character string. >> mutateString <- function(s, SubM) >> { >> bytes <- charToRaw(s) >> colbytes <- charToRaw(paste(colnames(SubM), collapse="")) >> for (letter in rownames(SubM)) { >> b <- charToRaw(letter) >> idx <- bytes == b >> bytes[bytes == b] <- sample(colbytes, sum(idx), >> replace=TRUE, prob=SubM[letter, ]) >> } >> rawToChar(bytes) >> } >> ## Here 'v' must be a Views object with a DNAString subject. >> mutateReads <- function(v, SubM) >> { >> totalsize <- length(subject(v)) >> bsize <- 1000000L >> blocks <- breakInBlocks(totalsize, bsize) >> tmp <- lapply(seq_len(length(blocks)), >> function(i) { >> s <- subseq(subject(v), >> start=start(blocks)[i], width=width(blocks)[i]) >> DNAString(mutateString(as.character(s), SubM)) >> }) >> ans_subject <- do.call(c, tmp) >> successiveViews(ans_subject, width(v)) >> } >> SubM <- matrix(0.0033, nrow=4, ncol=4) >> diag(SubM) <- 1 - sum(SubM[1, 2:4]) >> rownames(SubM) <- colnames(SubM) <- DNA_BASES >> > system.time(mreads <- mutateReads(reads, SubM)) >> user system elapsed >> 127.828 1.584 129.674 >> > gc() >> used (Mb) gc trigger (Mb) max used (Mb) >> Ncells 925420 49.5 1590760 85.0 1590760 85.0 >> Vcells 290946618 2219.8 450911592 3440.2 436207956 3328.1 >> Finally the more general case where the substitution matrix is cycle >> dependent: >> ## Here 'm' must be a raw matrix object with 1 read per row >> ## and 'SubMs' a tridimensional numeric array where 'SubMs[ , , i]' >> ## is the substitution matrix for cycle 'i'. >> mutateRawMatrix <- function(m, SubMs) >> { >> for (i in seq_len(ncol(m))) { >> bytes <- m[ , i] >> SubM <- SubMs[ , , i] >> colbytes <- charToRaw(paste(colnames(SubM), collapse="")) >> for (letter in rownames(SubM)) { >> b <- charToRaw(letter) >> idx <- bytes == b >> bytes[bytes == b] <- sample(colbytes, sum(idx), >> replace=TRUE, prob=SubM[letter, ]) >> } >> m[ , i] <- bytes >> } >> m >> } >> mkSubM <- function(mutprob=0) >> { >> SubM <- matrix(mutprob, nrow=4, ncol=4) >> diag(SubM) <- 1 - sum(SubM[1, 2:4]) >> rownames(SubM) <- colnames(SubM) <- DNA_BASES >> SubM >> } >> SubMs <- array( >> do.call(c, lapply(1:75, function(i) mkSubM(0.0033+(i-1)*0.002))), >> dim=c(4,4,75), dimnames=list(DNA_BASES, DNA_BASES, NULL)) >> mutateReads2 <- function(v, SubMs) >> { >> if (!isConstant(width(v))) >> stop("'v' must be rectangular") >> vwidth <- width(v)[1L] >> if (length(dim(SubMs)) != 3L || dim(SubMs)[3L] < vwidth) >> stop("invalid 'SubMs' (wrong dimensions)") >> totalsize <- length(v) >> bsize <- 250000L >> blocks <- breakInBlocks(totalsize, bsize) >> tmp <- lapply(seq_len(length(blocks)), >> function(i) { >> vv <- v[as.integer(blocks[i])] >> s <- subseq(subject(v), >> start=start(vv)[1L], end=end(vv)[length(vv)]) >> m <- matrix(charToRaw(as.character(s)), >> ncol=vwidth, byrow=TRUE) >> DNAString(rawToChar(t(mutateRawMatrix(m, SubMs)))) >> }) >> ans_subject <- do.call(c, tmp) >> successiveViews(ans_subject, width(v)) >> } >> > system.time(mreads2 <- mutateReads2(reads, SubMs)) >> user system elapsed >> 178.043 1.812 180.331 >> > gc() >> used (Mb) gc trigger (Mb) max used (Mb) >> Ncells 955107 51.1 1590760 85.0 1590760 85.0 >> Vcells 293551309 2239.7 497294029 3794.1 491180530 3747.5 >> Cheers, >> H. >> bull...@stat.berkeley.edu wrote: >>> Hi All, I have been trying (rather futilely) to optimize the performance >>> of some simulations which I am trying to do. Briefly, I have a set of >>> reads which are sampled from a genome. These reads are stored in a >>> DNAStringSet. The next thing I want to do is apply an error model to >>> 'mutate' the reads. >>> >>> Below, are three implementations for the same procedure. The first is very >>> slow and doesn't leverage vectorization. The second is faster, leverages >>> vectorization, but is done in a way that is counter-intuitive (one thing >>> to note is that in practice the substitution matrices might be cycle >>> dependent, hence why I walk over the columns). The third way seemed good, >>> but there is so much conversion that for 14,000,000 reads I use > 20 Gigs >>> of memory? Naively, that many reads should be about 1 Gig. >>> >>> I am looking for low-level Biostrings functionality. something like >>> applyString, or applyBase. I cannot fathom why this should take so long or >>> consume so much memory. >>> >>> thanks, jim >>> >>> ## the read-set (note that I want to do this for 1e7 reads) >>> BASES <- c("A", "C", "G", "T") >>> reads <- DNAStringSet(sapply(1:2000, function(i) { >>> paste(sample(BASES, replace = T, size = 75), >>> collapse = "") >>> })) >>> >>> SubM <- structure(c(0.99, 0.0033, 0.0033, 0.0033, 0.0033, 0.99, >>> 0.0033, 0.0033, 0.0033, 0.0033, 0.99, 0.0033, >>> 0.0033, 0.0033, 0.0033, 0.99), >>> .Dim = c(4L, 4L), .Dimnames = list(BASES, BASES)) >>> >>> f.naive <- function(ds) { >>> DNAStringSet(sapply(strsplit(as.character(ds), ""), function(read) { >>> paste(sapply(read, function(b) sample(BASES, prob = SubM[b,], size = >>> 1)), collapse = "") >>> })) >>> } >>> >>> f.lessSo <- function(ds) { >>> cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T', >>> 'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T') >>> >>> V <- do.call(rbind, strsplit(as.character(ds), split = "")) >>> V <- apply(do.call(cbind, lapply(1:ncol(V), function(i) { >>> sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace = TRUE) >>> })), 1, paste, collapse = "") >>> >>> DNAStringSet(V) >>> } >>> >>> f.withViews <- function(ds) { >>> cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T', >>> 'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T') >>> >>> V <- do.call(rbind, strsplit(as.character(ds), split = "")) >>> V <- DNAString(paste(as.character(t(do.call(cbind, lapply(1:ncol(V), >>> function(i) { >>> sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace = TRUE) >>> })))), collapse = "")) >>> DNAStringSet(Views(V, start = seq(1, length(V), by = 75), >>> end = seq(75, length(V), by = 75))) >>> } >>> >>> >>> system.time(r <- f.naive(reads)) >>> ### user system elapsed >>> ### 2.820 0.000 2.822 >>> >>> system.time(r <- f.lessSo(reads)) >>> ### user system elapsed >>> ### 0.190 0.000 0.189 >>> >>> system.time(r <- f.withViews(reads)) >>> ### user system elapsed >>> ### 0.090 0.010 0.097 >>> >>> _______________________________________________ >>> Bioc-sig-sequencing mailing list >>> Bioc-sig-sequencing@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing