There was a numerical typo below: I said the sample sizes were 5 and 10 thousand, but I should have said 10 and 20 thousand (the size argument to sample()).
Also, I timed cover_per_2 and _3 for size 200,000 and got times of 338 and 0.12 seconds, respectively. Growing the problem by a factor of 10 made cover_per_2 use about 100 times more time and cover_per_3 about 10 times more (the times are too small to get an accurate ratio).

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of William Dunlap
> Sent: Friday, November 05, 2010 9:58 AM
> To: Changbin Du
> Cc: [email protected]
> Subject: Re: [R] how to work with long vectors
>
> The following cover_per_3 uses sorting to solve
> the problem more quickly. It still has room
> for improvement.
>
> cover_per_3 <- function (data)
> {
>     n <- length(data)
>     o <- rev(order(data))
>     sdata <- data[o]
>     r <- rle(sdata)$lengths
>     output <- numeric(n)
>     output[o] <- rep(cumsum(r), r)
>     100 * output/n
> }
>
> (The ecdf function would probably also do the
> job quickly.)
>
> When trying to work on problems like this I find
> it most fruitful to work on smaller datasets and
> see how the time grows with the size of the data,
> instead of seeing how many days it takes on a huge
> dataset. E.g., the following compares times for
> your original function, Phil Spector's simple cleanup
> of your function, and the sort-based approach for
> vectors of length 5 and 10 thousand.
> > z <- sample(5e3, size=1e4, replace=TRUE)
> > print(system.time(v <- cover_per(z)))
>    user  system elapsed
>   38.21    0.00   38.41
> > print(system.time(v_2 <- cover_per_2(z)))
>    user  system elapsed
>    0.86    0.00    0.86
> > print(system.time(v_3 <- cover_per_3(z)))
>    user  system elapsed
>       0       0       0
> > identical(v_3, v)
> [1] TRUE
> > z <- sample(1e4, size=2e4, replace=TRUE)
> > print(system.time(v <- cover_per(z)))
>    user  system elapsed
>  158.48    0.07  159.31
> > print(system.time(v_2 <- cover_per_2(z)))
>    user  system elapsed
>    3.23    0.00    3.25
> > print(system.time(v_3 <- cover_per_3(z)))
>    user  system elapsed
>    0.02    0.00    0.02
> > identical(v_3, v)
> [1] TRUE
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Changbin Du
> > Sent: Friday, November 05, 2010 9:14 AM
> > To: Phil Spector
> > Cc: <[email protected]>
> > Subject: Re: [R] how to work with long vectors
> >
> > HI, Phil,
> >
> > I used the following code and ran it overnight for 15 hours;
> > this morning I stopped it. It seems it is still not efficient.
> >
> > > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", sep="\t", skip=0, header=F, fill=T)
> > > names(matt)<-c("id","reads")
> > > dim(matt)
> > [1] 3384766       2
> > > cover<-matt$reads
> > > cover_per_2 <- function(data){
> > +   l = length(data)
> > +   output = numeric(l)
> > +   for(i in 1:l) output[i] = sum(data >= data[i])
> > +   100 * output / l
> > + }
> > > result3<-cover_per_2(cover)
> >
> > On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du <[email protected]> wrote:
> >
> > > Thanks Phil, that is great! I WILL try this and let you know how it goes.
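Bill's parenthetical remark above that "the ecdf function would probably also do the job quickly" can be made concrete. The following is a sketch of that idea, not code from the thread; the trick of negating the data (so that the fraction of elements <= -x becomes the fraction >= x) is my assumption about what he had in mind:

```r
# Sort-based cover_per_3, exactly as posted above, for comparison.
cover_per_3 <- function(data) {
  n <- length(data)
  o <- rev(order(data))
  sdata <- data[o]
  r <- rle(sdata)$lengths
  output <- numeric(n)
  output[o] <- rep(cumsum(r), r)
  100 * output / n
}

# ecdf-based variant (a sketch): ecdf(-data)(-x) is the fraction of
# elements of data that are >= x, so this is the same percentage coverage.
cover_per_ecdf <- function(data) 100 * ecdf(-data)(-data)

z <- sample(1e4, size = 2e4, replace = TRUE)
all.equal(cover_per_ecdf(z), cover_per_3(z))  # TRUE
```

The two differ only in the order of the multiplication and division, so all.equal (rather than identical) is the right comparison.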
> > > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector <[email protected]> wrote:
> > >
> > >> Changbin -
> > >>    Does
> > >>
> > >>      100 * sapply(matt$reads, function(x) sum(matt$reads >= x)) / length(matt$reads)
> > >>
> > >> give what you want?
> > >>
> > >> By the way, if you want to use a loop (there's nothing wrong with that),
> > >> then try to avoid the most common mistake that people make with loops in R:
> > >> having your result grow inside the loop. Here's a better way to use a loop
> > >> to solve your problem:
> > >>
> > >> cover_per_1 <- function(data){
> > >>   l = length(data)
> > >>   output = numeric(l)
> > >>   for(i in 1:l) output[i] = 100 * sum(ifelse(data >= data[i], 1, 0)) / length(data)
> > >>   output
> > >> }
> > >>
> > >> Using some random data, and comparing to your original cover_per function:
> > >>
> > >> dat = rnorm(1000)
> > >>> system.time(one <- cover_per(dat))
> > >>    user  system elapsed
> > >>   0.816   0.000   0.824
> > >>
> > >>> system.time(two <- cover_per_1(dat))
> > >>    user  system elapsed
> > >>   0.792   0.000   0.805
> > >>
> > >> Not that big a speedup, but it does increase quite a bit as the problem
> > >> gets larger.
> > >>
> > >> There are two obvious ways to speed up your function:
> > >>   1) Eliminate the ifelse function, since automatic coercion from
> > >>      logical to numeric does the same thing.
> > >>   2) Multiply by 100 and divide by the length outside the loop:
> > >>
> > >> cover_per_2 <- function(data){
> > >>   l = length(data)
> > >>   output = numeric(l)
> > >>   for(i in 1:l) output[i] = sum(data >= data[i])
> > >>   100 * output / l
> > >> }
> > >>
> > >>> system.time(three <- cover_per_2(dat))
> > >>    user  system elapsed
> > >>   0.024   0.000   0.027
> > >>
> > >> That makes the loop just about equivalent to the sapply solution:
> > >>
> > >>> system.time(four <- 100 * sapply(dat, function(x) sum(dat >= x)) / length(dat))
> > >>    user  system elapsed
> > >>   0.024   0.000   0.026
> > >>
> > >>                                     - Phil Spector
> > >>                                       Statistical Computing Facility
> > >>                                       Department of Statistics
> > >>                                       UC Berkeley
> > >>                                       [email protected]
> > >>
> > >> On Thu, 4 Nov 2010, Changbin Du wrote:
> > >>
> > >>> HI, Dear R community,
> > >>>
> > >>> I have one data set like this. What I want to do is to calculate the
> > >>> cumulative coverage. The following code works for a small data set (#rows =
> > >>> 100), but when fed the whole data set it was still running after 24 hours.
> > >>> Can someone give some suggestions for long vectors?
> > >>>
> > >>> id           reads
> > >>> Contig79:1       4
> > >>> Contig79:2       8
> > >>> Contig79:3      13
> > >>> Contig79:4      14
> > >>> Contig79:5      17
> > >>> Contig79:6      20
> > >>> Contig79:7      25
> > >>> Contig79:8      27
> > >>> Contig79:9      32
> > >>> Contig79:10     33
> > >>> Contig79:11     34
> > >>>
> > >>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > >>> sep="\t", skip=0, header=F, fill=T)
> > >>> dim(matt)
> > >>> [1] 3384766 2
> > >>>
> > >>> matt_plot<-function(matt, outputfile) {
> > >>> names(matt)<-c("id","reads")
> > >>>
> > >>> cover<-matt$reads
> > >>>
> > >>> # calculate the cumulative coverage.
> > >>> + cover_per <- function(data) {
> > >>> +   output <- numeric(0)
> > >>> +   for (i in data) {
> > >>> +     x <- 100 * sum(ifelse(data >= i, 1, 0)) / length(data)
> > >>> +     output <- c(output, x)
> > >>> +   }
> > >>> +   return(output)
> > >>> + }
> > >>>
> > >>> result <- cover_per(cover)
> > >>>
> > >>> Thanks so much!
> > >>>
> > >>> --
> > >>> Sincerely,
> > >>> Changbin
> > >>>
> > >>> ______________________________________________
> > >>> [email protected] mailing list
> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>> PLEASE do read the posting guide
> > >>> http://www.R-project.org/posting-guide.html
> > >>> and provide commented, minimal, self-contained, reproducible code.
> > >
> > > --
> > > Sincerely,
> > > Changbin
> > >
> > > Changbin Du
> > > DOE Joint Genome Institute
> > > Bldg 400 Rm 457
> > > 2800 Mitchell Dr
> > > Walnut Creek, CA 94598
> > > Phone: 925-927-2856
> >
> > --
> > Sincerely,
> > Changbin
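For completeness, here is one more vectorized alternative that nobody in the thread mentions. It is a sketch, not part of the original discussion: it relies on rank() with ties.method = "min" assigning each element one more than the number of strictly smaller elements, so that n + 1 - rank is exactly sum(data >= data[i]):

```r
# Loop version from the thread, for reference.
cover_per_2 <- function(data) {
  l <- length(data)
  output <- numeric(l)
  for (i in 1:l) output[i] <- sum(data >= data[i])
  100 * output / l
}

# rank-based one-liner (a sketch): with ties.method = "min",
# n + 1 - rank(data) counts the elements >= each data[i], with no loop.
cover_per_rank <- function(data) {
  n <- length(data)
  100 * (n + 1 - rank(data, ties.method = "min")) / n
}

z <- sample(1e4, size = 2e4, replace = TRUE)
all.equal(cover_per_rank(z), cover_per_2(z))  # TRUE
```

Like the sort-based cover_per_3, this does O(n log n) work instead of the loop's O(n^2), and it handles ties correctly because tied elements all receive the minimum rank of their group.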

