Thanks Phil, that is great! I WILL try this and let you know how it goes.
On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector <[email protected]>wrote: > Changbin - > Does > > 100 * sapply(matt$reads,function(x)sum(matt$reads >= > x))/length(matt$reads) > > give what you want? > > By the way, if you want to use a loop (there's nothing wrong with that), > then try to avoid the most common mistake that people make with loops in R: > having your result grow inside the loop. Here's a better way to use a loop > to solve your problem: > > cover_per_1 <- function(data){ > l = length(data) > output = numeric(l) > for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1, > 0))/length(data) > output > } > > Using some random data, and comparing to your original cover_per function: > > dat = rnorm(1000) >> system.time(one <- cover_per(dat)) >> > user system elapsed > 0.816 0.000 0.824 > >> system.time(two <- cover_per_1(dat)) >> > user system elapsed > 0.792 0.000 0.805 > > Not that big a speedup, but it does increase quite a bit as the problem > gets > larger. > > There are two obvious ways to speed up your function: > 1) Eliminate the ifelse function, since automatic coersion from > logical to numeric does the same thing. > 2) Multiply by 100 and divide by the length outside the loop: > > cover_per_2 <- function(data){ > l = length(data) > output = numeric(l) > for(i in 1:l)output[i] = sum(data >= data[i]) > 100 * output / l > } > > system.time(three <- cover_per_2(dat)) >> > user system elapsed > 0.024 0.000 0.027 > > That makes the loop just about equivalent to the sapply solution: > > system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat)) >> > user system elapsed > 0.024 0.000 0.026 > > - Phil Spector > Statistical Computing Facility > Department of Statistics > UC Berkeley > [email protected] > > > > > > > > > > On Thu, 4 Nov 2010, Changbin Du wrote: > > HI, Dear R community, >> >> I have one data set like this, What I want to do is to calculate the >> cumulative coverage. The following codes works for small data set (#rows = >> 100), but when feed the whole data set, it still running after 24 hours. >> Can someone give some suggestions for long vector? >> >> id reads >> Contig79:1 4 >> Contig79:2 8 >> Contig79:3 13 >> Contig79:4 14 >> Contig79:5 17 >> Contig79:6 20 >> Contig79:7 25 >> Contig79:8 27 >> Contig79:9 32 >> Contig79:10 33 >> Contig79:11 34 >> >> >> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", >> sep="\t", skip=0, header=F,fill=T) # >> dim(matt) >> [1] 3384766 2 >> >> matt_plot<-function(matt, outputfile) { >> names(matt)<-c("id","reads") >> >> cover<-matt$reads >> >> >> #calculate the cumulative coverage. >> + cover_per<-function (data) { >> + output<-numeric(0) >> + for (i in data) { >> + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data)) >> + output<-c(output, x) >> + } >> + return(output) >> + } >> >> >> result<-cover_per(cover) >> >> >> Thanks so much! >> >> >> -- >> Sincerely, >> Changbin >> -- >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> [email protected] mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >> -- Sincerely, Changbin -- Changbin Du DOE Joint Genome Institute Bldg 400 Rm 457 2800 Mitchell Dr Walnut Creet, CA 94598 Phone: 925-927-2856 [[alternative HTML version deleted]] ______________________________________________ [email protected] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

