Thanks Phil, that is great! I WILL try this and let you know how it goes.


On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector <[email protected]>wrote:

> Changbin -
>   Does
>
>    100 * sapply(matt$reads,function(x)sum(matt$reads >=
> x))/length(matt$reads)
>
> give what you want?
>
>    By the way, if you want to use a loop (there's nothing wrong with that),
> then try to avoid the most common mistake that people make with loops in R:
> having your result grow inside the loop.  Here's a better way to use a loop
> to solve your problem:
>
> cover_per_1 <- function(data){
>   l = length(data)
>   output = numeric(l)
>   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1,
> 0))/length(data)
>   output
> }
>
> Using some random data, and comparing to your original cover_per function:
>
>  dat = rnorm(1000)
>> system.time(one <- cover_per(dat))
>>
>   user  system elapsed
>  0.816   0.000   0.824
>
>> system.time(two <- cover_per_1(dat))
>>
>   user  system elapsed
>  0.792   0.000   0.805
>
> Not that big a speedup, but it does increase quite a bit as the problem
> gets
> larger.
>
> There are two obvious ways to speed up your function:
>   1)  Eliminate the ifelse function, since automatic coersion from
>       logical to numeric does the same thing.
>   2)  Multiply by 100 and divide by the length outside the loop:
>
> cover_per_2 <- function(data){
>   l = length(data)
>   output = numeric(l)
>   for(i in 1:l)output[i] = sum(data >= data[i])
>   100 * output / l
> }
>
>  system.time(three <- cover_per_2(dat))
>>
>   user  system elapsed
>  0.024   0.000   0.027
>
> That makes the loop just about equivalent to the sapply solution:
>
>  system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat))
>>
>   user  system elapsed
>  0.024   0.000   0.026
>
>                                        - Phil Spector
>                                         Statistical Computing Facility
>                                         Department of Statistics
>                                         UC Berkeley
>                                         [email protected]
>
>
>
>
>
>
>
>
>
> On Thu, 4 Nov 2010, Changbin Du wrote:
>
>  HI, Dear R community,
>>
>> I have one data set like this,  What I want to do is to calculate the
>> cumulative coverage. The following codes works for small data set (#rows =
>> 100), but when feed the whole data set,  it still running after 24 hours.
>> Can someone give some suggestions for long vector?
>>
>> id                reads
>> Contig79:1    4
>> Contig79:2    8
>> Contig79:3    13
>> Contig79:4    14
>> Contig79:5    17
>> Contig79:6    20
>> Contig79:7    25
>> Contig79:8    27
>> Contig79:9    32
>> Contig79:10    33
>> Contig79:11    34
>>
>>
>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>> sep="\t", skip=0, header=F,fill=T) #
>> dim(matt)
>> [1] 3384766       2
>>
>> matt_plot<-function(matt, outputfile) {
>> names(matt)<-c("id","reads")
>>
>> cover<-matt$reads
>>
>>
>> #calculate the cumulative coverage.
>> + cover_per<-function (data) {
>> + output<-numeric(0)
>> + for (i in data) {
>> +           x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>> +           output<-c(output, x)
>> +                 }
>> + return(output)
>> + }
>>
>>
>> result<-cover_per(cover)
>>
>>
>> Thanks so much!
>>
>>
>> --
>> Sincerely,
>> Changbin
>> --
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [email protected] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>


-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creet, CA 94598
Phone: 925-927-2856

        [[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to