HI, Phil,

I used the following codes and run it overnight for 15 hours, this morning,
I stopped it. It seems it is still not efficient.


>
matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
> names(matt)<-c("id","reads")

> dim(matt)
[1] 3384766       2

>  cover<-matt$reads

> cover_per_2 <- function(data){
+   l = length(data)
+   output = numeric(l)
+   for(i in 1:l)output[i] = sum(data >= data[i])
+   100 * output / l
+ }

> result3<-cover_per_2(cover)












On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du <[email protected]> wrote:

> Thanks Phil, that is great! I WILL try this and let you know how it goes.
>
>
>
>
> On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector 
> <[email protected]>wrote:
>
>> Changbin -
>>   Does
>>
>>    100 * sapply(matt$reads,function(x)sum(matt$reads >=
>> x))/length(matt$reads)
>>
>> give what you want?
>>
>>    By the way, if you want to use a loop (there's nothing wrong with
>> that),
>> then try to avoid the most common mistake that people make with loops in
>> R:
>> having your result grow inside the loop.  Here's a better way to use a
>> loop
>> to solve your problem:
>>
>> cover_per_1 <- function(data){
>>   l = length(data)
>>   output = numeric(l)
>>   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1,
>> 0))/length(data)
>>   output
>> }
>>
>> Using some random data, and comparing to your original cover_per function:
>>
>>  dat = rnorm(1000)
>>> system.time(one <- cover_per(dat))
>>>
>>   user  system elapsed
>>  0.816   0.000   0.824
>>
>>> system.time(two <- cover_per_1(dat))
>>>
>>   user  system elapsed
>>  0.792   0.000   0.805
>>
>> Not that big a speedup, but it does increase quite a bit as the problem
>> gets
>> larger.
>>
>> There are two obvious ways to speed up your function:
>>   1)  Eliminate the ifelse function, since automatic coersion from
>>       logical to numeric does the same thing.
>>   2)  Multiply by 100 and divide by the length outside the loop:
>>
>> cover_per_2 <- function(data){
>>   l = length(data)
>>   output = numeric(l)
>>   for(i in 1:l)output[i] = sum(data >= data[i])
>>   100 * output / l
>> }
>>
>>  system.time(three <- cover_per_2(dat))
>>>
>>   user  system elapsed
>>  0.024   0.000   0.027
>>
>> That makes the loop just about equivalent to the sapply solution:
>>
>>  system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat))
>>>
>>   user  system elapsed
>>  0.024   0.000   0.026
>>
>>                                        - Phil Spector
>>                                         Statistical Computing Facility
>>                                         Department of Statistics
>>                                         UC Berkeley
>>                                         [email protected]
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, 4 Nov 2010, Changbin Du wrote:
>>
>>  HI, Dear R community,
>>>
>>> I have one data set like this,  What I want to do is to calculate the
>>> cumulative coverage. The following codes works for small data set (#rows
>>> =
>>> 100), but when feed the whole data set,  it still running after 24 hours.
>>> Can someone give some suggestions for long vector?
>>>
>>> id                reads
>>> Contig79:1    4
>>> Contig79:2    8
>>> Contig79:3    13
>>> Contig79:4    14
>>> Contig79:5    17
>>> Contig79:6    20
>>> Contig79:7    25
>>> Contig79:8    27
>>> Contig79:9    32
>>> Contig79:10    33
>>> Contig79:11    34
>>>
>>>
>>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>>> sep="\t", skip=0, header=F,fill=T) #
>>> dim(matt)
>>> [1] 3384766       2
>>>
>>> matt_plot<-function(matt, outputfile) {
>>> names(matt)<-c("id","reads")
>>>
>>> cover<-matt$reads
>>>
>>>
>>> #calculate the cumulative coverage.
>>> + cover_per<-function (data) {
>>> + output<-numeric(0)
>>> + for (i in data) {
>>> +           x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>>> +           output<-c(output, x)
>>> +                 }
>>> + return(output)
>>> + }
>>>
>>>
>>> result<-cover_per(cover)
>>>
>>>
>>> Thanks so much!
>>>
>>>
>>> --
>>> Sincerely,
>>> Changbin
>>> --
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [email protected] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>
>
> --
> Sincerely,
> Changbin
> --
>
> Changbin Du
> DOE Joint Genome Institute
> Bldg 400 Rm 457
> 2800 Mitchell Dr
> Walnut Creet, CA 94598
> Phone: 925-927-2856
>
>
>


-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creet, CA 94598
Phone: 925-927-2856

        [[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to