Thanks, William. That taught me a useful lesson.

On Fri, Nov 5, 2010 at 9:58 AM, William Dunlap <[email protected]> wrote:

> The following cover_per_3 uses sorting to solve
> the problem more quickly.  It still has room
> for improvement.
>
> cover_per_3 <- function (data)
> {
>    n <- length(data)
>    o <- rev(order(data))           # indices of data, largest value first
>    sdata <- data[o]                # data sorted in decreasing order
>    r <- rle(sdata)$lengths         # run lengths of tied values
>    output <- numeric(n)
>    output[o] <- rep(cumsum(r), r)  # count of elements >= each value
>    100 * output/n
> }
>
> (The ecdf function would probably also do the
> job quickly.)
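> A rank-based one-liner (just a sketch along the same lines;
> with ties.method="min", rank(x) is one plus the number of
> elements strictly below x, so n - rank + 1 counts the
> elements >= x) should give the same answers:
>
> cover_per_4 <- function (data)
> {
>    n <- length(data)
>    # tied values share the smallest rank, so ">=" counts come out right
>    100 * (n - rank(data, ties.method = "min") + 1) / n
> }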
>
> When trying to work on problems like this I find
> it most fruitful to work on smaller datasets and
> see how the time grows with the size of the data,
> instead of seeing how many days it takes on a huge
> dataset.  E.g., the following compares times for
> your original function, Phil Spector's simple cleanup
> of your function, and the sort-based approach for
> vectors of length 5 and 10 thousand.
>
> > z<-sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>   user  system elapsed
>  38.21    0.00   38.41
>   user  system elapsed
>   0.86    0.00    0.86
>   user  system elapsed
>      0       0       0
> > identical(v_3,v)
> [1] TRUE
> > z<-sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>   user  system elapsed
>  158.48    0.07  159.31
>   user  system elapsed
>   3.23    0.00    3.25
>   user  system elapsed
>   0.02    0.00    0.02
> > identical(v_3,v)
> [1] TRUE
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Changbin Du
> > Sent: Friday, November 05, 2010 9:14 AM
> > To: Phil Spector
> > Cc: <[email protected]>
> > Subject: Re: [R] how to work with long vectors
> >
> > HI, Phil,
> >
> > I used the following code and ran it overnight for 15 hours;
> > this morning I stopped it. It seems it is still not efficient.
> >
> >
> > >
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F, fill=T)
> > > names(matt)<-c("id","reads")
> >
> > > dim(matt)
> > [1] 3384766       2
> >
> > >  cover<-matt$reads
> >
> > > cover_per_2 <- function(data){
> > +   l = length(data)
> > +   output = numeric(l)
> > +   for(i in 1:l)output[i] = sum(data >= data[i])
> > +   100 * output / l
> > + }
> >
> > > result3<-cover_per_2(cover)
> >
> > On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du
> > <[email protected]> wrote:
> >
> > > Thanks Phil, that is great! I will try this and let you know how it goes.
> > >
> > > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector
> > <[email protected]>wrote:
> > >
> > >> Changbin -
> > >>   Does
> > >>
> > >>    100 * sapply(matt$reads,function(x)sum(matt$reads >=
> > >> x))/length(matt$reads)
> > >>
> > >> give what you want?
> > >>
> > >>    By the way, if you want to use a loop (there's nothing
> > >> wrong with that), then try to avoid the most common mistake
> > >> that people make with loops in R: having your result grow
> > >> inside the loop.  Here's a better way to use a loop to solve
> > >> your problem:
> > >>
> > >> cover_per_1 <- function(data){
> > >>   l = length(data)
> > >>   output = numeric(l)
> > >>   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1,
> > >> 0))/length(data)
> > >>   output
> > >> }
> > >>
> > >> Using some random data, and comparing to your original
> > >> cover_per function:
> > >>
> > >>  dat = rnorm(1000)
> > >>> system.time(one <- cover_per(dat))
> > >>>
> > >>   user  system elapsed
> > >>  0.816   0.000   0.824
> > >>
> > >>> system.time(two <- cover_per_1(dat))
> > >>>
> > >>   user  system elapsed
> > >>  0.792   0.000   0.805
> > >>
> > >> Not that big a speedup, but it does increase quite a bit
> > >> as the problem gets larger.
> > >>
> > >> There are two obvious ways to speed up your function:
> > >>   1)  Eliminate the ifelse function, since automatic coercion from
> > >>       logical to numeric does the same thing.
> > >>   2)  Multiply by 100 and divide by the length outside the loop:
> > >>
> > >> cover_per_2 <- function(data){
> > >>   l = length(data)
> > >>   output = numeric(l)
> > >>   for(i in 1:l)output[i] = sum(data >= data[i])
> > >>   100 * output / l
> > >> }
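> > >>
> > >> (As a quick check of point 1, logical values coerce to 0/1
> > >> under sum, e.g. sum(c(2, 5, 1) >= 2) gives 2, so the
> > >> ifelse(..., 1, 0) wrapper is unnecessary.)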
> > >>
> > >>  system.time(three <- cover_per_2(dat))
> > >>>
> > >>   user  system elapsed
> > >>  0.024   0.000   0.027
> > >>
> > >> That makes the loop just about equivalent to the sapply solution:
> > >>
> > >>  system.time(four <- 100*sapply(dat,function(x)sum(dat >=
> > x))/length(dat))
> > >>>
> > >>   user  system elapsed
> > >>  0.024   0.000   0.026
> > >>
> > >>                                        - Phil Spector
> > >>                                         Statistical
> > Computing Facility
> > >>                                         Department of Statistics
> > >>                                         UC Berkeley
> > >>                                         [email protected]
> > >>
> > >> On Thu, 4 Nov 2010, Changbin Du wrote:
> > >>
> > >>  HI, Dear R community,
> > >>>
> > >>> I have a data set like this. What I want to do is to
> > >>> calculate the cumulative coverage. The following code works
> > >>> for a small data set (#rows = 100), but when fed the whole
> > >>> data set, it is still running after 24 hours. Can someone
> > >>> give some suggestions for long vectors?
> > >>>
> > >>> id                reads
> > >>> Contig79:1    4
> > >>> Contig79:2    8
> > >>> Contig79:3    13
> > >>> Contig79:4    14
> > >>> Contig79:5    17
> > >>> Contig79:6    20
> > >>> Contig79:7    25
> > >>> Contig79:8    27
> > >>> Contig79:9    32
> > >>> Contig79:10    33
> > >>> Contig79:11    34
> > >>>
> > >>>
> > >>>
> > >>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > >>> sep="\t", skip=0, header=F, fill=T)
> > >>> dim(matt)
> > >>> [1] 3384766       2
> > >>>
> > >>> matt_plot<-function(matt, outputfile) {
> > >>> names(matt)<-c("id","reads")
> > >>>
> > >>> cover<-matt$reads
> > >>>
> > >>>
> > >>> # calculate the cumulative coverage
> > >>> cover_per <- function (data) {
> > >>>   output <- numeric(0)
> > >>>   for (i in data) {
> > >>>     x <- (100*sum(ifelse(data >= i, 1, 0))/length(data))
> > >>>     output <- c(output, x)   # the result grows inside the loop
> > >>>   }
> > >>>   return(output)
> > >>> }
> > >>>
> > >>>
> > >>> result<-cover_per(cover)
> > >>>
> > >>>
> > >>> Thanks so much!
> > >>>
> > >>>
> > >>> --
> > >>> Sincerely,
> > >>> Changbin
> > >>> --
> > >>>
> > >>>
> > >>> ______________________________________________
> > >>> [email protected] mailing list
> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>> PLEASE do read the posting guide
> > >>> http://www.R-project.org/posting-guide.html
> > >>> and provide commented, minimal, self-contained, reproducible code.
> > >>>
> > >>>
> > >
> > >
> > > --
> > > Sincerely,
> > > Changbin
> > > --
> > >
> > > Changbin Du
> > > DOE Joint Genome Institute
> > > Bldg 400 Rm 457
> > > 2800 Mitchell Dr
> > > Walnut Creek, CA 94598
> > > Phone: 925-927-2856
> > >
> > >
> > >
> >
> >
> > --
> > Sincerely,
> > Changbin
> > --
> >
> > Changbin Du
> > DOE Joint Genome Institute
> > Bldg 400 Rm 457
> > 2800 Mitchell Dr
> > Walnut Creek, CA 94598
> > Phone: 925-927-2856
> >
> >
> >
>



-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856


