Re: [R] Speeding up prediction of survival estimates when using `survifit'

2010-08-31 Thread Frank Harrell



Frank E Harrell Jr   Professor and Chairman        School of Medicine
                     Department of Biostatistics   Vanderbilt University

On Mon, 30 Aug 2010, Ravi Varadhan wrote:


Hi,

I fit a Cox PH model to estimate the cause-specific hazards (in a competing 
risks setting).  Then, I compute the survival estimates for all the 
individuals in my data set using the `survfit' function.  I am currently 
playing with a data set that has about 6000 observations and 12 covariates.  I 
am finding that the survfit function is very slow.

Here is a simple simulation example (modified from Frank Harrell's example for 
`cph') that illustrates the problem:

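library(survival)   # provides Surv, coxph, survfit

n <- 500            # sample size; also run with 1000, 2000, 4000
set.seed(4321)

# Simulated covariates
age <- 50 + 12*rnorm(n)
sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))

# Censoring times and true hazard
cens <- 5 * runif(n)
h <- 0.02 * exp(0.04 * (age-50) + 0.8 * (sex=='Female'))

# Event times, event indicator, and observed follow-up time
dt <- -log(runif(n))/h
e <- ifelse(dt <= cens, 1, 0)
dt <- pmin(dt, cens)

Srv <- Surv(dt, e)

f <- coxph(Srv ~ age + sex, x=TRUE, y=TRUE)

# newdata=f$x reuses the design matrix stored by x=TRUE
system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))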


When I run the above code with sample sizes, n, taking on values of 500, 1000, 
2000, and 4000, the times survfit takes to run are as follows:

# n <- 500

system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))

   user  system elapsed
   0.16    0.00    0.15


# n <- 1000

system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))

   user  system elapsed
   1.45    0.00    1.48


# n <- 2000

system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))

   user  system elapsed
  10.19    0.00   10.25


# n <- 4000

system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))

   user  system elapsed
  72.87    0.05   74.87


I eventually want to use `survfit' on a data set with roughly 50K observations, 
which I am afraid is going to be painfully slow.  I would much appreciate 
hints/suggestions on how to make `survfit' faster or any other faster 
alternatives.


Ravi,

If you don't need standard errors/confidence limits, the rms package's 
survest and related functions can speed things up greatly if you fit 
the model using cph(, surv=TRUE).  [cph calls coxph, and calls 
survfit once to estimate the underlying survival curve].
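
A minimal sketch of the approach suggested above, assuming the rms package is installed. The data frame name `d`, the evaluation times in `times`, and the smaller sample size are illustrative choices, not part of the original post. The model is fit with cph(..., surv=TRUE) so that the underlying survival curve is stored in the fit, and survest is then asked for per-subject survival probabilities without standard errors.

```r
library(survival)
library(rms)

set.seed(4321)
n    <- 500
age  <- 50 + 12 * rnorm(n)
sex  <- factor(sample(c("Male", "Female"), n, replace = TRUE, prob = c(.6, .4)))
cens <- 5 * runif(n)
h    <- 0.02 * exp(0.04 * (age - 50) + 0.8 * (sex == "Female"))
dt   <- -log(runif(n)) / h
e    <- ifelse(dt <= cens, 1, 0)
dt   <- pmin(dt, cens)

# Hypothetical data frame holding the simulated variables
d <- data.frame(dt, e, age, sex)

# surv=TRUE stores the underlying survival estimate in the fit object
f <- cph(Surv(dt, e) ~ age + sex, data = d, x = TRUE, y = TRUE, surv = TRUE)

# Illustrative time points (years) at which to evaluate survival
times <- c(1, 2, 3)

# No standard errors or confidence limits are requested
timing <- system.time(
  est <- survest(f, newdata = d, times = times,
                 conf.int = FALSE, se.fit = FALSE)
)
print(timing)
```

The estimates are returned in est$surv. Whether this matches the type="aalen" estimates from survfit exactly depends on the estimator cph stores, so it is worth comparing the two on a small subset before scaling up.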


Frank



Thanks.

Best,
Ravi.


Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvarad...@jhmi.edu


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Speeding up prediction of survival estimates when using `survifit'

2010-08-30 Thread Ravi Varadhan
Hi,

I fit a Cox PH model to estimate the cause-specific hazards (in a competing 
risks setting).  Then, I compute the survival estimates for all the 
individuals in my data set using the `survfit' function.  I am currently 
playing with a data set that has about 6000 observations and 12 covariates.  I 
am finding that the survfit function is very slow.  

Here is a simple simulation example (modified from Frank Harrell's example for 
`cph') that illustrates the problem:

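library(survival)   # provides Surv, coxph, survfit

n <- 500            # sample size; also run with 1000, 2000, 4000
set.seed(4321)

# Simulated covariates
age <- 50 + 12*rnorm(n)
sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))

# Censoring times and true hazard
cens <- 5 * runif(n)
h <- 0.02 * exp(0.04 * (age-50) + 0.8 * (sex=='Female'))

# Event times, event indicator, and observed follow-up time
dt <- -log(runif(n))/h
e <- ifelse(dt <= cens, 1, 0)
dt <- pmin(dt, cens)

Srv <- Surv(dt, e)

f <- coxph(Srv ~ age + sex, x=TRUE, y=TRUE)

# newdata=f$x reuses the design matrix stored by x=TRUE
system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))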


When I run the above code with sample sizes, n, taking on values of 500, 1000, 
2000, and 4000, the times survfit takes to run are as follows:

# n <- 500
system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))
   user  system elapsed 
   0.16    0.00    0.15 


# n <- 1000
system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))
   user  system elapsed 
   1.45    0.00    1.48 


# n <- 2000
system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))
   user  system elapsed 
  10.19    0.00   10.25 


# n <- 4000
system.time(ans <- survfit(f, type="aalen", se.fit=FALSE, newdata=f$x))
   user  system elapsed 
  72.87    0.05   74.87 


I eventually want to use `survfit' on a data set with roughly 50K observations, 
which I am afraid is going to be painfully slow.  I would much appreciate 
hints/suggestions on how to make `survfit' faster or any other faster 
alternatives.  
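
To put a rough number on the concern about 50K observations, the sketch below fits a log-log regression to the four elapsed times reported above. The variable names are illustrative, and the extrapolation assumes the same growth trend continues well beyond n = 4000, which four timings from one machine cannot establish.

```r
# Reported sample sizes and elapsed times (seconds) from the runs above
n       <- c(500, 1000, 2000, 4000)
elapsed <- c(0.15, 1.48, 10.25, 74.87)

# If time ~ c * n^k, then log(time) is linear in log(n) with slope k
fit      <- lm(log(elapsed) ~ log(n))
exponent <- unname(coef(fit)[2])

# Naive extrapolation to n = 50000, assuming the trend holds
pred_50k <- unname(exp(predict(fit, newdata = data.frame(n = 50000))))

cat("estimated growth exponent:", round(exponent, 2), "\n")
cat("extrapolated elapsed time at n = 50000 (hours):",
    round(pred_50k / 3600, 1), "\n")
```

The fitted exponent comes out close to 3, i.e. each doubling of n costs roughly eight times as long, so a naive extrapolation to 50K observations lands in the range of tens of hours.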

Thanks.

Best,
Ravi.


Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvarad...@jhmi.edu
