Re: [R] cross validation function translated from stata

zhu yao Thu, 21 Jan 2010 18:04:45 -0800

Thanks Frank and Steve.
I rewrite the R code as follows.

# m is the number of fold to split sample, n is the loop number of cross
validation


library(caret)

calcvnb<-function(formula,dat,m,n)
{

cvnb<-rep(0,20000)
dim(cvnb)<-c(200,100)
 for (i in 1:n)
{
 group<-rep(0,length=110)
sg<-createFolds(dat$LN,k=m)
for (k in 1:m)
{
group[sg[[k]]]<-k
}
 pre<-rep(0,length=110)
data1<-cbind(dat,group,pre)
 for (j in 1:m)
{
dev<-data1[data1$group!=j,]
val<-data1[data1$group==j,]
 fit<-lrm(formula,data=dev,x=T,y=T)
pre1<-predict(fit,val,type="fitted")

data1[group==j,]$pre<-pre1
}

dcaop<-dca(data1$LN,data1$pre,prob=("Y"))
nb<-100*(dcaop[,1]-dcaop[,2])
cvnb[i,1:99]<-nb
}
mcvnb<-colMeans(cvnb)
return(mcvnb)
}

# apply the function in main code

optnb1<-calcvnb(formula=LN~factor(MTSTAGE)+factor(GRADE)+LVINVAS+P53,dat=cMain,m=10,n=200)


However, applied to my data,  a error occurred after several loops "Error in
contrasts <- '('*tmp*',value="contr.treatment"):
contrasts can be applied only to factors with 2 or more levels".

Whats wrong with my code and how to handle it ?

Yao Zhu
Department of Urology
Fudan University Shanghai Cancer Center
Shanghai, China
Yao Zhu
Department of Urology
Fudan University Shanghai Cancer Center
Shanghai, China


2010/1/21 zhu yao <mailzhu...@gmail.com>

> Hi, everyone:
>
> I ask for help about translating a stata program into R.
>
> The program perform cross validation as it stated.
>
> #1. Randomly divide the data set into 10 sets of equal size, ensuring equal
> numbers of events in each set
> #2. Fit the model leaving out the 1st set
> #3. Apply the fitted model in (2) to the 1st set to obtain the predicted
> probability of a prostate cancer diagnosis.
> #4. Repeat steps (2) to (3) leaving out and then applying the fitted model
> to the ith group, i = 2, 3... 10. Every subject now has a predicted
> probability of a prostate cancer diagnosis.
> #5. Using the predicted probabilities, compute the net benefit at various
> threshold probabilities.
> #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each
> threshold probability is the mean across the 200 replications.
>
> =================================================================
> First is stata code.
>
> forvalues i=1(1)200 {
> local event="cancer"
> local predictors1 = "total_psa"
> local predictors2 = "total_psa free_psa"
> local prediction1 = "base"
> local prediction2 = "full"
>
> g `prediction1'=.
> g `prediction2'=.
>
> quietly g u = uniform()
> sort `event' u
> g set = mod(_n, 10) + 1
>
> forvalues j=1(1)10{
>  quietly logit `event' `predictors1' if set~=`j'
> quietly predict ptemp if set==`j'
>  quietly replace `prediction1' = ptemp if set==`j'
> drop ptemp
>
>
> quietly logit `event' `predictors2' if set~=`j'
> quietly predict ptemp if set==`j'
>  quietly replace `prediction2' = ptemp if set==`j'
> drop ptemp
> }
> tempfile dca`i'
> quietly dca `event' `prediction1' `prediction2', graphoff
> saving("`dca`i''")
>
> drop u set `prediction1' `prediction2'
> }
>
>
> use "`dca1'", clear
> forvalues i=2(1)200 {
>  append using "`dca`i''"
> }
>
> collapse all none modelp1 modelp2, by(threshold)
>
> save "cross validation dca output.dta", replace
>
> twoway(line none all modelp1 modelp2 threshold, sort)
>
> =================================================================
> Here is my draft of R code. cMain is my dataset.
>
> predca<-rep(0,40000)
> dim(predca)<-c(200,200)
>
>  for (i in 1:200) {
>  cvgroup<-rep(1:10,length=110)
> cvgroup<-sample(cvgroup)
>  cvpre<-rep(0,length=110)
> cvMain<-cbind(cMain,cvgroup,cvpre)
>
>  for (j in 1:10) {
> cvdev<-cvMain[cvMain$cvgroup!=j,]
> cvval<-cvMain[cvMain$cvgroup==j,]
>  cvfit<-lrm(Y~X,data=cvdev,x=T,y=T)
> cvprej<-predict(cvfit,cvval,type="fitted")
>
> #put the fitted value in dataset
> cvMain[cvgroup==j,]$cvpre<-prej
>  }
>
> cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y"))
>  cvnb<-100*(cvdcaop[,1]-cvdcaop[,2])
> cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4])
>  cvnr<-cvnb/cvtpthres
> predca[cvn,1:99]<-cvnb
> predca[cvn,101:199]<-cvnr
> }
>
> =================================================================
>
> My questions are
> 1. How to ensure equal numbers of events in each set in R?
> 2. A part of stata code is
>                         forvalues j=1(1)10{
> quietly logit `event' `predictors1' if set~=`j'
>  quietly predict ptemp if set==`j'
> quietly replace `prediction1' = ptemp if set==`j'
>  drop ptemp
>
> quietly logit `event' `predictors2' if set~=`j'
>  quietly predict ptemp if set==`j'
> quietly replace `prediction2' = ptemp if set==`j'
>  drop ptemp
> }
>   I don't understand what's the difference between prediction1 and
> prediction2
> 3. Is my code right?
>
> Thanks !
>
> Yao Zhu
> Department of Urology
> Fudan University Shanghai Cancer Center
> Shanghai, China
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] cross validation function translated from stata

Reply via email to