Thanks Frank and Steve. I rewrite the R code as follows. # m is the number of fold to split sample, n is the loop number of cross validation
library(caret) calcvnb<-function(formula,dat,m,n) { cvnb<-rep(0,20000) dim(cvnb)<-c(200,100) for (i in 1:n) { group<-rep(0,length=110) sg<-createFolds(dat$LN,k=m) for (k in 1:m) { group[sg[[k]]]<-k } pre<-rep(0,length=110) data1<-cbind(dat,group,pre) for (j in 1:m) { dev<-data1[data1$group!=j,] val<-data1[data1$group==j,] fit<-lrm(formula,data=dev,x=T,y=T) pre1<-predict(fit,val,type="fitted") data1[group==j,]$pre<-pre1 } dcaop<-dca(data1$LN,data1$pre,prob=("Y")) nb<-100*(dcaop[,1]-dcaop[,2]) cvnb[i,1:99]<-nb } mcvnb<-colMeans(cvnb) return(mcvnb) } # apply the function in main code optnb1<-calcvnb(formula=LN~factor(MTSTAGE)+factor(GRADE)+LVINVAS+P53,dat=cMain,m=10,n=200) However, applied to my data, a error occurred after several loops "Error in contrasts <- '('*tmp*',value="contr.treatment"): contrasts can be applied only to factors with 2 or more levels". Whats wrong with my code and how to handle it ? Yao Zhu Department of Urology Fudan University Shanghai Cancer Center Shanghai, China Yao Zhu Department of Urology Fudan University Shanghai Cancer Center Shanghai, China 2010/1/21 zhu yao <mailzhu...@gmail.com> > Hi, everyone: > > I ask for help about translating a stata program into R. > > The program perform cross validation as it stated. > > #1. Randomly divide the data set into 10 sets of equal size, ensuring equal > numbers of events in each set > #2. Fit the model leaving out the 1st set > #3. Apply the fitted model in (2) to the 1st set to obtain the predicted > probability of a prostate cancer diagnosis. > #4. Repeat steps (2) to (3) leaving out and then applying the fitted model > to the ith group, i = 2, 3... 10. Every subject now has a predicted > probability of a prostate cancer diagnosis. > #5. Using the predicted probabilities, compute the net benefit at various > threshold probabilities. > #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each > threshold probability is the mean across the 200 replications. > > ================================================================= > First is stata code. > > forvalues i=1(1)200 { > local event="cancer" > local predictors1 = "total_psa" > local predictors2 = "total_psa free_psa" > local prediction1 = "base" > local prediction2 = "full" > > g `prediction1'=. > g `prediction2'=. > > quietly g u = uniform() > sort `event' u > g set = mod(_n, 10) + 1 > > forvalues j=1(1)10{ > quietly logit `event' `predictors1' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction1' = ptemp if set==`j' > drop ptemp > > > quietly logit `event' `predictors2' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction2' = ptemp if set==`j' > drop ptemp > } > tempfile dca`i' > quietly dca `event' `prediction1' `prediction2', graphoff > saving("`dca`i''") > > drop u set `prediction1' `prediction2' > } > > > use "`dca1'", clear > forvalues i=2(1)200 { > append using "`dca`i''" > } > > collapse all none modelp1 modelp2, by(threshold) > > save "cross validation dca output.dta", replace > > twoway(line none all modelp1 modelp2 threshold, sort) > > ================================================================= > Here is my draft of R code. cMain is my dataset. > > predca<-rep(0,40000) > dim(predca)<-c(200,200) > > for (i in 1:200) { > cvgroup<-rep(1:10,length=110) > cvgroup<-sample(cvgroup) > cvpre<-rep(0,length=110) > cvMain<-cbind(cMain,cvgroup,cvpre) > > for (j in 1:10) { > cvdev<-cvMain[cvMain$cvgroup!=j,] > cvval<-cvMain[cvMain$cvgroup==j,] > cvfit<-lrm(Y~X,data=cvdev,x=T,y=T) > cvprej<-predict(cvfit,cvval,type="fitted") > > #put the fitted value in dataset > cvMain[cvgroup==j,]$cvpre<-prej > } > > cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y")) > cvnb<-100*(cvdcaop[,1]-cvdcaop[,2]) > cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4]) > cvnr<-cvnb/cvtpthres > predca[cvn,1:99]<-cvnb > predca[cvn,101:199]<-cvnr > } > > ================================================================= > > My questions are > 1. How to ensure equal numbers of events in each set in R? > 2. A part of stata code is > forvalues j=1(1)10{ > quietly logit `event' `predictors1' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction1' = ptemp if set==`j' > drop ptemp > > quietly logit `event' `predictors2' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction2' = ptemp if set==`j' > drop ptemp > } > I don't understand what's the difference between prediction1 and > prediction2 > 3. Is my code right? > > Thanks ! > > Yao Zhu > Department of Urology > Fudan University Shanghai Cancer Center > Shanghai, China > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.