This entire procedure is not valid. You cannot use a penalized method for selecting variables then use an unpenalized procedure on those selected.
Frank David Winsemius wrote > On Sep 28, 2013, at 2:39 AM, E Joffe wrote: > >> Hi all, >> >> I am using COX LASSO (glmnet / coxnet) regression to analyze a >> dataset of >> 394 obs. / 268 vars. >> I use the following procedure: >> 1. Construct a coxnet on the entire dataset (by cv.glmnet) >> 2. Pick the significant features by selecting the non-zero coefficient >> under the best lambda selected by the model >> 3. Build a coxph model with bi-directional stepwise feature selection >> limited to the coxnet selected features. > > I was a bit puzzled by the third step. Once you had a reduced model > from glmnet, what was the statistical basis for further elimination of > variables? > > (Quite apart from the statistical issues, I was rather surprised that > this procedure even produced results since the 'step' function is not > described in the 'stats' package as applying to 'coxph' model objects.) > >> >> To validate the model I use both Brier score (library=peperr) and >> Harrel's [Harrell] >> C-Index (library=Hmisc) with a bootstrap of 50 iterations. >> >> >> MY QUESTION : I am getting fairly good C-Index (0.76) and Brier >> (0.07) >> values for the models however per the coxnet the %Dev explained by >> the model >> is at best 0.27 and when I plot the survfit of the coxph the plotted >> confidence interval is very large. > > How many events did you have? (The width of CI's is most importantly > dependent on event counts and not particularly improved by a high case > count. The power considerations are very similar to those of a > binomial test.) > > >> What am I missing here ? > > Perhaps sufficient events? (You also seem to be missing a description > of the study goals.) > > > -- > David. > >> >> %DEV=27% >> >> >> >> Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24) >> C-Index - 0.76 ($cIndex -> [1] 0.763) >> >> >> >> DATA: [Private Health Information - can't publish] 394 obs./268 vars. >> >> CODE (need to define a dataset with 'time' and 'status' variables): >> >> library("survival") >> library ("glmnet") >> library ("c060") >> library ("peperr") >> library ("Hmisc") >> >> #creat Y (survival matrix) for glmnet >> surv_obj <- Surv(dataset$time,dataset$status) >> >> >> ## tranform categorical variables into binary variables with >> dummy for >> dataset >> predict_matrix <- model.matrix(~ ., data=dataset, >> contrasts.arg = lapply >> (dataset[,sapply(dataset, is.factor)], contrasts)) >> >> ## remove the statu/time variables from the predictor matrix (x) >> for >> glmnet >> predict_matrix <- subset (predict_matrix, select=c(-time,-status)) >> >> ## create a glmnet cox object using lasso regularization and cross >> validation >> glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox") >> >> ## get the glmnet model on the full dataset >> glmnet.obj <- glmnet.cv$glmnet.fit >> >> # find lambda index for the models with least partial likelihood >> deviance (by cv.glmnet) >> optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous >> model use lambda.1se >> lambda.index <- which(glmnet.obj$lambda==optimal.lambda) >> >> >> # take beta for optimal lambda >> optimal.beta <- glmnet.obj$beta[,lambda.index] >> >> # find non zero beta coef >> nonzero.coef <- abs(optimal.beta)>0 >> selectedBeta <- optimal.beta[nonzero.coef] >> >> # take only covariates for which beta is not zero >> selectedVar <- predict_matrix[,nonzero.coef] >> >> # create a dataframe for trainSet with time, status and selected >> variables in binary representation for evaluation in pec >> reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar)) >> >> # glmnet.cox only with meaningful features selected by stepwise >> bidirectional AIC feature selection >> glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~ >> .,data=reformat_dataSet),direction="both") >> >> >> >> >> ##-------------------------------------------------------------------------- >> ----------------------------- >> ## MODEL PERFORMANCE >> >> ##-------------------------------------------------------------------------- >> ----------------------------- >> ## >> >> >> ## Calculate the Brier score - pec does its own bootstrap so this >> function runs on i=51 (i.e., whole trainset) >> >> ## Brier score calculation to cox-glmnet >> peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox, >> fit.fun=fit.coxph, load.all=TRUE, >> >> indices=resample.indices(n=nrow(surv_obj), >> method="boot", sample.n=50)) >> >> # Get error predictions from peperr >> prederr.coxglmnet <- perr(peperr.coxglmnet) >> >> # Integrated prediction error Brier score calculation >> ipec.coxglmnet<-ipec(prederr.coxglmnet, >> eval.times=peperr.coxglmnet$attribute, response=surv_obj) >> >> >> ## C-Index calculation 50 iter bootstrapping >> >> for (i in 1:50){ >> print (paste("Iteration:",i)) >> train <- sample(1:nrow(dataset), nrow(dataset), replace = >> TRUE) ## >> random sampling with replacement >> # create a dataframe for trainSet with time, status and >> selected >> variables in binary representation for evaluation in pec >> reformat_trainSet <- reformat_dataSet [train,] >> >> >> # glmnet.cox only with meaningful features selected by stepwise >> bidirectional AIC feature selection >> glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~ >> .,data=reformat_dataSet),direction="both") >> >> selectedVarCox <- >> predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")] >> reformat_testSet <- >> as.data.frame(cbind(surv_obj,selectedVarCox)) >> reformat_testSet <- reformat_dataSet [-train,] >> >> >> # compute c-index (Harrell's) for cox-glmnet models >> if (is.null(glmnet.cox.meaningful)){ >> cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL) >> }else{ >> cIndexCoxglmnet <- c(cIndexCoxglmnet, >> 1-rcorr.cens(predict(glmnet.cox.meaningful, >> reformat_testSet),Surv(reformat_testSet$time,reformat_testSet >> $status))[1]) >> } >> } >> >> #Get average C-Index >> cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE) >> >> #create a list of all the objects generated >> >> assign >> (name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li >> st(glmnet.obj), >> >> selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox), >> >> glmnet >> .cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c >> oxglmnet), >> cIndex=cIndex)) >> >> # save image of the workspace after each iteration >> save.image("final_subgroup_analysissubgroup_analysis.RData") >> >> >> ______________________________________________ >> > R-help@ > mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > > David Winsemius, MD > Alameda, CA, USA > > ______________________________________________ > R-help@ > mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ----- Frank Harrell Department of Biostatistics, Vanderbilt University -- View this message in context: http://r.789695.n4.nabble.com/Interperting-results-of-glmnet-and-coxph-plot-Brier-score-and-Harrel-s-C-Index-am-I-doing-something--tp4677166p4677175.html Sent from the R help mailing list archive at Nabble.com. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.