[R] Variable Selection for Logistic Regression

Manish MAHESHWARI Thu, 17 Dec 2015 07:29:25 -0800

Hi,

I have a dataset with approximately 400K rows and 900 columns, with a single
dependent variable that is a 0/1 flag. The independent variables are both
categorical and numerical. I have looked at Stack Overflow and Cross Validated
posts but couldn't find an answer for this.


Since I cannot try all possible combinations of variables, or even fit a
single model with all 900 columns, I am planning to fit a separate
single-variable model for each column, using something like the code below:

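library(plyr)  # provides rbind.fill(); fmsb is called via fmsb::
out = NULL
xnames = colnames(train)[!colnames(train) %in% ignoredcols]
for (f in xnames) {
    # one single-predictor model per column; "- 1" drops the intercept
    glmm = glm(train$conversion_flag ~ train[,f] - 1 , family = binomial)
    # one row per variable and metric: variable name, metric name, value
    out = rbind.fill(out, data.frame(f = f, metric = 'R2', value = fmsb::NagelkerkeR2(glmm)$R2))
    out = rbind.fill(out, data.frame(f = f, metric = 'AIC', value = summary(glmm)$aic))
}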

This will give me the individual AIC and Nagelkerke pseudo R2 for each
variable. After that I plan to select the variables with the best scores on
both measures, i.e. the lowest AIC and the highest pseudo R2. Does this
approach make sense?
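
To make the ranking step concrete, here is a minimal sketch on simulated data.
Everything in it is an assumption for illustration only: the column names
(x_strong, x_noise, x_cat), the helper nagelkerke_r2(), and the choice to keep
the intercept in each model. The pseudo R2 is computed from the glm deviances
in base R rather than with fmsb, so the block runs without extra packages.

```r
# Simulated stand-in for `train`; all column names here are hypothetical.
set.seed(1)
n <- 500
train <- data.frame(
  x_strong = rnorm(n),
  x_noise  = rnorm(n),
  x_cat    = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
train$conversion_flag <- rbinom(n, 1, plogis(1.5 * train$x_strong))

# Nagelkerke pseudo R2 from the residual and null deviances of a binomial glm.
nagelkerke_r2 <- function(fit) {
  n <- length(fit$y)
  (1 - exp((fit$deviance - fit$null.deviance) / n)) /
    (1 - exp(-fit$null.deviance / n))
}

# Fit one single-predictor logistic model per column and collect both scores.
xnames <- setdiff(colnames(train), "conversion_flag")
scores <- do.call(rbind, lapply(xnames, function(f) {
  fit <- glm(reformulate(f, response = "conversion_flag"),
             data = train, family = binomial)
  data.frame(variable = f, aic = AIC(fit), r2 = nagelkerke_r2(fit))
}))

# Lower AIC is better; higher pseudo R2 is better.
ranked <- scores[order(scores$aic), ]
print(ranked)
```

Keeping one row per variable, with AIC and pseudo R2 as separate numeric
columns, makes it easy to sort on either measure or to keep the top few.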

I will of course use n-fold cross-validation on the final model to check its
accuracy and guard against overfitting. Before I get to that stage, however,
I plan to use the approach above to decide which variables to include.
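
For reference, the cross-validation step I have in mind would look roughly
like the sketch below. It is only an illustration under assumptions: the data
are simulated, the predictors x1 and x2 and the fold count k = 5 are made up,
and held-out log loss stands in for whatever accuracy measure is finally used.

```r
# Simulated stand-in for the final modelling data; x1 and x2 are hypothetical.
set.seed(2)
n <- 600
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$conversion_flag <- rbinom(n, 1, plogis(1.2 * dat$x1))

# Randomly assign each row to one of k folds of (nearly) equal size.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
cv_logloss <- sapply(seq_len(k), function(i) {
  fit <- glm(conversion_flag ~ x1 + x2,
             data = dat[fold != i, ], family = binomial)
  p <- predict(fit, newdata = dat[fold == i, ], type = "response")
  y <- dat$conversion_flag[fold == i]
  -mean(y * log(p) + (1 - y) * log(1 - p))  # held-out log loss
})

mean_logloss <- mean(cv_logloss)
print(mean_logloss)
```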

Thanks,
Manish

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.