Jim,
With regard to variable and model selection, you might consider Bayesian
model averaging (the BMA package) or some form of shrinkage (the lars or
lasso2 packages).

Scott Millis




________________________________
 From: Jin Minming <jminm...@yahoo.com>
To: "r-help@r-project.org" <r-help@r-project.org>; SR Millis <aa3...@wayne.edu> 
Sent: Monday, January 30, 2012 11:30 AM
Subject: Re: [R] Fw: Variable selection based on both training and testing data

Dear Scott,

I am sorry; I think I just sent you an empty email.
Thanks a lot for your advice.

The problem is that we do not have sufficient prior knowledge of the form of 
the regression, or even of the appropriate inputs. We need to find some 
candidate regression equations and then add our interpretation to them, so we 
need to explore a lot of options.  The two input datasets are very different 
in nature and come from two locations.  Hence, one of them can be used for 
testing, although it may turn out that no single regression is appropriate 
because of the intrinsic differences between the two datasets. 

In fact, if I could extract every model that the stepAIC function visits (not 
only the final model), it would be easy to add a simple script that calculates 
R2 or RMSE for both datasets. 
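
A minimal sketch of one way to record the models along a stepwise path, 
assuming the `keep` argument of step() in the stats package (stepAIC in MASS 
documents an argument of the same name). The data are simulated stand-ins, 
and the names make_data, rmse, keep_fit, and path are hypothetical. Note that 
choosing a model by its test RMSE makes the test set part of the selection.

```r
# Simulated stand-ins for the two datasets (hypothetical data).
set.seed(1)
make_data <- function(n) {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
  d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
  d
}
train <- make_data(100)
test  <- make_data(100)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# step() calls keep(model, aic) for the starting model and after each step;
# whatever the function returns is stored in the `keep` component.
keep_fit <- function(model, aic) {
  list(
    formula    = paste(deparse(formula(model)), collapse = " "),
    aic        = aic,
    rmse_train = rmse(train$y, fitted(model)),
    rmse_test  = rmse(test$y, predict(model, newdata = test))
  )
}

full <- lm(y ~ x1 + x2 + x3 + x4, data = train)
sel  <- step(full, direction = "backward", keep = keep_fit, trace = 0)

# sel$keep is a list-matrix: one row per kept component, one column per model.
path <- data.frame(
  formula    = unlist(sel$keep["formula", ]),
  aic        = unlist(sel$keep["aic", ]),
  rmse_train = unlist(sel$keep["rmse_train", ]),
  rmse_test  = unlist(sel$keep["rmse_test", ])
)
print(path)
```

Each row of path describes one model on the search path, so the training and 
test RMSE can be compared side by side.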

Thanks,

Jim


--- On Mon, 30/1/12, SR Millis <aa3...@wayne.edu> wrote:

> From: SR Millis <aa3...@wayne.edu>
> Subject: [R] Fw: Variable selection based on both training and testing data
> To: "r-help@r-project.org" <r-help@r-project.org>
> Date: Monday, 30 January, 2012, 14:57
> 
> 
> From: SR Millis <srmil...@yahoo.com>
> To: Jin Minming <jminm...@yahoo.com>
> 
> Sent: Monday, January 30, 2012 9:25 AM
> Subject: Re: [R] Variable selection based on both training
> and testing data
>  
> 
> Jim,
> 
> First, stepwise methods for variable selection should be
> avoided.  Frank Harrell (in Regression Modeling Strategies)
> discusses this at length.
> 
> Second, splitting a dataset into training and validation
> sets is generally not a good idea unless you have a really
> large sample, e.g., > 20,000.  As Harrell has discussed,
> split-sample validation does not provide external
> validation, is terribly inefficient, and is arbitrary. 
> It's better to specify your model a priori and use the
> bootstrap to obtain an estimate of your model's
> over-optimism.  Bootstrapping can be implemented with
> Harrell's rms package in R.
> 
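
The bootstrap optimism estimate described above can be sketched in base R as 
follows. This is an illustration under stated assumptions, not the rms 
implementation: the data are simulated, the model formula is a hypothetical 
pre-specified one, and the names r2, apparent, optimism, and corrected are 
made up for the example.

```r
# Simulated data standing in for a real sample (hypothetical).
set.seed(42)
n <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 1.5 * dat$x1 + rnorm(n)

# Model specified a priori, before looking at the data.
model_formula <- y ~ x1 + x2 + x3

r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Apparent performance: fit and evaluate on the same sample.
fit_full <- lm(model_formula, data = dat)
apparent <- r2(dat$y, fitted(fit_full))

# For each bootstrap resample, refit the model and compare its R2 on the
# resample with its R2 on the original sample; the difference is the optimism.
B <- 200
optimism <- replicate(B, {
  boot  <- dat[sample(n, replace = TRUE), ]
  fit_b <- lm(model_formula, data = boot)
  r2(boot$y, fitted(fit_b)) - r2(dat$y, predict(fit_b, newdata = dat))
})

# Optimism-corrected estimate of R2.
corrected <- apparent - mean(optimism)
print(c(apparent = apparent, optimism = mean(optimism), corrected = corrected))
```

Every observation is used both for fitting and for validation, which is the 
efficiency argument made against split-sample validation.
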
> Scott
>  
> ~~~~~~~~~~~
> Scott R Millis, PhD, ABPP, CStat, PStat®
> Professor
> Wayne State University School of Medicine
> Email:  aa3...@wayne.edu
> Email:  srmil...@yahoo.com
> Tel: 313-993-8085
> 
> 
> ________________________________
> 
> To: r-help@r-project.org
> 
> Sent: Monday, January 30, 2012 8:14 AM
> Subject: [R] Variable selection based on both training and
> testing data
> 
> Dear all,
> 
> Variable selection in regression is usually determined
> from the training data using AIC or an F value, as in stepAIC.
> Is there an R package that can consider both the training
> and the test dataset? For example, I have separate training
> and test datasets. First, a regression model is fitted
> to the training data, and then this model is evaluated
> on the test data. This process is repeated in order to find
> some possibly optimal models in terms of RMSE or R2 for both
> the training and the test data. 
> 
> Thanks,
> 
> Jim
> 
> ______________________________________________
> R-help@r-project.org
> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained,
>  reproducible code.
