> On 20 Sep 2016, at 11:34, Michael Haenlein <haenl...@escpeurope.eu> wrote:
> Dear all,
> I am trying to estimate a lm model with one continuous dependent variable
> and 11 independent variables that are all categorical, some of which have
> many categories (several dozens in some cases).
If I’m not wrong (I assume the categorical variables are coded as factors), lm
will use treatment contrasts by default: the first level of each factor becomes
the reference category, and every other level gets its own dummy variable and
coefficient. So a factor with several dozen levels contributes several dozen
coefficients to the fit. (This might be wrong, please correct me, somebody.)
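A minimal sketch in base R of what I mean (the factor f here is made up for illustration): model.matrix() shows the dummy expansion that lm applies internally.

```r
# Hypothetical 3-level factor, just to show the default dummy coding.
f  <- factor(c("a", "b", "b", "c"))
mm <- model.matrix(~ f)
print(mm)
# The reference level "a" is absorbed into the intercept; the columns
# fb and fc are 0/1 indicators for the remaining levels, so a factor
# with k levels contributes k - 1 coefficients to an lm fit.
```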
> I am not interested in statistical inference to a larger population. The
> objective of my model is to find a way to best predict my continuous
> variable within the sample.
A good pick would be a CART (Classification and Regression Tree, rpart) or a
CIT (Conditional Inference Tree, ctree) model to predict a continuous response
variable from categorical predictors. Please see the new partykit package (the
successor to the old party package).
> When I run the lm model I evidently get many regression coefficients that
> are not significant. Is there some way to automatically combine levels of a
> categorical variable together if the regression coefficients for the
> individual levels are not significant?
> My idea is to find some form of grouping of the different categories that
> allows me to work with less levels while keeping or even improving the
> quality of predictions.
I also want to mention cforest here: with it you can measure the importance of
your predictor variables. I would recommend the partykit package for
categorical predictors, but you can also give rpart a try.
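A hedged sketch of the importance idea, again on made-up data. To keep the example in base R I use rpart's built-in variable.importance; varimp() on a partykit::cforest fit plays the same role.

```r
library(rpart)

# Simulated data (hypothetical names): y depends on x1 but not on x2.
set.seed(2)
n <- 400
d <- data.frame(
  x1 = factor(sample(letters[1:8], n, replace = TRUE)),
  x2 = factor(sample(c("u", "v"), n, replace = TRUE))
)
d$y <- ifelse(d$x1 %in% letters[1:4], 0, 2) + rnorm(n)

fit <- rpart(y ~ x1 + x2, data = d)
imp <- fit$variable.importance  # named vector, larger = more important
print(imp)
```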
> Rfirstname.lastname@example.org mailing list -- To UNSUBSCRIBE and more, see
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.