No, for regression trees collinearity is a non-issue, because recursive partitioning
is not a linear procedure. Having variables that are linearly dependent (even exactly
so) merely widens the choice of cuts available to the algorithm.
I'm not sure what you mean by "Multicollinear variables should appear as alternate
splits". Do you mean that every second split should be in one variable of a
particular set? Perhaps you mean "alternative" instead of "alternate"? In either
case I think you are worrying over nothing. Just go ahead and do the tree-based model
analysis and don't worry about it.
Here is a little picture that might clarify things. Suppose Latitude and Longitude
are two variables on which the algorithm may choose to split. This means that splits
in these geographical variables can only occur in a North-South or an East-West
direction. Let's suppose you add in two extra variables that are completely determined
by the first two, namely
    LatPlusLong <- Latitude + Longitude
    LatMinusLong <- Latitude - Longitude
and now offer all four variables as potential split variables. Now the algorithm may
split North-South, East-West, NorthEast-SouthWest or NorthWest-SouthEast. All you
have done is increase the scope of choice for the algorithm to make splits. Not only
does the linear dependence not matter, but I'd argue it could be a pretty good thing.
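To make the picture concrete, here is a minimal sketch in base R. It is an illustration only: the simulated data, the seed, and the helper best_split_sse() are all hypothetical, and the helper just mimics the usual least-squares split criterion rather than calling any tree-fitting package. The response steps across a diagonal line, so no single cut in Latitude or Longitude captures it well, whereas a single cut in LatPlusLong does.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
n <- 400
Latitude  <- runif(n, -1, 1)                 # simulated coordinates (assumption)
Longitude <- runif(n, -1, 1)
LatPlusLong  <- Latitude + Longitude
LatMinusLong <- Latitude - Longitude

# Hypothetical response: a step across the diagonal Latitude + Longitude = 0
y <- ifelse(LatPlusLong > 0, 10, 0) + rnorm(n, sd = 0.5)

# Smallest residual sum of squares achievable with one binary cut in x
best_split_sse <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbouring values
  sse <- vapply(cuts, function(cc) {
    left  <- y[x <  cc]
    right <- y[x >= cc]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  min(sse)
}

# Offer all four variables as candidate split variables
candidates <- data.frame(Latitude, Longitude, LatPlusLong, LatMinusLong)
sse <- sapply(candidates, best_split_sse, y = y)
print(round(sse, 1))
best <- names(which.min(sse))                # variable giving the best first split
print(best)
```

Under these assumptions the first split should fall on LatPlusLong, which is exactly the extra direction that the linearly dependent variables made available.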
One serious message to take from this, though, is to use regression trees for
prediction. Don't read too much into the variables that the algorithm has chosen to
use at any stage: when several candidates carry much the same information, which one
gets picked can be close to arbitrary.
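A small sketch of why the chosen variable deserves little interpretation, again in base R with made-up data. TempC, TempF, and best_split_sse() are hypothetical names used only for this illustration. TempF is an exact linear function of TempC, so every cut in one corresponds to a cut in the other that divides the observations identically; the two variables therefore tie on split quality, and which one a tree reports is down to tie-breaking.

```r
set.seed(2)                                  # arbitrary seed, for reproducibility only
n <- 200
TempC <- runif(n, 0, 30)                     # hypothetical climate variable
TempF <- 32 + 1.8 * TempC                    # exactly linearly dependent on TempC
y <- ifelse(TempC > 15, 5, 0) + rnorm(n)     # hypothetical response

# Smallest residual sum of squares achievable with one binary cut in x
best_split_sse <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbouring values
  sse <- vapply(cuts, function(cc) {
    left  <- y[x <  cc]
    right <- y[x >= cc]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  min(sse)
}

sse_C <- best_split_sse(TempC, y)
sse_F <- best_split_sse(TempF, y)

# The two candidates are equally good: same partition, same fit
same_quality <- isTRUE(all.equal(sse_C, sse_F))
print(c(TempC = sse_C, TempF = sse_F))
print(same_quality)
```

The predictions from a tree built on either variable would be the same, which is why the fitted tree is better judged by how it predicts than by which of the dependent variables it names.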
Bill Venables.
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jean-Noel
Sent: Monday, 9 February 2004 8:25 PM
To: [EMAIL PROTECTED]
Subject: [R] Recursive partitioning with multicollinear variables
Dear all,
I would like to perform a regression tree analysis on a dataset with multicollinear
variables (as climate variables often are). The questions that I am asking are:
1- Is there any particular statistical problem in using multicollinear variables in a
regression tree?
2- Multicollinear variables should appear as alternate splits. Would it be more
accurate to present these alternate splits in the results of the analysis or apply a
variable selection or reduction procedure before the regression tree?
Thank you in advance,
Jean-Noel Candau
INRA - Unité de Recherches Forestières Méditerranéennes
Avenue A. Vivaldi
84000 AVIGNON
Tel: (33) 4 90 13 59 22
Fax: (33) 4 90 13 59 59
______________________________________________
[EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html