Bill's experiment should yield very nearly perfect discrimination in
one iteration of the process he describes.

It's well known that the two-group discriminant analysis problem he
defines is equivalent to multiple linear regression predicting one
dependent variable with (in this case) 900 independent
variables. The resulting R^2 (R-squared) will have an expected value
of 900/999 = .9009 even on pure noise, which would translate into a
(nominally) near-perfect discriminant analysis. What needs to be
done is to correct the R^2 for attenuation -- in which case, under the
circumstances described, the expected ADJUSTED R^2 would be zero
(0.0). There are ways to do the discriminant analysis (whether two-group
or multigroup) that correct for the number of parameters (independent
variables), and there are no doubt ways to do so in the tree software
problem you're concerned with as well.
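For concreteness, that arithmetic can be checked in a few lines of R
(a sketch of the standard adjusted-R^2 formula, not code from the
original message):

  n <- 1000; p <- 900
  R2 <- p / (n - 1)                              # expected R^2 regressing labels on pure noise: 900/999 = .9009
  adjR2 <- 1 - (1 - R2) * (n - 1) / (n - p - 1)  # standard adjustment for the number of predictors
  c(R2 = R2, adjusted = adjR2)                   # .9009 and exactly 0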
Doug Carroll
At 10:18 AM 7/2/2007, William Shannon wrote:
Here is a simple experiment that can be done easily in R.
1. Simulate a dataset consisting of 1,000 data points and 900
covariates, where each covariate value comes from a normal(0,1) (or
any other distribution) -- everything independent of everything else.
2. Assign the first 500 data points to group 1 and the second 500
data points to group 2 (an arbitrary labeling, since the data are
pure noise).
3. Fit your favorite discriminator to predict these two groups and
see how well you can do with random data.
4. After identifying the best-fitting model, remove the covariates it
selected and redo the analysis.
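A minimal R sketch of steps 1-3 (the choice of MASS::lda as the
discriminator and all variable names are mine, not part of the
original recipe):

  set.seed(1)                               # for reproducibility
  n <- 1000; p <- 900
  X <- matrix(rnorm(n * p), nrow = n)       # step 1: independent N(0,1) covariates
  group <- factor(rep(1:2, each = n / 2))   # step 2: arbitrary group labels
  library(MASS)
  fit <- lda(X, grouping = group)           # step 3: linear discriminant on pure noise
  mean(predict(fit, X)$class == group)      # resubstitution accuracy; expect it to be near 1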
I predict you will be able to discriminate the two groups well
through several iterations of this procedure. If we can discriminate
well with pure noise, then we should be cautious about claiming that,
in the real problem, the discrimination is real and not noise.
Bill
Peter Flom <[EMAIL PROTECTED]> wrote:
William Shannon wrote
<<<
I am unaware of SPINA and am downloading party now to look into that
software. I generally have used rpart (because Salford is so
expensive) but have never dealt with this many variables with rpart.
>>>
party is very cool. Hothorn has a couple of papers where he goes into
the theory. The essential idea is to provide significance testing for
the splits in a tree.
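For example (a sketch of my own, reusing the simulated noise data
from the R sketch above; ctree in party is where Hothorn's
conditional inference trees are implemented):

  library(party)
  d <- data.frame(group = group, X)   # pure-noise data from the sketch above
  ct <- ctree(group ~ ., data = d)    # tree with multiplicity-adjusted permutation tests at each split
  ct                                  # on pure noise this should usually refuse to split at all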
<<<
Do you have any way to reduce the number of covariates before
partitioning? I would be concerned about the curse of
dimensionality with 900 variables and 1,000 data points. It would
be very easy to find excellent classifiers based on noise. Some
suggest that a split data set (train on one subset randomly selected
from the 1,000 data points and test on the remainder) overcomes
this. However, if X discriminates well by chance, due to the curse
of dimensionality, then it will discriminate well in both the
training and test data sets.
Can you reduce the 900 covariates by PCA, or perhaps use an upfront
stepwise linear discriminant analysis with a high P value threshold
to retain covariates (say, p = .2)? We have a paper where we
proposed and tested a genetic algorithm to reduce the number of
variables in microarray data, which I can send you in a couple of
weeks when I get back to St. Louis. It is being published in
September in the Interface Proceedings.
>>>
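(One crude way to implement the kind of upfront screen Bill suggests --
a sketch of my own using per-covariate t-tests rather than a true
stepwise LDA, reusing X and group from the simulation sketch above:)

  pvals <- apply(X, 2, function(x) t.test(x ~ group)$p.value)  # one two-sample t-test per covariate
  X.reduced <- X[, pvals < 0.2]   # keep covariates clearing the liberal p = .2 threshold
  ncol(X.reduced)                 # with pure noise, expect roughly 20% (about 180) to survive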
We can reduce the number of variables to about 500 relatively
easily. Further reduction is hard. We don't want to use principal
components because our goal is to get a method that uses relatively
few of the independent variables, and PCA makes linear combinations
of all the variables.
I am not sure I follow your point about a variable discriminating
well due to the curse of dimensionality even on the test data. I
had been in the 'some suggest' camp, which, on intuition, feels
right. But if it's not right, that would be good to know.
Thanks for your help, and I look forward to reading your paper.
Peter L. Flom, PhD
Brainscope, Inc.
212 263 7863 (MTW)
212 845 4485 (Th)
917 488 7176 (F)
######################################################################
# J. Douglas Carroll, Board of Governors Professor of Management and #
# Psychology, Rutgers University, Graduate School of Management,     #
# Marketing Dept., MEC125, 111 Washington Street, Newark, New Jersey #
# 07102-3027. Tel.: (973) 353-5814, Fax: (973) 353-5376.             #
# Home: 14 Forest Drive, Warren, New Jersey 07059-5802.              #
# Home Phone: (908) 753-6441 or 753-1620, Home Fax: (908) 757-1086.  #
# E-mail: [EMAIL PROTECTED]                                          #
######################################################################
----------------------------------------------
CLASS-L list.
Instructions: http://www.classification-society.org/csna/lists.html#class-l