Thanks Doug. That is very helpful.
I view this problem geometrically: the 1,000 data points (2 groups) are
distributed in 900-dimensional space, where the chance of finding a separating
hyperplane in random data is huge. In this sense any discrimination procedure
will almost surely find a separation.
Two issues:
1. In molecular data (e.g., microarrays) the goal is not necessarily predictive
ability but rather gene (variable) selection (though many people analyzing this
data don't distinguish these two activities).
Doug, how do we select the handful of covariates in a large-P-small-N data
problem like the one described here?
(Peter, you may not be interested in covariate selection since your data will
always consist of the 900 measurements.)
2. As reported by Peter, the software 'party' did amazingly well at not
fitting noise and at selecting the signal (I am out of town, so I have not had
a chance to look into this yet). Any thoughts?
Torsten -- I included you since we are discussing party a bit on the CLASS-L
list server.
Bill Shannon
314-704-8725
"J. Douglas Carroll" <[EMAIL PROTECTED]> wrote: Bill's experiment should
yield a very nearly perfect discrimination in one iteration of the process he
describes.
It's well known that the two-group discriminant analysis problem he defines is
equivalent to multiple linear regression predicting a single dependent variable
(a group indicator) with, in this case, 900 independent variables. The
resulting R^2 (R-squared) will have an expected value of 900/999 = .9009, close
to 1.0, which would translate into a (nominally) near-perfect discriminant
analysis. What needs to be done is to correct the R^2 for the number of
predictors (i.e., use the adjusted R^2) -- in which case, under the
circumstances described, the expected ADJUSTED R^2 would be zero (0.0). There
are ways to do the discriminant analysis (whether two-group or multigroup) that
correct for the number of parameters (independent variables), and there are no
doubt ways to do so in the tree software problem you're concerned with as well.
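For concreteness, a minimal sketch of this in base R (illustrative only; it
just plugs in the n = 1,000 and p = 900 from the experiment):

set.seed(1)
n <- 1000; p <- 900
X <- matrix(rnorm(n * p), nrow = n)   # 900 pure-noise predictors
y <- rep(0:1, each = n / 2)           # arbitrary 0/1 group indicator
fit <- lm(y ~ X)
summary(fit)$r.squared                # roughly 900/999 = .90
summary(fit)$adj.r.squared            # roughly 0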
Doug Carroll
At 10:18 AM 7/2/2007, William Shannon wrote:
Here is a simple experiment that can be done easily in R.
1. Simulate a dataset consisting of 1,000 data points and 900 covariates, where
each covariate value comes from a normal(0,1) (or any other distribution) --
everything independent of everything else.
2. Randomly assign the first 500 data points to group 1 and the second 500
data points to group 2.
3. Fit your favorite discriminator to predict these two groups and see how
well you can do with random data.
4. After identifying the best-fitting model, remove the covariates it uses and
redo the analysis (see the R sketch below).
I predict you will be able to discriminate the two groups well through several
iterations of this procedure. If we can discriminate well with pure noise, then
we should be cautious about claiming that, in the real problem, the
discrimination reflects signal rather than noise.
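A minimal sketch of steps 1-4 in R, assuming the rpart package is installed and
taking a classification tree as the "favorite discriminator" (any classifier
could be substituted):

library(rpart)

set.seed(2)
n <- 1000; p <- 900
dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))   # step 1: pure noise
dat$group <- factor(rep(c(1, 2), each = n / 2))        # step 2: arbitrary labels

# step 3: fit a tree and look at resubstitution (in-sample) accuracy
fit <- rpart(group ~ ., data = dat, method = "class")
mean(predict(fit, type = "class") == dat$group)        # typically well above 0.5

# step 4 (one iteration): drop the covariates the tree used and refit
used <- setdiff(as.character(fit$frame$var), "<leaf>")
dat2 <- dat[, setdiff(names(dat), used)]
fit2 <- rpart(group ~ ., data = dat2, method = "class")
mean(predict(fit2, type = "class") == dat2$group)

Repeating step 4 should keep turning up apparently good classifiers, which is
the point: resubstitution accuracy on noise is not evidence of signal.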
Bill
Peter Flom <[EMAIL PROTECTED]> wrote:
William Shannon wrote
<<<
I am unaware of SPINA and am downloading party now to look into that
software. I generally have used rpart (because Salford is so expensive) but
have never dealt with this many variables with rpart.
>>>
party is very cool. Hothorn has a couple of papers where he gets into the
theory. The essential idea is to try to provide significance testing for trees.
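For example, a minimal self-contained sketch (assuming the party package is
installed): ctree splits only where a permutation-test p-value survives a
multiplicity adjustment, so on pure noise it will usually refuse to split at
all.

library(party)

set.seed(3)
d <- data.frame(matrix(rnorm(200 * 20), nrow = 200))   # 20 pure-noise covariates
d$y <- factor(rep(c("a", "b"), each = 100))            # arbitrary group labels
ctree(y ~ ., data = d)   # usually prints a single root node: no spurious splits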
<<<
Do you have any way to reduce the number of covariates before partitioning?
I would be concerned about the curse of dimensionality with 900 variables and
1,000 data points. It would be very easy to find excellent classifiers based
on noise. Some suggest that a split data set (train on one subset randomly
selected from the 1,000 data points and test on the remaining) overcomes this.
However, if a covariate X discriminates well by chance (due to the curse of
dimensionality), then it will discriminate well in both the training and test
data sets.
Can you reduce the 900 covariates by PCA, or perhaps use an upfront stepwise
linear discriminant analysis with a high p-value threshold for retaining a
covariate (say p = .2)? We have a paper where we proposed and tested a genetic
algorithm to reduce the number of variables in microarray data that I can send
you in a couple of weeks when I get back to St. Louis. It is being published
in Sept. in the Interface Proceedings.
>>>
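A minimal sketch of the kind of liberal upfront screen suggested above, using a
per-covariate two-sample t-test in place of stepwise LDA (the p < .2 cutoff is
from the suggestion; everything else is illustrative):

set.seed(4)
n <- 1000; p <- 900
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))      # pure-noise covariates
group <- factor(rep(c(1, 2), each = n / 2))
pvals <- sapply(X, function(x) t.test(x ~ group)$p.value)
keep <- names(pvals)[pvals < 0.2]
length(keep)   # on pure noise, roughly 0.2 * 900 = 180 covariates still survive

Note that on pure noise such a screen still passes roughly 20% of the
covariates, which is exactly the concern raised earlier in the thread.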
We can reduce the number of variables to about 500 relatively easily.
Further reduction is hard. We don't want to use principal components because
our goal is to get a method that uses relatively few of the independent
variables, and PCA makes linear combinations of all the variables.
I am not sure I follow your point about a variable discriminating well due
to the curse of dimensionality even on the test data. I had been in the 'some
suggest' camp, which, on intuition, feels right. But if it's not right, that
would be good to know.
Thanks for your help, and I look forward to reading your paper.
Peter L. Flom, PhD
Brainscope, Inc.
212 263 7863 (MTW)
212 845 4485 (Th)
917 488 7176 (F)
######################################################################
# J. Douglas Carroll, Board of Governors Professor of Management and #
# Psychology, Rutgers University, Graduate School of Management, #
# Marketing Dept., MEC125, 111 Washington Street, Newark, New Jersey #
# 07102-3027. Tel.: (973) 353-5814, Fax: (973) 353-5376. #
# Home: 14 Forest Drive, Warren, New Jersey 07059-5802. #
# Home Phone: (908) 753-6441 or 753-1620, Home Fax: (908) 757-1086. #
# E-mail: [EMAIL PROTECTED] #
######################################################################
----------------------------------------------
CLASS-L list. Instructions:
http://www.classification-society.org/csna/lists.html#class-l