Here is a simple experiment that can be done easily in R.

1. Simulate a dataset consisting of 1,000 data points and 900 covariates, where 
each covariate value is drawn from a normal(0,1) (or any other distribution) -- 
everything independent of everything else.

2. Assign the first 500 data points to group 1 and the second 500 data points 
to group 2 (with pure-noise data, this fixed split is as good as a random 
assignment).

3. Fit your favorite discriminator to predict these two groups and see how well 
you can do with random data.

4. After identifying the best-fitting model, remove the covariates it selected 
and redo the analysis (a sketch of the full procedure in R follows this list).
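
Here is a minimal sketch of the procedure in R, using rpart as the 
discriminator (any classifier would do; rpart is just the one mentioned later 
in this thread):

library(rpart)

set.seed(1)
n <- 1000   # data points
p <- 900    # covariates

## step 1: pure N(0,1) noise, everything independent
x   <- as.data.frame(matrix(rnorm(n * p), nrow = n))
## step 2: first 500 points in group 1, second 500 in group 2
grp <- factor(rep(1:2, each = n / 2))

for (iter in 1:5) {
  ## step 3: fit the discriminator and check the apparent accuracy
  dat  <- data.frame(grp = grp, x)
  fit  <- rpart(grp ~ ., data = dat, method = "class")
  pred <- predict(fit, dat, type = "class")
  used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
  cat("iteration", iter,
      "- apparent accuracy:", mean(pred == grp),
      "- covariates used:", length(used), "\n")
  ## step 4: remove the selected covariates and redo the analysis
  x <- x[, setdiff(names(x), used), drop = FALSE]
}

If the prediction below holds, the apparent accuracy stays well above 50% on 
every pass, even though every covariate is noise.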


I predict you will be able to discriminate the two groups well through several 
iterations of this procedure.  If we can discriminate well with pure noise, 
then we should be cautious about claiming that, in the real problem, the 
discrimination is real and not noise.

Bill
 

Peter Flom <[EMAIL PROTECTED]> wrote:

RE: Tree software

William Shannon wrote:
 <<<
 I am unaware of SPINA and am downloading party now to look into that software. 
 I generally have used rpart (because Salford is so expensive) but have never 
dealt with this many variables with rpart.
 >>>
 
 party is very cool.  Hothorn has a couple of papers where he gets into the 
theory.  The essential idea is to try to provide significance testing for trees.
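 
 For concreteness, a basic call looks like this (a sketch with placeholder 
names for the data; ctree builds a conditional inference tree in which each 
split must pass a permutation test):
 
library(party)

## splitting stops when no covariate is significantly associated with
## the response at level 1 - mincriterion
fit <- ctree(grp ~ ., data = dat,
             controls = ctree_control(mincriterion = 0.95))
print(fit)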
 
 <<<
 Do you have any way to reduce the number of covariates before partitioning?  I 
would be concerned about the curse of dimensionality with 900 variables and 
1,000 data points.  It would be very easy to find excellent classifiers based 
on noise.  Some suggest that a split data set (train on one subset randomly 
selected from the 1,000 data points and test on the remaining) overcomes this.  
However, if X discriminates well by chance, due to the curse of dimensionality, 
then it will discriminate well in both the training and test data sets, because 
the chance separation holds across all 1,000 points from which both subsets are 
drawn.
 
 Can you reduce the 900 covariates by PCA, or perhaps use an upfront stepwise 
linear discriminant analysis with a high p-value threshold for retaining 
covariates (say p = .2)?  We have a paper where we proposed and tested a 
genetic algorithm to reduce the number of variables in microarray data, which I 
can send you in a couple of weeks when I get back to St. Louis.  It is being 
published in September in the Interface Proceedings.
 >>>
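 
 A minimal sketch of that screening idea in R, substituting simple 
per-covariate two-sample t-tests for the stepwise LDA step described above 
(reusing the simulated x and grp from the earlier sketch, or any covariate 
data frame with a two-level grouping factor):
 
## keep covariates whose t-test p-value clears a liberal threshold
## (p = .2 as suggested above)
pvals <- apply(x, 2, function(v) t.test(v ~ grp)$p.value)
x.red <- x[, pvals < 0.2, drop = FALSE]
 
 Note that even this liberal screen would pass about 0.2 * 900 = 180 pure-noise 
covariates by chance alone.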
 
 We can reduce the number of variables to about 500 relatively easily.  Further 
reduction is hard.  We don't want to use principal components because our goal 
is to get a method that uses relatively few of the independent variables, and 
PCA makes linear combinations of all the variables. 
 
 I am not sure I follow your point about a variable discriminating well due to 
the curse of dimensionality even on the test data.  I had been in the 'some 
suggest' camp, which, on intuition, feels right.  But if it's not right, that 
would be good to know.
 
 Thanks for your help, and I look forward to reading your paper.
 
 
  Peter L. Flom, PhD
  Brainscope, Inc.
  212 263 7863 (MTW)
  212 845 4485 (Th)
  917 488 7176 (F)
 
 
  
  

----------------------------------------------
CLASS-L list.
Instructions: http://www.classification-society.org/csna/lists.html#class-l
