Very impressive recovery of the model.

Peter Flom <[EMAIL PROTECTED]> wrote (RE: Tree software):

With signal = 1, party split only on SIGNAL, eventually reaching four terminal nodes, with splits at -0.237, 0.602, and 1.29. The proportion correctly classified was about 0.8 for the extreme nodes and 0.6 for the middle nodes.
 
With signal = 0.5, it made only one split, on SIGNAL at 0.732, with about 55% to 60% correctly classified.
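
One way to reproduce these per-node figures, assuming the ctree fit is stored in treetest and the data frame (with SIGNAL added) is treetestdata as in the code later in the thread (a sketch, not from the original message):

  nodes <- where(treetest)   # terminal node ID for each observation
  # proportion correctly classified within each terminal node
  tapply(predict(treetest) == treetestdata$GROUP, nodes, mean)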
 
 Peter L. Flom, PhD
 Brainscope, Inc.
 212 263 7863 (MTW)
 212 845 4485 (Th)
 917 488 7176 (F)
 
 
 
 
 -----Original Message-----
From: William Shannon [mailto:[EMAIL PROTECTED]]
 Sent: Tue 7/3/2007 6:18 AM
 To: Classification, clustering, and phylogeny estimation
 Cc: Peter Flom
 Subject: Re: Tree software
 
 That is surprising.  I generated the same data as you did and ran
 
library(rpart)

# fit a classification tree to the 900-covariate noise data
a <- rpart(as.factor(GROUP) ~ ., data = treetestdata)
 
 and obtained a tree with 24 terminal nodes.
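
A quick way to confirm the tree size, assuming the fit is stored in a as above (a sketch; rpart marks leaves as "<leaf>" in the frame component):

  sum(a$frame$var == "<leaf>")  # number of terminal nodes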
 
Add a true signal variable to your data and let us know how party does. This can be done by adding the third line below to your data-generation code:
 
 library(MASS)  # for mvrnorm

 # 1,000 observations on 900 independent N(0,1) covariates
 treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900), Sigma = diag(900)))
 # first 500 rows are labeled G1, the last 500 G2
 treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)
 # the new third line: a covariate whose mean differs by 1 between the groups
 treetestdata$SIGNAL <- c(rnorm(500, mean = 0), rnorm(500, mean = 1))
 
With mean = 1, rpart split on SIGNAL first and then on other (noise) variables; with mean = 0.5, rpart split on noise variables first and reached SIGNAL only eventually.
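
To rerun the comparison at other signal strengths, the whole experiment can be wrapped in a function of the group-mean shift (a sketch; the function name and defaults are mine, not from the thread):

  library(MASS)   # for mvrnorm
  library(rpart)

  run_experiment <- function(shift, n = 1000, p = 900) {
    dat <- as.data.frame(mvrnorm(n = n, mu = rep(0, p), Sigma = diag(p)))
    dat$GROUP  <- rep(c("G1", "G2"), each = n / 2)
    dat$SIGNAL <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = shift))
    rpart(as.factor(GROUP) ~ ., data = dat)
  }

  fit1  <- run_experiment(shift = 1)    # SIGNAL tends to be the first split
  fit05 <- run_experiment(shift = 0.5)  # noise variables tend to win the early splits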
 
 Bill
 
Peter Flom <[EMAIL PROTECTED]> wrote (RE: Tree software):

William Shannon wrote:
<<<
Do you have any way to reduce the number of covariates before partitioning? I would be concerned about the curse of dimensionality with 900 variables and 1,000 data points. It would be very easy to find excellent classifiers based on noise. Some suggest that a split data set (train on one subset randomly selected from the 1,000 data points and test on the remaining) overcomes this. However, if a covariate X discriminates well by chance, due to the curse of dimensionality, then it will discriminate well in both the training and test data sets.
>>>
 
and suggested the following experiment:
<<<
1. Simulate a dataset consisting of 1,000 data points and 900 covariates, where each covariate value comes from a normal(0,1) (or any other distribution) -- everything independent of everything else.

2. Randomly assign the first 500 data points to group 1 and the second 500 data points to group 2.

3. Fit your favorite discriminator to predict these two groups and see how well you can do with random data.

4. After identifying the best-fitting model, remove the covariates it used and redo the analysis (a sketch of this step follows the quote).
>>>
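
For step 4, one way to pull out the covariates a fitted tree actually split on, drop them, and refit, using the rpart fit a from earlier in the thread for concreteness (a sketch; the variable names are mine):

  used <- setdiff(unique(as.character(a$frame$var)), "<leaf>")  # covariates the tree split on
  redo <- treetestdata[, !(names(treetestdata) %in% used)]      # drop them; GROUP is kept
  a2   <- rpart(as.factor(GROUP) ~ ., data = redo)              # refit on what remains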
 
  I did this.
 
  Using party, there were no viable splits in the original data.
 
  Here is my code:
 
 
<<<
library(MASS)   # for mvrnorm
library(party)  # for ctree

treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900), Sigma = diag(900)))
treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)
treetest <- ctree(as.factor(GROUP) ~ ., data = treetestdata)

plot(treetest)
>>>
 
The result was a single node, with 500 subjects in each of the two groups.
 
  There was, thus, no way to do steps 3 or 4.
 
Now, it's true that these variables are uncorrelated, while mine in real life are correlated. I can play around with that a little bit, but I don't have time to do so right now. If others are interested in playing around with this structure, I'd appreciate seeing any results.
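
For the correlated case, one simple starting point is an exchangeable (compound-symmetry) covariance in place of diag(900) (a sketch; the choice of rho = 0.3 is mine):

  library(MASS)
  p   <- 900
  rho <- 0.3                   # assumed common pairwise correlation
  Sigma <- matrix(rho, p, p)   # exchangeable covariance matrix
  diag(Sigma) <- 1
  treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, p), Sigma = Sigma))
  treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)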
 
 
  

----------------------------------------------
CLASS-L list.
Instructions: http://www.classification-society.org/csna/lists.html#class-l
