At 09:55 AM 7/3/2007, William Shannon wrote:
Very impressive on recovering the model.
Peter Flom <[EMAIL PROTECTED]> wrote:
With signal = 1, party split only on SIGNAL, eventually reaching 4
terminal nodes, with splits at -0.237, 0.602, and 1.29. The proportion
correctly placed was about 0.8 for the extreme nodes and 0.6 for the
middle nodes. With signal = 0.5, it made only one split, on SIGNAL at
0.732, with about 55% to 60% correctly placed.
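For anyone who wants to reproduce these runs, here is a hedged R sketch combining Bill's data-generation code (quoted below) with a party fit; the set.seed call and the confusion-table accuracy line are illustrative additions, not part of the original runs, and exact split points will vary with the seed:

```r
library(MASS)   # provides mvrnorm
library(party)  # provides ctree

set.seed(1)  # illustrative; the original posts did not fix a seed
# 1,000 points x 900 independent N(0,1) noise covariates
treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900),
    Sigma = diag(900)))
treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)
# one true signal variable: group means 0 vs 1 (use 0.5 for the weak case)
treetestdata$SIGNAL <- c(rnorm(500, mean = 0), rnorm(500, mean = 1))

treetest <- ctree(as.factor(GROUP) ~ ., data = treetestdata)
plot(treetest)  # inspect which variables are split on, and where

# overall proportion correctly placed (resubstitution)
tab <- table(predicted = predict(treetest), actual = treetestdata$GROUP)
sum(diag(tab)) / sum(tab)
```

The per-node rates quoted above come from reading the class proportions off the plotted terminal nodes rather than from the overall table.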
Peter L. Flom, PhD
Brainscope, Inc.
212 263 7863 (MTW)
212 845 4485 (Th)
917 488 7176 (F)
-----Original Message-----
From: William Shannon [mailto:[EMAIL PROTECTED]]
Sent: Tue 7/3/2007 6:18 AM
To: Classification, clustering, and phylogeny estimation
Cc: Peter Flom
Subject: Re: Tree software
That is surprising. I generated the same data as you did and ran
library(rpart)
a <- rpart(as.factor(GROUP) ~ ., data = treetestdata)
and obtained a tree with 24 terminal nodes.
Add a true signal variable to your data and let us know how party
does. This can be done by adding the third line to your data generation code:
library(MASS)  # provides mvrnorm
treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900),
    Sigma = diag(900)))
treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)
treetestdata$SIGNAL <- c(rnorm(500, mean = 0), rnorm(500, mean = 1))
For mean = 1, rpart split on SIGNAL first, followed by other variables;
for mean = 0.5, rpart split on other variables first and reached
SIGNAL only eventually.
Bill
Peter Flom <[EMAIL PROTECTED]> wrote (RE: Tree software):
William Shannon wrote
<<<
Do you have any way to reduce the number of covariates before
partitioning? I would be concerned about the curse of
dimensionality with 900 variables and 1,000 data points. It would
be very easy to find excellent classifiers based on noise. Some
suggest that a split data set (train on one subset randomly
selected from the 1,000 data points and test on the remainder)
overcomes this. However, if X discriminates well by chance, due to
the curse of dimensionality, then it will discriminate well in
both the training and test data sets.
>>>
and suggested the following experiment:
<<<
1. Simulate a dataset consisting of 1,000 data points and 900
covariates, where each covariate value comes from a normal(0,1) (or
any other distribution) -- everything independent of everything else.
2. Randomly assign the first 500 data points to group 1 and the
second 500 data points to group 2.
3. Fit your favorite discriminator to predict these two groups and
see how well you can do with random data.
4. After identifying the best-fitting model, remove those
covariates and redo the analysis.
>>>
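The four steps above can be sketched in R as follows; this is a minimal illustration assuming rpart as the discriminator (as in Bill's run), with an added set.seed and an added resubstitution-accuracy line. The point is that the apparent accuracy on pure noise will be well above the 50% chance level, because the tree overfits:

```r
library(MASS)   # provides mvrnorm
library(rpart)  # recursive partitioning

set.seed(1)  # illustrative; results vary with the seed
# Step 1: 1,000 data points, 900 independent N(0,1) covariates
noisedata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900),
    Sigma = diag(900)))
# Step 2: group labels unrelated to the covariates
noisedata$GROUP <- factor(rep(c("G1", "G2"), each = 500))
# Step 3: fit a classifier to pure noise
fit <- rpart(GROUP ~ ., data = noisedata)
# apparent (resubstitution) accuracy -- inflated, since GROUP is random
mean(predict(fit, type = "class") == noisedata$GROUP)
```

Step 4 would then refit after dropping the variables the first tree split on; with independent noise, other variables will simply take their place.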
I did this.
Using party, there were no viable splits in the original data.
Here is my code:
<<<
library(MASS)   # provides mvrnorm
library(party)  # provides ctree
treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900),
    Sigma = diag(900)))
treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)
treetest <- ctree(as.factor(GROUP) ~ ., data = treetestdata)
plot(treetest)
>>>
The result was a single node, with 500 subjects in each of the two groups.
There was thus no way to do steps 3 or 4.
Now, it's true that these variables are uncorrelated, while mine in
real life are correlated. I can play around with that a little bit,
but I don't have time to do so right now. If others are interested in
playing around with this structure, I'd appreciate seeing any results.
---------------------------------------------------------------------------------------------------------------------------------------
This last result is just what you'd expect based on my argument, sent
in an earlier e-mail, regarding two-group discriminant analysis. I
presume that about 90.09% (+ or - a few percentage points) of the
subjects were correctly classified! Doug Carroll.
---------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------
CLASS-L list.
Instructions: http://www.classification-society.org/csna/lists.html#class-l
######################################################################
# J. Douglas Carroll, Board of Governors Professor of Management and #
#Psychology, Rutgers University, Graduate School of Management, #
#Marketing Dept., MEC125, 111 Washington Street, Newark, New Jersey #
#07102-3027. Tel.: (973) 353-5814, Fax: (973) 353-5376. #
# Home: 14 Forest Drive, Warren, New Jersey 07059-5802. #
# Home Phone: (908) 753-6441 or 753-1620, Home Fax: (908) 757-1086. #
# E-mail: [EMAIL PROTECTED] #
######################################################################