At 09:55 AM 7/3/2007, William Shannon wrote:
Very impressive on recovering the model.
Peter Flom <[EMAIL PROTECTED]> wrote:
With signal = 1, party split only on SIGNAL, eventually reaching 4
terminal nodes, with splits at -0.237, 0.602, and 1.29. The proportion
correctly placed was about 0.8 for the extreme nodes and 0.6 for the
middle nodes. With signal = 0.5, it made only one split, on SIGNAL at
0.732, with about 55% to 60% correctly placed.
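For anyone who wants to reproduce these runs, here is a hedged R sketch combining Bill's data-generation code (quoted below) with a party fit; the set.seed call and the confusion-table accuracy line are illustrative additions, not part of the original runs, and exact split points will vary with the seed:

```r
library(MASS)   # provides mvrnorm
library(party)  # provides ctree

set.seed(1)  # illustrative; the original posts did not fix a seed
# 1,000 points x 900 independent N(0,1) noise covariates
treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900),
    Sigma = diag(900)))
treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)
# one true signal variable: group means 0 vs 1 (use 0.5 for the weak case)
treetestdata$SIGNAL <- c(rnorm(500, mean = 0), rnorm(500, mean = 1))

treetest <- ctree(as.factor(GROUP) ~ ., data = treetestdata)
plot(treetest)  # inspect which variables are split on, and where

# overall proportion correctly placed (resubstitution)
tab <- table(predicted = predict(treetest), actual = treetestdata$GROUP)
sum(diag(tab)) / sum(tab)
```

The per-node rates quoted above come from reading the class proportions off the plotted terminal nodes rather than from the overall table.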
Peter L. Flom, PhD
Brainscope, Inc.
212 263 7863 (MTW)
212 845 4485 (Th)
917 488 7176 (F)
-----Original Message-----
From: William Shannon [mailto:[EMAIL PROTECTED]]
Sent: Tue 7/3/2007 6:18 AM
To: Classification, clustering, and phylogeny estimation
Cc: Peter Flom
Subject: Re: Tree software
That is surprising. I generated the same data as you did and ran
library(rpart)
a <- rpart(as.factor(GROUP) ~ ., data = treetestdata)
and obtained a tree with 24 terminal nodes.
Add a true signal variable to your data and let us know how party
does. This can be done by adding the third line to your data generation code:
library(MASS)  # provides mvrnorm
treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900),
    Sigma = diag(900)))
treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)
treetestdata$SIGNAL <- c(rnorm(500, mean = 0), rnorm(500, mean = 1))
For mean = 1, rpart split on SIGNAL first, followed by other variables;
for mean = 0.5, rpart split on other variables first and reached
SIGNAL only eventually.
Bill
Peter Flom <[EMAIL PROTECTED]> wrote (RE: Tree software):
William Shannon wrote
<<<
Do you have any way to reduce the number of covariates before
partitioning? I would be concerned about the curse of
dimensionality with 900 variables and 1,000 data points. It would
be very easy to find excellent classifiers based on noise. Some
suggest that a split data set (train on one subset randomly
selected from the 1,000 data points and test on the remainder)
overcomes this. However, if X discriminates well by chance, due to
the curse of dimensionality, then it will discriminate well in
both the training and test data sets.
>>>
and suggested the following experiment:
<<<
1. Simulate a dataset consisting of 1,000 data points and 900
covariates, where each covariate value comes from a normal(0,1) (or
any other distribution) -- everything independent of everything else.
2. Randomly assign the first 500 data points to group 1 and the
second 500 data points to group 2.
3. Fit your favorite discriminator to predict these two groups and
see how well you can do with random data.
4. After identifying the best-fitting model, remove those
covariates and redo the analysis.
>>>
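The four steps above can be sketched in R as follows; this is a minimal illustration assuming rpart as the discriminator (as in Bill's run), with an added set.seed and an added resubstitution-accuracy line. The point is that the apparent accuracy on pure noise will be well above the 50% chance level, because the tree overfits:

```r
library(MASS)   # provides mvrnorm
library(rpart)  # recursive partitioning

set.seed(1)  # illustrative; results vary with the seed
# Step 1: 1,000 data points, 900 independent N(0,1) covariates
noisedata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900),
    Sigma = diag(900)))
# Step 2: group labels unrelated to the covariates
noisedata$GROUP <- factor(rep(c("G1", "G2"), each = 500))
# Step 3: fit a classifier to pure noise
fit <- rpart(GROUP ~ ., data = noisedata)
# apparent (resubstitution) accuracy -- inflated, since GROUP is random
mean(predict(fit, type = "class") == noisedata$GROUP)
```

Step 4 would then refit after dropping the variables the first tree split on; with independent noise, other variables will simply take their place.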
I did this.
Using party, there were no viable splits in the original data.
Here is my code:
<<<
library(MASS)   # provides mvrnorm
library(party)  # provides ctree
treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900),
    Sigma = diag(900)))
treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)
treetest <- ctree(as.factor(GROUP) ~ ., data = treetestdata)
plot(treetest)
>>>
The result was a single node, with 500 subjects in each of the two groups.
There was thus no way to do steps 3 or 4.
Now, it's true that these variables are uncorrelated, while mine in
real life are correlated. I can play around with that a little bit,
but I don't have time to do so right now. If others are interested in
playing around with this structure, I'd appreciate seeing any results.
---------------------------------------------------------------------------------------------------------------------------------------
This last result is just what you'd expect based on my argument, sent
in an earlier e-mail, regarding two-group discriminant analysis. I
presume that about 90.09% (+ or - a few percentage points) of the
subjects were correctly classified! Doug Carroll.
---------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------
CLASS-L list.
Instructions: http://www.classification-society.org/csna/lists.html#class-l
######################################################################
# J. Douglas Carroll, Board of Governors Professor of Management and #
#Psychology, Rutgers University, Graduate School of Management, #
#Marketing Dept., MEC125, 111 Washington Street, Newark, New Jersey #
#07102-3027. Tel.: (973) 353-5814, Fax: (973) 353-5376. #
# Home: 14 Forest Drive, Warren, New Jersey 07059-5802. #
# Home Phone: (908) 753-6441 or 753-1620, Home Fax: (908) 757-1086. #
# E-mail: [EMAIL PROTECTED] #
######################################################################