Following Bill's suggestion, I created a 1,000-case, 900-predictor file
in SPSS.
I then ran TREE using CHAID, which is one of the tree methods available
in SPSS.
To see the algorithms
go to www.spss.com/support
click <login to online tech support> (left pane)
click <login> (main pane)
log in as "guest", password "guest", and click "ok"
click <statistics documentation> (left pane)
click <algorithms> (main pane)
There are 7 links that start with "tree".
It does seem like a lot of predictors. Depending on the nature of your
problem, you might try some form of factor analysis (PCA or PFA?) to see
if you can create factor scores or summative scales.
With a not-very-powerful PC under XP, the syntax below the sig block
runs in about 3 to 4 minutes for the tree and in 35-45 seconds for the DFA.
Art Kendall
Social Research Consultants
new file.
* this program creates 1000 cases with 900 standard-normal predictors.
set seed = 5743269.
input program.
vector predictor (900,f6.2).
loop id = 1 to 1000.
loop #k = 1 to 900.
compute predictor(#k) = rv.normal(0,1).
end loop.
end case.
end loop.
end file.
end input program.
formats id (f4).
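* split the 1000 cases into two arbitrary groups of 500.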
recode id (1 thru 500 =1)(501 thru 1000=2) into group.
formats group (f1).
var level group (nominal).
* Classification Tree.
TREE group [n] BY predictor1 to predictor900 [s]
/TREE
DISPLAY=TOPDOWN
NODES=STATISTICS
BRANCHSTATISTICS=YES
NODEDEFS=YES
SCALE=AUTO
/DEPCATEGORIES
USEVALUES=[VALID]
/PRINT
MODELSUMMARY
CLASSIFICATION
RISK
/RULES
NODES=TERMINAL
SYNTAX=SPSS
TYPE=SCORING
/METHOD
TYPE=CHAID
/GROWTHLIMIT
MAXDEPTH=10
MINPARENTSIZE=100
MINCHILDSIZE=50
/VALIDATION
TYPE=NONE
OUTPUT=BOTHSAMPLES
/CHAID
ALPHASPLIT=0.05
ALPHAMERGE=0.05
SPLITMERGED=NO
CHISQUARE=PEARSON
CONVERGE=0.001
MAXITERATIONS=100
ADJUST=BONFERRONI
INTERVALS=10.
set workspace 61480.
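* stepwise discriminant function analysis (Wilks' lambda) on the same 900 predictors.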
DISCRIMINANT
/GROUPS=group(1 2)
/VARIABLES=predictor1 to predictor900
/ANALYSIS ALL
/METHOD=WILKS
/FIN= 3.84
/FOUT= 2.71
/PRIORS EQUAL
/HISTORY
/CLASSIFY=NONMISSING POOLED .
William Shannon wrote:
Here is a simple experiment that can be done easily in R.
1. Simulate a dataset consisting of 1,000 data points and 900
covariates, where each covariate value comes from a normal(0,1) (or any
other distribution) -- everything independent of everything else.
2. Randomly assign the first 500 data points to group 1 and the second
500 data points to group 2.
3. Fit your favorite discriminator to predict these two groups and see
how well you can do with random data.
4. After identifying the best-fitting model, remove the covariates it
selected and redo the analysis.
I predict you will be able to discriminate the two groups well through
several iterations of this procedure. If we can discriminate well
with noise, then we should be cautious about saying that in the real
problem the discrimination is real and not just noise.
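For concreteness, a rough R sketch of steps 1-3 might look like the
following. It uses rpart (mentioned later in this thread) as the
"favorite discriminator"; that choice, and the variable names, are mine,
so treat it as an illustration of the experiment rather than Bill's
actual code.

library(rpart)

set.seed(5743269)
n <- 1000; p <- 900

# step 1: 1,000 cases by 900 independent N(0,1) covariates
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste("predictor", 1:p, sep = "")

# step 2: first 500 cases to group 1, second 500 to group 2
dat <- data.frame(group = factor(rep(1:2, each = n / 2)), x)

# step 3: fit a tree to the pure-noise data and check the apparent
# (resubstitution) accuracy
fit  <- rpart(group ~ ., data = dat, method = "class")
pred <- predict(fit, dat, type = "class")
mean(pred == dat$group)   # apparent accuracy on the training data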
Bill
Peter Flom <[EMAIL PROTECTED]> wrote:
William Shannon wrote
<<<
I am unaware of SPINA and am downloading party now to look into
that software. I generally have used rpart (because Salford is so
expensive) but have never dealt with this many variables with rpart.
>>>
party is very cool. Hothorn has a couple papers where he gets
into the theory. The essential idea is to try to provide
significance testing for trees.
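(For anyone who wants to try it, a minimal ctree() call looks roughly
like the sketch below; it assumes a data frame dat whose first column is
a two-level factor group, for example the simulated noise data sketched
earlier.)

library(party)

# each split must pass a permutation test; mincriterion = 0.95
# corresponds to requiring p < .05 at every split
fit <- ctree(group ~ ., data = dat,
             controls = ctree_control(mincriterion = 0.95))
plot(fit)   # nodes display the split statistics / p-values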
<<<
Do you have any way to reduce the number of covariates before
partitioning? I would be concerned about the curse of
dimensionality with 900 variables and 1,000 data points. It would
be very easy to find excellent classifiers based on noise. Some
suggest that a split data set (train on one subset randomly
selected from the 1,000 data points and test on the remaining)
overcomes this. However, if X by chance, due to the curse of
dimensionality, discriminates well, then it will discriminate well
in both the training and test data sets.
Can you reduce the 900 covariates by PCA, or perhaps use an upfront
stepwise linear discriminant analysis with a high p-value threshold
for retaining covariates (say p = .2)? We have a paper
where we proposed and tested a genetic algorithm to reduce the
number of variables in microarray data that I can send you in a
couple of weeks when I get back to St. Louis. It is being
published in Sept. in the Interface Proceedings.
>>>
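(A rough R sketch of that kind of upfront screen, using univariate
t-tests as a simple stand-in for the stepwise LDA step: the
simplification and the variable names are mine, not Bill's, and a data
frame dat with the factor group in its first column is assumed.)

library(MASS)

# keep only covariates whose univariate t-test against the group
# passes the loose threshold p < .2
pvals <- sapply(dat[-1], function(v) t.test(v ~ dat$group)$p.value)
keep  <- names(pvals)[pvals < 0.20]

# refit a discriminator on the reduced covariate set
dat2    <- dat[, c("group", keep)]
fit.lda <- lda(group ~ ., data = dat2)
mean(predict(fit.lda)$class == dat2$group)   # apparent accuracy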
We can reduce the number of variables to about 500 relatively
easily. Further reduction is hard. We don't want to use
principal components because our goal is to get a method that uses
relatively few of the independent variables, and PCA makes linear
combinations of all the variables.
I am not sure I follow your point about a variable discriminating
well due to the curse of dimensionality even on the test data. I
had been in the 'some suggest' camp, which, on intuition, feels
right. But if it's not right, that would be good to know.
Thanks for your help, and I look forward to reading your paper.
Peter L. Flom, PhD
Brainscope, Inc.
212 263 7863 (MTW)
212 845 4485 (Th)
917 488 7176 (F)
----------------------------------------------
CLASS-L list.
Instructions: http://www.classification-society.org/csna/lists.html#class-l