Following Bill's suggestion, I created a 1,000-case, 900-predictor file in SPSS. I then ran TREE with CHAID, which is one of the tree-growing methods available in SPSS.

To see the algorithms:
go to www.spss.com/support
click <login to online tech support> (left pane)
click <login> (main pane)
log in as "guest" with password "guest", then click "ok"
click <statistics documentation> (left pane)
click <algorithms> (main pane)
There are 7 links that start with "tree".

It does seem like a lot of predictors. Depending on the nature of your problem, you might try some form of factor analysis (PCA or PFA?) to see if you can create factor scores or summative scales.
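If you go the PCA route, a minimal sketch in SPSS syntax might look like the one below. It assumes the variables are named predictor1 to predictor900, as in the simulation after my sig block, and the 20 components retained is only a placeholder; the saved regression factor scores (FAC1_1, FAC2_1, ...) could then stand in for the raw predictors.

* Hypothetical data-reduction sketch: extract principal components and save regression factor scores.
FACTOR
  /VARIABLES predictor1 to predictor900
  /MISSING LISTWISE
  /PRINT INITIAL EXTRACTION
  /CRITERIA FACTORS(20) ITERATE(25)
  /EXTRACTION PC
  /ROTATION NOROTATE
  /SAVE REG(ALL)
  /METHOD=CORRELATION.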

On a not very powerful PC under XP, the syntax below the sig block runs in about 3 to 4 minutes for the tree and 35-45 seconds for the DFA.



Art Kendall
Social Research Consultants



new file.
* this program creates 1000 cases with 900 standard normal predictors .
set seed = 5743269.
input program.
vector predictor (900,f6.2).
loop id = 1 to 1000.
loop #k  = 1 to 900.
compute predictor(#k) = rv.normal(0,1).
end loop.
end case.
end loop.
end file.
end input program.
formats id (f4).
recode id (1 thru 500 =1)(501 thru 1000=2) into group.
formats group (f1).
var level group (nominal).
* Classification Tree.
TREE group [n] BY predictor1 to predictor900 [s]
/TREE
  DISPLAY=TOPDOWN
  NODES=STATISTICS
  BRANCHSTATISTICS=YES
  NODEDEFS=YES
  SCALE=AUTO
/DEPCATEGORIES
  USEVALUES=[VALID]
/PRINT
  MODELSUMMARY
  CLASSIFICATION
  RISK
/RULES
  NODES=TERMINAL
  SYNTAX=SPSS
  TYPE=SCORING
/METHOD
  TYPE=CHAID
/GROWTHLIMIT
  MAXDEPTH=10
  MINPARENTSIZE=100
  MINCHILDSIZE=50
/VALIDATION
  TYPE=NONE
  OUTPUT=BOTHSAMPLES
/CHAID
  ALPHASPLIT=0.05
  ALPHAMERGE=0.05
  SPLITMERGED=NO
  CHISQUARE=PEARSON
  CONVERGE=0.001
  MAXITERATIONS=100
  ADJUST=BONFERRONI
  INTERVALS=10.
set workspace 61480.
DISCRIMINANT
 /GROUPS=group(1 2)
 /VARIABLES=predictor1 to predictor900
 /ANALYSIS ALL
 /METHOD=WILKS
 /FIN=3.84
 /FOUT=2.71
 /PRIORS EQUAL
 /HISTORY
 /CLASSIFY=NONMISSING POOLED .
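
As a rough illustration of the split-sample question raised further down in this thread, the DFA could also be rerun with a random holdout, so that the stepwise selection sees only about half the cases and the remaining cases are only classified. This is just a sketch; the variable name train and the 50/50 split are placeholders, not part of the run timed above.

* Hypothetical split-sample check: estimate the functions on roughly half the cases.
* Cases with train = 1 enter the stepwise analysis; the other cases are held out.
compute train = (uniform(1) < .5).
formats train (f1).
DISCRIMINANT
 /GROUPS=group(1 2)
 /VARIABLES=predictor1 to predictor900
 /SELECT=train(1)
 /ANALYSIS ALL
 /METHOD=WILKS
 /FIN=3.84
 /FOUT=2.71
 /PRIORS EQUAL
 /CLASSIFY=NONMISSING POOLED .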


William Shannon wrote:
Here is a simple experiment that can be done easily in R.

1. Simulate a dataset consisting of 1,000 data points and 900 covariates, where each covariate value comes from a normal(0,1) (or any other distribution) and everything is independent of everything else.

2. Assign the first 500 data points to group 1 and the second 500 data points to group 2.

3. Fit your favorite discriminator to predict these two groups and see how well you can do with random data.

4. After identifying the best-fitting model, remove those covariates and redo the analysis.


I predict you will be able to discriminate the two groups well through several iterations of this procedure. If we can discriminate well with noise, then we should be cautious about claiming that, in the real problem, the discriminator reflects real structure rather than noise.

Bill
Peter Flom <[EMAIL PROTECTED]> wrote:

    William Shannon  wrote
    <<<
    I am unaware of SPINA and am downloading party now to look into
    that software.  I generally have used rpart (because Salford is so
    expensive) but have never dealt with this many variables with rpart.
    >>>

    party is very cool.  Hothorn has a couple of papers where he gets
    into the theory.  The essential idea is to provide significance
    testing for trees.

    <<<
    Do you have any way to reduce the number of covariates before
    partitioning?  I would be concerned about the curse of
    dimensionality with 900 variables and 1,000 data points.  It would
    be very easy to find excellent classifiers based on noise.  Some
    suggest that a split data set (train on one subset randomly
    selected from the 1,000 data points and test on the remaining)
    overcomes this.  However, if X discriminates well by chance, due
    to the curse of dimensionality, then it will discriminate well in
    both the training and test data sets.

    Can you reduce the 900 covariates by PCA, or perhaps use an upfront
    stepwise linear discriminant analysis with a high P value
    threshold to retain a covariate (say p = .2)?  We have a paper
    where we proposed and tested a genetic algorithm to reduce the
    number of variables in microarray data that I can send you in a
    couple of weeks when I get back to St. Louis.  It is being
    published in Sept. in the Interface Proceedings.
    >>>

    We can reduce the number of variables to about 500 relatively
    easily.  Further reduction is hard.  We don't want to use
    principal components because our goal is to get a method that uses
    relatively few of the independent variables, and PCA makes linear
    combinations of all the variables.

    I am not sure I follow your point about a variable discriminating
    well due to the curse of dimensionality even on the test data.  I
    had been in the 'some suggest' camp, which, on intuition, feels
    right.  But if it's not right, that would be good to know.

    Thanks for your help, and I look forward to reading your paper.


     Peter L. Flom, PhD
     Brainscope, Inc.
     212 263 7863 (MTW)
     212 845 4485 (Th)
     917 488 7176 (F)



----------------------------------------------
CLASS-L list.
Instructions: http://www.classification-society.org/csna/lists.html#class-l
