Re: Cluster analysis, decision trees and other classification with very many variables

Carlos Alberto Estombelo Montesco Tue, 24 Apr 2007 10:10:17 -0700

Dear Peter Flom,

Some comments on your email:


- When you say "classify ... neurological problems", I think that is quite
general. There are probably characteristics of the signal that can define
which neurological problems you mean; for example, are you talking about
spikes related to schizophrenia?

- A principal component analysis of the data is an interesting idea, but
what it gives you is (orthogonal) components ordered by their variances. If
you have a good signal-to-noise ratio (SNR), I think there is no problem.
But if the SNR is low, you need to be careful: the variance of the noise can
be high compared with that of the signal of interest, so the leading
components may capture noise and you lose the characteristics of the real
signal of interest. When you then cluster, the groups can appear spread out,
which interferes with the clustering.

- When you use PCA, the components you obtain are uncorrelated but probably
not independent, because uncorrelatedness does not imply independence.

- If you have "the diagnosis of the people", why not choose the most
representative cases (a percentage of the set) and train a classification
algorithm on them, then test it on another small percentage, and finally
classify the rest of the data?
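
A minimal sketch of the train / test / classify-the-rest procedure suggested
in the last point, in Python. Everything in it is an assumption made for
illustration: the data are synthetic (`make_toy_data` is a hypothetical
stand-in for the EEG feature table), the split is random rather than a
choice of the "most representative" cases, and a nearest-centroid rule is
swapped in as the simplest possible classifier. It is not a recommendation
for the real data.

```python
import random


def make_toy_data(n_per_class=100, n_features=5, seed=0):
    """Synthetic two-group data: group 1 has every feature mean shifted by 1.5."""
    rng = random.Random(seed)
    rows = []
    for label, shift in ((0, 0.0), (1, 1.5)):
        for _ in range(n_per_class):
            features = [rng.gauss(shift, 1.0) for _ in range(n_features)]
            rows.append((features, label))
    return rows


def split_holdout(rows, train_fraction=0.7, seed=0):
    """Shuffle the cases and split them into a training part and a test part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_centroids(train):
    """Compute the mean feature vector of each diagnostic group."""
    sums, counts = {}, {}
    for features, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, features):
    """Assign a case to the group whose centroid is closest (squared distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows):
    """Fraction of cases whose predicted group matches the known diagnosis."""
    hits = sum(predict(centroids, features) == label for features, label in rows)
    return hits / len(rows)


rows = make_toy_data()
train, test = split_holdout(rows)
centroids = fit_centroids(train)        # train on one part only
test_accuracy = accuracy(centroids, test)  # evaluate on cases not used to train
print(f"train={len(train)} test={len(test)} accuracy={test_accuracy:.2f}")
```

Once the held-out accuracy looks acceptable, `predict(centroids, features)`
can be applied to the remaining cases. The important design choice is only
that the test cases are never used when fitting.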

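On the point that uncorrelated does not mean independent, a small worked
example in Python (the numbers are made up for illustration): y is
completely determined by x, yet their Pearson correlation is zero.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# x is symmetric around zero and y = x^2, so y depends entirely on x,
# but the linear association between them cancels out.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]
r = pearson(xs, ys)
print(f"correlation = {r:.3f}")
```

Correlation only measures linear association, which is why a zero value
here says nothing about the (perfect) nonlinear dependence.
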
Best Regards,

Carlos  Estombelo-Montesco


2007/4/24, Peter Flom <[EMAIL PROTECTED]>:


 Hello

(note that this is the same Peter Flom at a different address with a new
e-mail and a new job)

I have a data set with about 800 people and about 1000 variables.  The
variables are all 'features' of EEG data that have been extracted by subject
matter experts in neurology as being potentially useful. All variables have
been standardized to mean 0, sd 1. There are many high correlations among
them.

We are interested in many aspects of this data - one primary aim is to use
the EEG data to better classify people who have neurological problems.  Two
methods that seem particularly relevant to this list are clustering and
decision trees.  I've done a bit of both, but always on data sets with FAR
fewer variables (e.g. about 10 variables).  Especially with regard to
clustering, I was thinking of doing a principal components analysis prior to
the cluster analysis (perhaps with SAS PRINCOMP, FACTOR, or VARCLUS).

With regard to trees, I've done some 'basic' analysis of other data sets
using R's 'party' and 'rpart' packages.  With those data sets, however, the
main goal was explanation, and so, I did not explore bagging and boosting
and such.  Any pointers or introductions to that literature would be most
welcome (preferably at a not TOO high mathematical level - I had some
calculus many years ago, but am much more interested in applications than in
'theorem-proof' material).

I will be exploring this data set for quite some time, so am willing to
invest some effort to learn best practices, and am also willing to try a
variety of methods.

Finally, as to why I am looking at both trees and clusters - partly, we
know the diagnosis of the people (hence trees are useful) but we also know
that there are difficulties with the diagnoses, and that these difficulties
may be amenable to exploration with sophisticated methods


Thanks in advance

Peter Flom
Brainscope, Inc.

---------------------------------------------- CLASS-L list. Instructions:
http://www.classification-society.org/csna/lists.html#class-l





--
+--------------------------------------------------------------------------------------+
 Carlos Alberto Estombelo Montesco
 PhD. Student in Physics Applied to Medicine and Biology
.......................................................................................
 University of Sao Paulo
 Department of Physics and Mathematics
 School of Philosophy, Sciences and Letters of Ribeirão Preto
 Av. Bandeirantes, 3900 CEP: 14040-901 Ribeirão Preto, SP, Brazil
 fax  : +55 16 3602 4887
 email: [EMAIL PROTECTED]
+--------------------------------------------------------------------------------------+
