I am no expert on these sorts of matters, but that has never stopped me from tossing in my $0.02...
As Gabor and Bert hinted, this is what I would try: Run randomForest on
the data, using sampsize=c(10, 10, 10) and importance=TRUE, for example.
Then take the few most important variables with respect to each class and
maybe do PCA on those to see if you can see separation.

HTH,
Andy

> From: Dan Bolser
>
> On Thu, 4 Nov 2004, Berton Gunter wrote:
>
> >Dan:
> >
> >1) There is no guarantee that PCA will show separate groups, of course, as
> >that is not its purpose, although it is frequently a side effect.
> >
> >2) If you were to use a classification method of some sort (discriminant
> >analysis, neural nets, SVMs, model-based classification, ...), my
> >understanding is that yes, severely unbalanced group membership
> >would indeed affect results. My guess is that Bayesian or other methods
> >that could explicitly model the prior membership probabilities would do
> >better. To make it clear why, suppose that there was a 99.9% preference for
> >"dog" and 0.05% for each of the others. Then your dataset would have almost
> >no information on how covariates could distinguish the classes, and the best
> >classifier would be to call everything a "dog" no matter what values the
> >covariates had.
> >
> >I presume experts will have more and better to say about this.
>
> Sounds interesting. Thanks very much for the input. Just out of curiosity,
> given that I can make my data more uniform (less biased), how could I best
> generate a 2d plot to encapsulate the clusters (and inter-cluster
> relationships)?
>
> Actually I am thinking of a 2d density.
>
> >-- Bert Gunter
> >Genentech Non-Clinical Statistics
> >South San Francisco, CA
> >
> >"The business of the statistician is to catalyze the scientific learning
> >process." - George E. P. Box
> >
> >> -----Original Message-----
> >> From: [EMAIL PROTECTED]
> >> [mailto:[EMAIL PROTECTED] On Behalf Of Dan Bolser
> >> Sent: Thursday, November 04, 2004 9:41 AM
> >> To: R mailing list
> >> Subject: [R] highly biased PCA data?
> >>
> >> Hello, supposing that I have two or three clear categories for my data,
> >> let's say pet preference across fish, cat, dog. Let's say most people rate
> >> their preference as being mostly one of the categories.
> >>
> >> I want to do PCA on the data to see three 'groups' of people, one group
> >> for fish, one for cat and one for dog. I would like to see the odd person
> >> who likes both or all three in the (appropriate) middle of the other main
> >> groups.
> >>
> >> Will my data be affected by the fact that I have interviewed 1000 dog
> >> owners, 100 cat owners and 10 fish owners? (assuming that each scale of
> >> preference has an equal range).
> >>
> >> Cheers,
> >> dan.

______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
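For what it is worth, Andy's suggestion at the top of this message (randomForest with sampsize=c(10, 10, 10) and importance=TRUE, then PCA on the most important variables per class) might look roughly like the sketch below. Everything about the data is an assumption made up for illustration: the six rating variables q1..q6, the simulated values, and the choice of two top variables per class. Only the class sizes (1000 dog, 100 cat, 10 fish) come from the thread.

```r
# Sketch only: the data below are simulated stand-ins, not Dan's survey.
library(randomForest)

set.seed(42)

# Class sizes from the thread: 1000 dog, 100 cat, 10 fish owners.
n   <- c(dog = 1000, cat = 100, fish = 10)
pet <- factor(rep(names(n), times = n), levels = names(n))

# Six hypothetical rating variables; only q1..q3 carry any class signal.
X <- matrix(rnorm(length(pet) * 6), ncol = 6,
            dimnames = list(NULL, paste0("q", 1:6)))
X[pet == "dog",  "q1"] <- X[pet == "dog",  "q1"] + 3
X[pet == "cat",  "q2"] <- X[pet == "cat",  "q2"] + 3
X[pet == "fish", "q3"] <- X[pet == "fish", "q3"] + 3

# Draw 10 cases from each class for every tree, so no class dominates.
rf <- randomForest(x = X, y = pet, strata = pet,
                   sampsize = c(10, 10, 10), importance = TRUE)

# With importance = TRUE the importance matrix has one column per class.
imp <- importance(rf)

# Pool the two most important variables for each class (2 is arbitrary).
top_vars <- unique(unlist(lapply(levels(pet), function(k) {
  names(sort(imp[, k], decreasing = TRUE))[1:2]
})))

# PCA on the selected variables only, then plot the first two components.
pc <- prcomp(X[, top_vars], scale. = TRUE)
plot(pc$x[, 1:2], col = as.integer(pet), pch = 20,
     main = "PCA on variables picked by randomForest")
legend("topright", legend = levels(pet),
       col = seq_along(levels(pet)), pch = 20)
```

Whether the groups separate in the plot depends entirely on the real data; as Bert notes, PCA is not designed to find groups, so this is a way to look, not a guarantee of seeing anything.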

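Bert's point about the "call everything a dog" classifier can be checked with a little arithmetic. The sketch below uses the class counts from Dan's question (1000 dog, 100 cat, 10 fish) rather than Bert's 99.9% example, and the always-"dog" rule is a deliberately naive stand-in, not any particular method from the thread.

```r
# Class counts from Dan's original question.
counts <- c(dog = 1000, cat = 100, fish = 10)
truth  <- factor(rep(names(counts), times = counts), levels = names(counts))

# A "classifier" that ignores the covariates and always answers "dog".
pred <- factor(rep("dog", length(truth)), levels = levels(truth))

# Overall share of correct calls, and the hit rate within each true class.
accuracy <- mean(pred == truth)
recall   <- tapply(pred == truth, truth, mean)

print(round(accuracy, 3))
print(recall)
```

Overall accuracy comes out at about 90% (1000 of 1110), while every cat and fish owner is misclassified, which is why overall accuracy on its own says little when the groups are this unbalanced.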