I'd suggest you start by using lda() or qda() from MASS, benefits being that
(a) If the frequencies in the sample do not reflect the frequencies in the target population, you can set 'prior' to mirror the target frequencies. The question is, perhaps: is your odd person odd in a population of 1000 dog : 100 cat : 10 fish owners, or odd in, e.g., a 1000:1000:50 population? You can also vary the prior to see what the effect is. If, however, you set a large prior probability for a group that is poorly represented, results will be 'noisy'. Note that randomForest() uses 'classwt' for the prior probabilities.
(b) You can plot second versus first discriminant function scores to get a direct graphical representation of results. Other discrimination techniques may have to use an ordination technique, or even lda() or qda() on a >2 dimensional representation of results, in order to get a scatterplot. [cf. MDSplot() for randomForest()]
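A minimal sketch of points (a) and (b), on simulated data rather than the survey itself. The group centres, the unit-variance noise, the column names (dog_pref, cat_pref, fish_pref) and the "odd person" scores are all assumptions made up for illustration; only lda(), its 'prior' argument and predict() come from MASS.

```r
library(MASS)

set.seed(1)

# Simulated stand-in for the survey (an assumption, not real data):
# 1000 dog, 100 cat and 10 fish owners, each scoring three preferences.
n <- c(dog = 1000, cat = 100, fish = 10)
centre <- rbind(dog = c(4, 1, 1), cat = c(1, 4, 1), fish = c(1, 1, 4))
owner <- factor(rep(names(n), times = n), levels = names(n))
scores <- centre[as.character(owner), ] + matrix(rnorm(3 * sum(n)), ncol = 3)
dimnames(scores) <- list(NULL, c("dog_pref", "cat_pref", "fish_pref"))
pets <- data.frame(owner = owner, scores)

# (a) Default prior = sample proportions, versus a prior set to mirror a
# hypothetical 1000:1000:50 target population (order follows the factor levels).
target_prior <- c(1000, 1000, 50) / 2050
fit_sample <- lda(owner ~ ., data = pets)
fit_target <- lda(owner ~ ., data = pets, prior = target_prior)

# Posterior probabilities for one in-between person under each prior.
odd <- data.frame(dog_pref = 2.5, cat_pref = 2.5, fish_pref = 1)
post_sample <- predict(fit_sample, odd)$posterior
post_target <- predict(fit_target, odd)$posterior
print(round(rbind(sample = post_sample[1, ], target = post_target[1, ]), 3))

# (b) Second versus first discriminant function scores.
ld <- predict(fit_target)$x
plot(ld[, "LD1"], ld[, "LD2"], col = as.integer(pets$owner), pch = 20,
     xlab = "First discriminant", ylab = "Second discriminant")
legend("topright", legend = levels(pets$owner), col = 1:3, pch = 20)
```

Rerunning with other values of target_prior is one way to see how sensitive the classification of the in-between person is to the assumed population.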
John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2(6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.
On 5 Nov 2004, at 10:18 PM, [EMAIL PROTECTED] wrote:
From: Berton Gunter <[EMAIL PROTECTED]>
Date: 5 November 2004 5:08:38 AM
To: "'Dan Bolser'" <[EMAIL PROTECTED]>, "'R-help'" <[EMAIL PROTECTED]>
Cc:
Subject: RE: [R] highly biased PCA data?
Dan:
1) There is no guarantee that PCA will show separate groups, of course, as
that is not its purpose, although it is frequently a side effect.
2) If you were to use a classification method of some sort (discriminant
analysis, neural nets, SVMs, model-based classification, ...), my
understanding is that yes, severely unbalanced group membership
would indeed affect results. A guess is that Bayesian or other methods
that could explicitly model the prior membership probabilities would do
better. To make it clear why, suppose that there was a 99.9% preference for
"dog" and 0.05% each for the others. Then your dataset would have almost no
information on how covariates could distinguish the classes, and the best
classifier would be to call everything a "dog" no matter what values the
covariates had.
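To put rough numbers on the 99.9% example in (2), here is a small base-R sketch. The sample size of 100,000 and the simulated labels are assumptions for illustration only; the class proportions are the ones quoted above.

```r
set.seed(1)

# Class proportions from the example: 99.9% dog, 0.05% each for cat and fish.
p <- c(dog = 0.999, cat = 0.0005, fish = 0.0005)

# Simulated true labels (the sample size is an arbitrary assumption).
truth <- sample(names(p), size = 1e5, replace = TRUE, prob = p)

# The trivial classifier: call everything a "dog", ignoring all covariates.
always_dog <- rep("dog", length(truth))

# Overall accuracy, and the proportion of each true class labelled correctly.
accuracy <- mean(always_dog == truth)
recall_by_class <- tapply(always_dog == truth, truth, mean)

print(accuracy)
print(recall_by_class)
```

Overall accuracy sits close to the dog proportion while every cat and fish owner is misclassified, which is why accuracy alone says little when the groups are this unbalanced.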
I presume experts will have more and better to say about this.
-- Bert Gunter
[mailto:[EMAIL PROTECTED] On Behalf Of Dan Bolser
Sent: Thursday, November 04, 2004 9:41 AM
To: R mailing list
Subject: [R] highly biased PCA data?
Hello, suppose that I have two or three clear categories
for my data, let's say pet preference across fish, cat, dog. Let's say most
people rate their preference as being mostly one of the categories.
I want to do PCA on the data to see three 'groups' of people,
one group for fish, one for cat and one for dog. I would like to see
the odd person who likes two or all three in the (appropriate) middle of
the other main groups.
Will my results be affected by the fact that I have interviewed 1000 dog owners, 100 cat owners and 10 fish owners (assuming that each scale of preference has an equal range)?
Cheers, dan.
