I'd suggest you start by using lda() or qda() from MASS.
The benefits are that

(a) if the frequencies in the sample do not reflect the frequencies
in the target population, you can set 'prior' to mirror the target
frequencies.  The question is, perhaps: is your odd person odd in
a population of 1000 dog : 100 cat : 10 fish owners, or odd, e.g., in
a 1000:1000:50 population?  You can also vary the prior to see
what the effect is.  If, however, you set a large prior probability for
a group that is poorly represented, results will be 'noisy'.  Note
the use of 'classwt' for the prior probabilities for randomForest().
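
To make (a) concrete, here is a minimal sketch of setting 'prior' in lda().
The data are simulated purely for illustration (the data frame 'pets', its
column names and the group means are my assumptions, not the poster's data).
The 'prior' values must be given in the order of the factor levels.

```r
library(MASS)

set.seed(1)
# Hypothetical simulated sample: 1000 dog, 100 cat and 10 fish owners,
# each scored on three preference scales.
n <- c(dog = 1000, cat = 100, fish = 10)
owner <- factor(rep(names(n), times = n), levels = names(n))
sim_pref <- function(pet) {
  rnorm(length(owner), mean = ifelse(owner == pet, 8, 2), sd = 2)
}
pets <- data.frame(owner = owner,
                   dog_pref = sim_pref("dog"),
                   cat_pref = sim_pref("cat"),
                   fish_pref = sim_pref("fish"))

# Default: the prior is taken from the sample proportions.
fit_sample <- lda(owner ~ ., data = pets)

# Alternative: a prior that mirrors a 1000:1000:50 target population,
# given in the order of levels(pets$owner).
target <- c(dog = 1000, cat = 1000, fish = 50)
fit_target <- lda(owner ~ ., data = pets,
                  prior = as.numeric(target / sum(target)))

# Cross-tabulate the predicted classes under the two priors.
pred_sample <- predict(fit_sample, pets)$class
pred_target <- predict(fit_target, pets)$class
print(table(sample_prior = pred_sample, target_prior = pred_target))
```

predict() for an lda fit also accepts a 'prior' argument, so you can vary
the prior at prediction time without refitting.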

(b) You can plot second versus first discriminant function scores,
to get a direct graphical representation of results.
Other discrimination techniques may have to use an ordination
technique, or even lda() or qda() on a >2 dimensional representation
of results, in order to get a scatterplot.
[cf MDSplot() for randomForest()]
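
A sketch of the plot described in (b), again on simulated data (the data
frame 'pets' and its columns are assumptions made up for the example).
With three groups there are at most two discriminant functions, so the
scores fit on a single scatterplot.

```r
library(MASS)

set.seed(1)
# Hypothetical simulated sample, as an illustration only.
n <- c(dog = 1000, cat = 100, fish = 10)
owner <- factor(rep(names(n), times = n), levels = names(n))
sim_pref <- function(pet) {
  rnorm(length(owner), mean = ifelse(owner == pet, 8, 2), sd = 2)
}
pets <- data.frame(owner = owner,
                   dog_pref = sim_pref("dog"),
                   cat_pref = sim_pref("cat"),
                   fish_pref = sim_pref("fish"))

fit <- lda(owner ~ ., data = pets)

# Discriminant function scores: one column per discriminant (LD1, LD2).
ld <- predict(fit, pets)$x

# Second versus first discriminant scores, marked by group.
grp <- as.integer(pets$owner)
plot(ld[, "LD1"], ld[, "LD2"], col = grp, pch = grp,
     xlab = "First discriminant", ylab = "Second discriminant")
legend("topright", legend = levels(pets$owner),
       col = seq_along(levels(pets$owner)),
       pch = seq_along(levels(pets$owner)))
```

Calling plot(fit) directly gives a similar display via the plot method
that MASS supplies for lda objects.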

John Maindonald             email: [EMAIL PROTECTED]
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.


On 5 Nov 2004, at 10:18 PM, [EMAIL PROTECTED] wrote:

From: Berton Gunter <[EMAIL PROTECTED]>
Date: 5 November 2004 5:08:38 AM
To: "'Dan Bolser'" <[EMAIL PROTECTED]>, "'R-help'" <[EMAIL PROTECTED]>
Cc:
Subject: RE: [R] highly biased PCA data?


Dan:

1) There is no guarantee that PCA will show separate groups, of course, as
that is not its purpose, although it is frequently a side effect.


2) If you were to use a classification method of some sort (discriminant
analysis, neural nets, SVMs, model-based classification, ...), my
understanding is that yes, severely unbalanced group membership
would indeed affect results. A guess is that Bayesian or other methods
that could explicitly model the prior membership probabilities would do
better. To make it clear why, suppose that there was a 99.9% preference for
"dog" and .05% each for the others. Then your datasets would have almost no
information on how covariates could distinguish the classes, and the best
classifier would be to call everything a "dog" no matter what values the
covariates had.
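
As a rough illustration of this point, the sketch below compares the
"call everything a dog" rule with per-group results from lda(). The data
are simulated and every name in it ('pets', the column names, the group
means) is an assumption for the example. Overall accuracy can look good
while a small group is poorly classified, so the per-group rates are the
ones to look at.

```r
library(MASS)

set.seed(1)
# Hypothetical simulated sample with the 1000:100:10 group sizes.
n <- c(dog = 1000, cat = 100, fish = 10)
owner <- factor(rep(names(n), times = n), levels = names(n))
sim_pref <- function(pet) {
  rnorm(length(owner), mean = ifelse(owner == pet, 8, 2), sd = 2)
}
pets <- data.frame(owner = owner,
                   dog_pref = sim_pref("dog"),
                   cat_pref = sim_pref("cat"),
                   fish_pref = sim_pref("fish"))

# Accuracy of always predicting the largest group: 1000/1110, about 0.90.
baseline <- max(n) / sum(n)

# Leave-one-out cross-validated predictions from lda().
cv <- lda(owner ~ ., data = pets, CV = TRUE)
tab <- table(actual = pets$owner, predicted = cv$class)

# Proportion of each actual group that is classified correctly.
per_group <- diag(tab) / rowSums(tab)
print(baseline)
print(tab)
print(per_group)
```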


I presume experts will have more and better to say about this.

-- Bert Gunter


[mailto:[EMAIL PROTECTED] On Behalf Of Dan Bolser
Sent: Thursday, November 04, 2004 9:41 AM
To: R mailing list
Subject: [R] highly biased PCA data?

Hello, supposing that I have two or three clear categories
for my data, let's say pet preference across fish, cat, dog. Let's say most
people rate their preference as being mostly one of the categories.


I want to do PCA on the data to see three 'groups' of people,
one group for fish, one for cat and one for dog. I would like to see
the odd person who likes both or all three in the (appropriate) middle of
the other main groups.


Will my data be affected by the fact that I have interviewed 1000 dog
owners, 100 cat owners and 10 fish owners? (assuming that
each scale of preference has an equal range).

Cheers,
dan.

______________________________________________ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
