In article <[EMAIL PROTECTED]>,
Rich Ulrich <[EMAIL PROTECTED]> wrote:
>On 21 Nov 2003 12:44:23 -0800, [EMAIL PROTECTED] (Chih-Mao Hsieh) wrote:
>> Dear Edstat-listers,
>> I have 8 variables per observation, all count data
>> (integers>0), and I want to be able to run an R factor
>> analysis to obtain factor scores. The data have the
>> following attributes:
>> (1) Hundreds of thousands of observations at my disposal, from which I can sample
>> if necessary.
>> (2) Significantly non-normal, apparently not very amenable to transformations
Normality is essentially irrelevant to the validity of
factor models. What matters is linearity, and it is this
which essentially excludes count data.
>> (3) Significant portions of the observations have zeros "across the board"
>I want to discuss your (3). For data that I use (symptoms, etc.),
>there is an ordinary, 1st Principal Component where everything is
>positively correlated.
I can see no justification for doing principal components.
The result depends heavily on the scales of the variables, and using
correlations instead of covariances usually introduces huge
sampling problems. The use of principal components between
two sets of variables does make sense, assuming linearity is
appropriate.
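
The scale dependence mentioned above can be seen in a minimal sketch. It assumes two made-up variables (the values of x and y below are illustrative only, not data from this thread) and uses the closed form for the leading eigenvector of a 2x2 covariance matrix. The helper names cov2 and first_pc are hypothetical.

```python
import math

def cov2(xs, ys):
    """Sample variances and covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, syy, sxy

def first_pc(xs, ys):
    """Unit vector along the first principal component (covariance-based)."""
    sxx, syy, sxy = cov2(xs, ys)
    # Angle of the major axis of a symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

# Made-up illustrative counts.
x = [0, 1, 1, 2, 3, 5, 8]
y = [1, 0, 2, 2, 4, 4, 9]

pc_raw = first_pc(x, y)
# Same data, but y recorded in units 100 times smaller.
pc_scaled = first_pc(x, [100 * v for v in y])

print("raw units:   ", pc_raw)
print("y rescaled:  ", pc_scaled)
```

With the raw units the first component loads on both variables about equally. After rescaling y it points almost entirely along y, although nothing about the underlying relationship has changed.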
>I'm doing factoring on patients, where none
>of them are all-zeros. If I had a sub-sample with zeros across the
>board, I think nobody would mind if I dropped them, without much
>further justification.
>Now, zeros are a special concern. It happens, at times, that the
>gap between 0 and 1 could be considered as much larger
>than any of the other counts -- number of prior heart attacks,
>number of pregnancies, and so on.
What you should be saying is that one should think and
formulate a probability model, instead of letting the
computer's procedures attempt to do your thinking for
you. This means that you probably cannot use a cookbook;
the crude linear models for which simple techniques
exist are not reasonable answers to the real questions.
>It *might* be sensible
>to consider transforming all your 8 variables to 0/1, and
>considering the associations among those. I'm certainly
>not saying that this would be your only analysis, but I can
>imagine data where those crosstabulations could be the
>most interesting way to look at the data.
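
The 0/1 recoding and crosstabulation suggested in the quoted text could be sketched as follows. This is only an illustration: the two count variables and their values are made up, the helper names dichotomize and crosstab are hypothetical, and the odds ratio is just one possible summary of association in a 2x2 table.

```python
from collections import Counter

def dichotomize(counts):
    """Recode counts to 0/1: 0 stays 0, any positive count becomes 1."""
    return [1 if c > 0 else 0 for c in counts]

def crosstab(a, b):
    """2x2 table of two 0/1 variables, keyed by (a_value, b_value)."""
    tally = Counter(zip(a, b))
    return {(i, j): tally[(i, j)] for i in (0, 1) for j in (0, 1)}

# Made-up illustrative counts for two of the eight variables.
var_a = [0, 0, 1, 3, 0, 2, 0, 1]
var_b = [0, 2, 1, 0, 0, 4, 0, 3]

table = crosstab(dichotomize(var_a), dichotomize(var_b))

# Odds ratio as a simple association measure; it is undefined
# when an off-diagonal cell is zero, so guard against that.
off_diag = table[(0, 1)] * table[(1, 0)]
odds_ratio = table[(0, 0)] * table[(1, 1)] / off_diag if off_diag else None

print(table)
print("odds ratio:", odds_ratio)
```

With 8 variables there are 28 such pairwise tables. Observations that are zero on every variable all fall in the (0, 0) cell of each table, so whether they are kept or dropped changes every one of these associations.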
--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
[EMAIL PROTECTED] Phone: (765)494-6054 FAX: (765)494-0558