Date sent: Sat, 22 Apr 2000 13:31:50 -0400 (EDT)
From: "Donald F. Burrill" <[EMAIL PROTECTED]>
To: Kermit Rose <[EMAIL PROTECTED]>
Copies to: [EMAIL PROTECTED]
Subject: Re: categorical data analysis
Hello Donald!
>
> [ KR: Please post any response to the edstat list as well as to me.
> I may not have the leisure to continue this conversation, and others on
> the list may have better advice for you in any case. -- DFB. ]
OK. edstat is included on the reply list.
>
> To begin with, you have not told us what the professor in criminology is
> trying to find out, and why; without that information, no-one can offer
> you (or the professor) useful advice on data analysis.
>
The dependent variable is based on the following question:
Recently the problem of overcrowding in the state university system
in Florida has been the subject of considerable debate. Some have
suggested the elimination of preferential treatment in college
admissions as a potential remedy. In your opinion, which of the
following groups should continue receiving preferential
treatment in the admissions process? (1 = yes, 0 = no)
q45a athletes
q45b national merit scholars (honor students)
q45c economically disadvantaged
q45d historically disadvantaged
q45e children of wealthy benefactors
q45f children of university alumni
q45g ethnic and racial minorities
q45h students with disabilities
q45i students with prior criminal records
q45j students with unique artistic talents
The goal is to determine whether or not a person's choices on the
10 questions are predictable from independent variables such as
education, age, race, income, marital status, political party, sex,
racial attitudes, etc.
> Your proposed procedure seems unnecessarily cumbersome. So far as I can
> tell from all that algebra, you're effectively substituting a whole bunch
> of 2x2 tables for a single RxC table (R = number of rows, C = number of
> columns) with R>2 and(/or?) C>2. Or, for each of several RxC tables.
Yes. This is the intent. I wanted to reduce the R by C table to a
series of 2 by 2 tables.
> Why do you not first do the obvious contingency table chi-square
> to see if there's anything worth following up? (And if I were doing it,
> the follow-up(s) would be in the RxC format as well.)
It is because the dependent variable is multivalued. A person may
check anywhere from 0 to 10 of the preferences. I want to be able
to identify which of the choices were checked, not just the number
of them. There is no theory for weighting the choices before the
analysis is complete. I wanted to construct a method for
systematically examining all 1023 possible choices of dependent
variable.
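To make this concrete, here is a rough Python sketch of the
enumeration I have in mind (only the q45 item names above are
used; the code itself is illustrative, not part of the analysis plan):

    from itertools import combinations

    # The ten 0/1 dummy dependent variables (the q45 survey items above).
    choices = ["q45a", "q45b", "q45c", "q45d", "q45e",
               "q45f", "q45g", "q45h", "q45i", "q45j"]

    # Every nonempty subset of the ten choices defines one candidate
    # model dependent variable: 2**10 - 1 = 1023 subsets in all.
    subsets = [combo
               for k in range(1, len(choices) + 1)
               for combo in combinations(choices, k)]

    print(len(subsets))   # 1023
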
> Your #1 null hypothesis is that two dichotomized variables are
> independent. I don't believe the test statistic you propose -- more about
> that later -- but this is the formal null hyp. for the usual contingency
> table chi square.
>
I can show the derivation of the test statistic. I made a typo when
typing the formula and have given the corrected version below.
Do you mean that I've stated correctly that the null hypothesis for
the chi-square test on a crosstab is that the variables are
independent?
> > The dependent variable is a multivalued categorical variable.
> So far, so good.
>
> > The model dependent variable is an interaction variable. It is the
> > interaction of some subset of the 10 two-valued dummy variables
> > representing the dependent variable. The 10 values of the dependent
> > variable are choices of preferential treatment for affirmative action.
> Here it begins to get sticky. I cannot tell whether you mean the
> same thing by "interaction" that I would mean. In particular,
> there seems to be no difference between "interaction variable",
> in your terms, and "indicator variable", in my terms.
I'm not sure what you mean by "indicator variable". I've only seen
the term in connection with latent variables in structural equation
modeling.
I'm pretty sure my use of "interaction" is consistent with your use
of "interaction".
Suppose I chose the 1st, 3rd, and 5th dependent variables as my
subset to represent the model dependent variable. I chose the
word "interaction" because I expected readers would know I meant
the product of the chosen dependent variables. Since each
dependent variable is 0 or 1, the product will be 1 if and only if
every one of the chosen dependent variables is 1.
That is, for the model corresponding to dependent variables 1, 3,
and 5, I would call the model dependent variable true (meaning = 1)
exactly when all three of those dependent variables equal 1. For this
model I would not care what dependent variables 2, 4, 6, 7, 8, 9,
and 10 were.
When I implement this in programming I would of course work out
efficient ways to calculate the frequencies. It will not be trivial, but I
have a general idea how to do it.
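A minimal sketch of that product variable, assuming each case is
stored as a dictionary of 0/1 answers keyed by the q45 names (the
storage format is just an assumption for illustration):

    def model_dependent(case, chosen=("q45a", "q45c", "q45e")):
        """Product of the chosen dummies: 1 only if every one of
        them equals 1; the other seven items are ignored."""
        value = 1
        for name in chosen:
            value *= case[name]          # each value is 0 or 1
        return value

    # Items 1, 3 and 5 are all 1 here, so the model variable is 1.
    case = {"q45a": 1, "q45b": 0, "q45c": 1, "q45d": 0, "q45e": 1,
            "q45f": 0, "q45g": 1, "q45h": 0, "q45i": 0, "q45j": 0}
    print(model_dependent(case))         # 1
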
>
> > The model independent variable is also an interaction variable. It is
> > the interaction of two-valued dummy variables representing some subset
> > of the predicting variables.
>
> < snip, details of standard construction of indicator variables >
>
> > Suppose R is an ordinal level variable with values r1 < r2 < r3 < r4.
> > Then R is converted to three variables S1,S2,S3 with
> >
> > S1 = 1 if R = r1 and S1 = 0 otherwise.
> > S2 = 1 if R = r1 or r2 and S2 = 0 otherwise.
> > S3 = 1 if R = r1 or r2 or r3 and S3 = 0 otherwise.
>
> Possible, but doesn't seem necessary. Why not leave R as it is instead of
> constructing dichotomies?
>
Because I'm focused on reducing the analysis to comparing the
results of a series of 2 by 2 tables.
The goal will be to have the computer analyse the 2 by 2 tables
and give summary results to the researcher.
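For instance, the S1, S2, S3 construction quoted above is easy to
automate. A rough sketch, assuming the ordered values
r1 < r2 < r3 < r4 are simply coded 1, 2, 3, 4:

    def dichotomize(R):
        """Convert ordinal R, coded 1 (= r1) through 4 (= r4),
        into the three dummies S1, S2, S3 defined above:
        Sk = 1 exactly when R is at or below the k-th value."""
        return tuple(1 if R <= k else 0 for k in (1, 2, 3))

    print(dichotomize(2))   # (0, 1, 1): R = r2, so S2 = S3 = 1
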
> < snip, details of argumentation >
>
> > We define parameters t,I,D and N as follows.
> >
> > t is the number of cases where both the model independent and model
> > dependent variable are true.
> >
> > I is the number of cases where the model independent variable is true.
> >
> > D is the number of cases where the model Dependent variable is true.
> >
> > N is the total number of cases.
>
> Somewhat more briefly, and a good deal more clearly for me, you are
> defining a 2x2 table thus:
>
>    Independent          Dependent variable
>     variable             1          0     |  TOTAL
>   ----------------+-----------------------+--------
>          1        |      t          .     |    I
>          0        |      .          .     |    .
>   ----------------+-----------------------+--------
>        TOTAL      |      D          .     |    N
>
> with the values marked "." determined by subtraction.
Yes. This is an excellent summary.
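Here is a rough sketch of how I would tally t, I, D and N for one
model, assuming each case has already been reduced to a pair of 0/1
values (model independent, model dependent):

    def table_counts(cases):
        """cases: list of (independent, dependent) pairs, each 0 or 1.
        Returns the four parameters t, I, D, N defined above."""
        N = len(cases)
        I = sum(ind for ind, dep in cases)
        D = sum(dep for ind, dep in cases)
        t = sum(1 for ind, dep in cases if ind == 1 and dep == 1)
        return t, I, D, N

    # The cells marked "." follow by subtraction:
    # I - t, D - t, and N - I - D + t.
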
> (You defined a "Col pct" as t/D, which is not a % but a proportion.)
>
Right. As a mathematician I completely ignore the distinction
between % and proportion.
> < snip, tedious definition of above table >
>
> < snip, definitions of covariance, variances, r^2;
> which I did not really follow >
>
I could show the derivations which might make the definitions
meaningful if you wish. I put them in only to show that it was
possible to calculate them. I did not expect readers to follow why
the formulas were valid.
> > Chisquare of crosstab of model independent variable with model
> > dependent variable is
> >
> > (t - D*I/N)*( N/[D*I] + N/[I*(N-D)] + N/[D*(N-I)] + N/[(N-I)*(N-D)] )
>
> No, I don't think so. Your formula is of the form A*B. B is
> nonnegative. A may be positive or negative, so the product is positive or
> negative depending on A. Chisquare cannot be negative.
>
You make an excellent observation. In fact the factor which you
call A contains a typo. In my derivation, A is the term
(t - D*I/N)^2
so both A and B are supposed to be nonnegative.
The corrected formula is:
(t - D*I/N)^2 *( N/[D*I] + N/[I*(N-D)] + N/[D*(N-I)] + N/[(N-I)*(N-D)] )
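As a check, this corrected formula agrees with the usual chi-square
computed as the sum of (O - E)^2/E over the four cells. A small
sketch with made-up counts:

    def chisq_formula(t, I, D, N):
        """The corrected formula given above."""
        return (t - D * I / N) ** 2 * (
            N / (D * I) + N / (I * (N - D)) +
            N / (D * (N - I)) + N / ((N - I) * (N - D)))

    def chisq_cells(t, I, D, N):
        """Usual chi-square: sum of (O - E)^2 / E over the 2x2 cells."""
        obs = [t, I - t, D - t, N - I - D + t]
        exp = [I * D / N, I * (N - D) / N,
               (N - I) * D / N, (N - I) * (N - D) / N]
        return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

    t, I, D, N = 30, 60, 50, 200          # made-up counts
    print(chisq_formula(t, I, D, N))      # about 28.57
    print(chisq_cells(t, I, D, N))        # same value
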
> > The significance number that is calculated for a statistic is the
> > predicted probability that the null hypothesis is true.
>
> I very much doubt it. This certainly does not conform to the standard
> statistical definition of "level of significance", which I assume to be
> what you want us to understand by "significance number".
>
I would like to discuss this point more. Why does it not conform to
the standard statistical definition of level of significance? We use
the p-value calculated for a statistic by comparing it to a preset
alpha. Alpha is the probability of a Type I error, which is rejecting
the null hypothesis when it is true. It seemed
intuitive to me that we had to say p was the predicted probability
of the null hypothesis being true, and that we would reject the null
hypothesis if this predicted probability fell below alpha.
Does this make sense to you, or am I falling into some standard
trap of beginning statistics students?
> > There are two different null hypotheses of interest.
> >
> > null_1:
> > The dependent variable does not depend on independent variable.
>
> Discussed above.
>
> > The significance of null_1 is (D - t)/N
>
> Can't imagine why it would be.
Null 1 is the hypothesis that the model independent variable does
not imply the model dependent variable. This means that Null 1 is
the hypothesis that the model independent variable is 1 and the model
dependent variable is 0. The proportion of cases for which this is
true is (D - t)/N.
> In any case, one doesn't speak of
> "the significance of an hypothesis" (null or otherwise), one speaks of the
> significance of a test, or of a test statistic; and one is thereby
> referring to a sampling distribution of the test statistic in question.
> You seem to have no sampling distribution in mind.
>
I had noticed this. It just felt more expedient to say significance of
the null hypothesis. I will try to explain why.
Because I had concluded that the standard p-value, which we call
significance, was really the predicted probability of the null
hypothesis being true, then
(1) I had concluded that the probability of the null hypothesis being
true could be calculated directly from the data rather than from a
presumed theoretical sampling distribution. The data sample itself
is the sampling distribution I had in mind.
(2) There were two null hypotheses and therefore two different tests,
distinguished only by having different null hypotheses.
I still think it a valid generalization in this case. My test statistic
(the predicted probability of the null hypothesis being true) is its
own significance!
> > null_2:
> >
> > There is not a bidirectional relationship between the independent
> > variable and the dependent variable.
>
> What do you mean by a "bidirectional relationship"? Whatever you mean, it
> cannot be different, so far as I can see, from null_1, for a 2x2 table.
By bidirectional relationship I mean logical equivalence.
Null_2 is the hypothesis that the model independent variable = 0 if
and only if the model dependent variable = 1.
The negation of Null_2 is that the model independent variable = the
model dependent variable.
This corresponds to wanting the independent variable to imply the
dependent variable, and the converse that the dependent variable
also implies the independent variable. That's why I called it
bidirectional.
This is a stronger requirement than you might wish for some
associations.
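In terms of the counts in your table, a small sketch of what I mean
by equivalence versus non-equivalence (the counts are made up):

    def equivalence_counts(t, I, D, N):
        """Cases where the two model variables agree (both 1 or both 0)
        versus disagree (exactly one of them is 1)."""
        agree = t + (N - I - D + t)
        disagree = (I - t) + (D - t)
        return agree, disagree

    print(equivalence_counts(30, 60, 50, 200))   # (150, 50)
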
> You've only got one degree of freedom for detecting whether there is ANY
> relationship of any kind; you have zero d.f. for detecting whether a
> relationship, once found, is of one kind or another.
>
I have one degree of freedom for the Null_1 analysis.
The Null_2 analysis is really a separate analysis that uses the
same one degree of freedom.
If Null_1 is true then Null_2 is also true.
If Null_2 is false, then Null_1 is also false.
> ------------------------------------------------------------------------
> Donald F. Burrill [EMAIL PROTECTED]
> 348 Hyde Hall, Plymouth State College, [EMAIL PROTECTED]
> MSC #29, Plymouth, NH 03264 603-535-2597
> 184 Nashua Road, Bedford, NH 03110 603-471-7128
>
>
>
Kermit
[EMAIL PROTECTED]