It is 0045 Easter Day, and I am a church organist. May not have time
to address all the points you raise. If I leave some out, perhaps
someone else on the list will discuss further...
On Sat, 22 Apr 2000 [EMAIL PROTECTED] wrote:
> The dependent variable is based on the following question.
>
> Recently the problem of overcrowding in the state university system
> in Florida has been the subject of considerable debate. Some have
> suggested the elimination of preferential treatment in college
> admissions as a potential remedy. In your opinion, which of the
> following groups listed should continue receiving preferential
> treatment in the admissions process? (1 = yes, 0 = no)
>
> q45a athletes (1, 0)
> q45b national merit scholars (honor students) (2, 0)
> q45c economically disadvantaged (3, 0)
> q45d historically disadvantaged (4, 0)
> q45e children of wealthy benefactors (5, 0)
> q45f children of university alumni (1, 0)
> q45g ethnic and racial minorities (2, 0)
> q45h students with disabilities (3, 0)
> q45i students with prior criminal records (4, 0)
> q45j students with unique artistic talents (5, 0)
>
> The goal is to determine whether or not a person's choices on the
> 10 questions are predictable in terms of independent variables such
> as education, age, race, income, marital status, political party,
> sex, racial attitudes, etc.
Ah. Now I see what you meant by "multivalued". It hadn't registered
earlier. What I would do is first search to see what patterns actually
occur. Saves defining several hundred 2x2 tables that don't exist in the
data, for example. Starting with your variables q45a-q45e, I'd recode as
shown above. (This recoding is not _logically_ necessary, but it sure
makes it easier for humans to identify patterns.) I would then either
re-read the data as a 5-digit number, if that were convenient in the
available software, or construct a 5-digit number from values extant in
the file, thus (using your extant (1,0) codes):
let q45.1 = q45a*10000 + q45b*2000 + q45c*300 + q45d*40 + q45e*5
There are 32 possible values of q45.1:
0 300 2000 2300 10000 10300 12000 12300
5 305 2005 2305 10005 10305 12005 12305
40 340 2040 2340 10040 10340 12040 12340
45 345 2045 2345 10045 10345 12045 12345
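(If it helps, the same construction is a one-liner in a general-purpose
language as well; a minimal sketch in Python, assuming each respondent's
five answers are available as 0/1 values named a, b, c, d, e -- my
notation for illustration, not yours:
    def q45_code(a, b, c, d, e):
        # Combine five 0/1 responses into one 5-digit pattern code.
        return a*10000 + b*2000 + c*300 + d*40 + e*5
    q45_code(1, 0, 1, 1, 0)   # -> 10340, i.e. items a, c, d checked
The equivalent transformation statement exists in most packages.)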
(Equivalently, you could define this variable in binary rather than
decimal code:
let q45.1b = q45a*16 + q45b*8 + q45c*4 + q45d*2 + q45e
which produces values from 0 to 31 by consecutive integers; and you may
prefer this.) Me, I find the 5-digit codes informative -- I can see at a
glance which items have been checked by the respondent. As a mild
refinement on presentation, the output of a frequency-counting routine
can be edited to substitute "." for every "0", and I find these patterns
even clearer to read & interpret:
. 3.. 2... 23.. 1.... 1.3.. 12... 123..
5 3.5 2..5 23.5 1...5 1.3.5 12..5 123.5
4. 34. 2.4. 234. 1..4. 1.34. 12.4. 1234.
45 345 2.45 2345 1..45 1.345 12.45 12345
Of course, in a system like SPSS you could code the responses in 5-digit
binary and assign value labels as above.
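(Purely as an illustration, not any particular package's syntax, the
"." substitution is also a one-liner in Python, under the same
assumptions as the sketch above:
    def dotted(code):
        # Right-justify the pattern code in 5 columns, then show "."
        # in place of every "0".
        return format(code, "5d").replace("0", ".")
    dotted(2300)    # -> " 23.."
    dotted(10340)   # -> "1.34."
Run over the value column of a frequency table, that reproduces the
display just shown.)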
Now do the same for q45f-q45j, to produce q45.2. Cross-classify
these two new variables. (Maximum size of table: 32x32 = 1024 cells.
Actual size: less, because some cells, and probably also some rows and
columns, will be empty. This will of course necessarily be the case if
N < 1000, but even if N = 50,000 it would be surprising if ALL possible
combinations were chosen.)
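(A sketch of that cross-classification in Python, with made-up pattern
codes for a handful of respondents -- counting only the combinations
that actually occur, so the empty cells never need to be defined:
    from collections import Counter
    codes1 = [12345, 300, 0, 300, 10340]   # hypothetical q45.1 values
    codes2 = [0, 2005, 0, 2005, 45]        # hypothetical q45.2 values
    table = Counter(zip(codes1, codes2))
    for (c1, c2), n in sorted(table.items()):
        print(c1, c2, n)
This carries all the information the full 32 x 32 layout would carry.)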
For the question described, it is not clear to me that any
inferential statistics is necessary. What more does the client need than
to be able to say things like
"Of 3219 respondents, at most 53% favored eliminating
preferential treatment for any one of the ten categories; by category,
53% favored e.p.t. for students with prior criminal records,
48% " " " students with disabilities,
. . .
5% " " " children of university alumni" ?
(I'd list them in descending order of %). In a technical appendix, one
might want to add the 95% margins of error for these various %s, but I
cannot see a need for hypothesis testing. (What _substantive_ hypotheses
would be interesting to test? Answers to this question are NOT in the
form "one dichotomy is independent of another dichotomy", but in language
that makes sense to the client.)
These %s would include those who chose this category only AND those who
chose this category in combination with one or more others. Combinations
that were favored by an interestingly large % should also be mentioned,
of course, but in another paragraph entirely.
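(For that technical appendix, the margin of error is just the usual
normal approximation for a proportion; a Python sketch, using the
made-up figures from the illustration above:
    import math
    def margin_of_error(p_hat, n, z=1.96):
        # 95% margin of error for a sample proportion (normal approx.).
        return z * math.sqrt(p_hat * (1 - p_hat) / n)
    print(round(100 * margin_of_error(0.53, 3219), 1))   # -> 1.7 points
Nothing fancier is needed unless some of the %s are very close to 0 or
100.)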
I had grumbled:
> > ... So far as I can tell from all that algebra, you're effectively
> > substituting a whole bunch of 2x2 tables for a single RxC table
> > (R = number of rows, C = number of columns) with R>2 and(/or?) C>2.
> > Or, for each of several RxC tables.
>
> Yes. This is the intent. I wanted to reduce the R by C table to a
> series of 2 by 2 tables.
Yes, well, I still don't see why this is advantageous. Or perceived as
advantageous.
> > Why do you not first do the obvious contingency table chi-square
> > to see if there's anything worth following up? (And if I were doing
> > it, the follow-up(s) would be in the RxC format as well.)
>
> It is because the dependent variable is multivalued. A person may
> check 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 of the preferences. I want
> to be able to identify which of the choices was checked, not just the
> number of them. There is no theory for weighting the choices before
> the analysis is complete. I wanted to construct a method for
> systematically examining all 1023 possible choices of dependent
> variable.
This, however, doesn't require 1023 hypothesis tests. See above.
< snip, discussion of error in chi square formula >
> Do you mean that I've stated correctly that the null hypothesis for
> the chi square test on crosstab is that the variables are
> independent?
Yup. Why do you sound so surprised?
< snip >
> > Here it begins to get sticky. I cannot tell whether you mean the
> > same thing by "interaction" that I would mean. In particular,
> > there seems to be no difference between "interaction variable",
> > in your terms, and "indicator variable", in my terms.
>
> I'm not sure what you mean by "indicator variable". I've only seen
> the term in connection with latent variables in structural equation
> modeling.
Sometimes called ("mis-called", Joe Ward would say) "dummy variables",
value = 1 (if a member of the indicated set) or 0 (if not). More
precisely, I should have written "product of indicator variables" in
the last line of my paragraph just above.
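(Concretely -- a Python illustration only, with invented categories:
    # Indicator ("dummy") variables: 1 if the case belongs to the
    # indicated set, 0 if not.
    respondent = {"sex": "F", "party": "D"}
    female = 1 if respondent["sex"] == "F" else 0
    democrat = 1 if respondent["party"] == "D" else 0
    # "Product of indicator variables": indicates membership in the
    # intersection of the two sets.
    female_democrat = female * democrat
That product is what gets entered into a linear model when one wants to
examine the interaction, as discussed below.)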
> I'm pretty sure my use of "interaction" is consistent with your use
> of "interaction".
Well, it is customary to _model_ interaction in linear models by
multiplying together the variables whose interaction is to be examined.
But strictly speaking, the "interaction variable" is the [logical?] part
of this product that is not correlated with, or "explained by", those
variables (or by their lower-order interactions). But as remarked above,
I do not see what mileage you can get out of testing hypotheses, and the
interesting information, in my view, is the proportions (or %s) of the
sample who chose various combinations... AND, quite probably, how those
proportions (or %s) change according to characteristics of the
respondents (which I assume to be another part of the enterprise:
certainly MY university would be interested in knowing whether alumni,
faculty, students, potential benefactors, ... expressed similar opinions;
wouldn't EXPECT them to, but to the extent that they did it would
simplify some of the decision-making).
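(For completeness, the "strictly speaking" construction above -- the
part of the product not explained by its components -- can be obtained
by regressing the product on the component variables and keeping the
residual. A Python sketch with numpy, on made-up 0/1 data:
    import numpy as np
    rng = np.random.default_rng(1)
    x1 = rng.integers(0, 2, 100).astype(float)   # two invented indicators
    x2 = rng.integers(0, 2, 100).astype(float)
    product = x1 * x2
    # Regress the product on a constant, x1 and x2; the residual is the
    # part of the product not "explained by" those variables.
    X = np.column_stack([np.ones_like(x1), x1, x2])
    beta, *_ = np.linalg.lstsq(X, product, rcond=None)
    interaction = product - X @ beta
But, as I say, I'd spend the effort on the descriptive proportions
first.)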
< snip, details of "interactions" ... >
> > > Suppose R is an ordinal level variable with values r1 < r2 < r3 < r4.
> > > Then R is converted to three variables S1,S2,S3 with
> > >
> > > S1 = 1 if R = r1 and S1 = 0 otherwise.
> > > S2 = 1 if R = r1 or r2 and S2 = 0 otherwise.
> > > S3 = 1 if R = r1 or r2 or r3 and S3 = 0 otherwise.
> >
> > Possible, but doesn't seem necessary. Why not leave R as it is instead of
> > constructing dichotomies?
>
> Because I'm focused on reducing the analysis to comparing the
> results of a series of 2 by 2 tables.
Yes, I'd rather thought so. Do you mind my saying that it looks from
here almost like a monomania? ;-)
> The goal will be to have the computer analyse the 2 by 2 tables
> and give summary results to the researcher.
The way you phrase this puts me in mind of a saying of a Canadian
colleague of mine, with respect to some computer-generated statistical
analyses:
"untouched by the human mind".
< snip >
> > (You defined a "Col pct" as t/D, which is not a % but a proportion.)
>
> Right. As a mathematician I completely ignore the distinction
> between % and proportion.
Hmph. I _hope_ you don't ignore the question of where one places
the decimal point.
< snip, details of chi-square formula >
> > > The significance number that is calculated for a statistic is the
> > > predicted probability that the null hypothesis is true.
>
> > I very much doubt it. This certainly does not conform to the standard
> > statistical definition of "level of significance", which I assume to be
> > what you want us to understand by "significance number".
>
> I would like to discuss this point more. Why does it not conform to
> the standard statistical definition of level of significance?
Read on ...
> We use the p-value calculated for a statistic by comparing it to a
> preset alpha. Alpha is the probability of a type one error.
Some there are who argue against using fixed, preset alpha.
But you correctly describe the process.
> A type one error is rejecting the null hypothesis when it is true.
O.K. so far.
> It seemed intuitive to me that we had to say p was the predicted
> probability of the null hypothesis being true,
We don't _have_ to say anything at all; but what we do say
should be either true, or logically arguable. This is neither.
> and that we would reject the null
> hypothesis if this predicted probability fell below alpha.
This is correct: if for "this predicted probability", meaning
"Pr{null hypothesis is true}" you substitute
"observed Pr{Type I error}" or, equivalently,
"Pr{departure this large or larger from H_null | H_null true}".
> Does this make sense to you, or am I falling into some standard
> trap of beginning statistics students?
The latter, I'm afraid.
The purpose in invoking a probability at all is to be able to put some
kind of bound on the probability that the decision one reports is wrong.
Oversimplifying, as usual, there are two ways one can be wrong: in
rejecting a null hypothesis that is true, and in accepting a null
hypothesis that is not true. Sometimes this is represented schematically
thus:
                         True state of the universe ("known but to God"):
  Investigator's         Hypothesis true         Hypothesis false
  decision:
    Reject hypothesis    Error I                 Correct decision
    Accept hypothesis    Correct decision        Error II
Some folks insist on "Fail to reject" instead of "Accept". This is
proper when (as is often the case) one cannot specify a probability
distribution associated with "Hypothesis false". One function of the
specification of a null hypothesis is to be able to specify a sampling
distribution for the statistic being observed in the state "Hypothesis
true". This then makes it possible to describe (estimate) the
probability that one's data are consistent with the null hypothesis, and
to reject the null if that probability is persuasively low (viz., less
than alpha). But this probability is NOT "the predicted probability
that the null hypothesis is true" -- it is the CONDITIONAL probability
of observing data like this, or more distant from the value specified in
the null hypothesis than this, IF the null hypothesis be true.
No conditional probability can describe the probability of its own
condition.
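(If a demonstration would help: generate many 2x2 tables for which the
null hypothesis of independence is true by construction, and the
chi-square p-value falls below alpha in about alpha of them -- which is
what "Pr{Type I error} = alpha" means. A Python sketch, assuming numpy
and scipy are available:
    import numpy as np
    from scipy.stats import chi2_contingency
    rng = np.random.default_rng(0)
    n_subjects, n_tables, alpha = 200, 2000, 0.05
    rejections = 0
    for _ in range(n_tables):
        # Two dichotomies generated independently, so H_null is true.
        x = rng.integers(0, 2, n_subjects)
        y = rng.integers(0, 2, n_subjects)
        table = [[np.sum((x == i) & (y == j)) for j in (0, 1)]
                 for i in (0, 1)]
        p = chi2_contingency(table, correction=False)[1]
        rejections += (p < alpha)
    print(rejections / n_tables)   # close to alpha, i.e. about 0.05
Nothing in that computation refers to, or could produce, the probability
that the null hypothesis is true; that would require a prior probability
for the hypothesis, which the significance test neither uses nor
supplies.)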
< snip, the rest, most of which is based on the logical
error(s?) addressed above >
It's 2:15 and past my bedtime.
-- DFB.
------------------------------------------------------------------------
Donald F. Burrill [EMAIL PROTECTED]
348 Hyde Hall, Plymouth State College, [EMAIL PROTECTED]
MSC #29, Plymouth, NH 03264 603-535-2597
184 Nashua Road, Bedford, NH 03110 603-471-7128